• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science, 2024, Vol. 46, Issue (01): 28-36.

• High Performance Computing •

Gloo+: Accelerating distributed training of deep learning using in-network computing

HUANG Ze-biao, DONG De-zun, QI Xing-yun

1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
• Received: 2022-10-27; Revised: 2023-03-20; Accepted: 2024-01-25; Online: 2024-01-25; Published: 2024-01-15

Abstract: Collective communication is the dominant communication pattern in distributed deep learning training, and work on optimizing it falls into software-level and hardware-level approaches. SHARP, a collective communication network offload protocol proposed by Mellanox, takes the hardware route: it offloads collective operations to the switches in the network, thereby shortening collective communication time. We integrated SHARP into Gloo and designed and implemented Gloo+, a collective communication library that accelerates distributed deep learning training using in-network computing. Our experimental evaluation of Gloo+ shows that, in benchmarks with small message sizes, Gloo+ achieves a speedup of up to 100× or more over Gloo, up to 50× or more over MPI in Ethernet mode, and within 10× over MPI in IB mode. In practical distributed deep learning training, Gloo+ achieves a maximum speedup of 1.1× over Gloo, 1.3× over MPI in Ethernet mode, and 0.5× over MPI in IB mode.
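To make the dispatch idea behind Gloo+ concrete, the sketch below shows, in C++ (the language Gloo is written in), how an allreduce might choose between a SHARP-style in-network offload path and a host-side software fallback. This is a minimal, self-contained illustration only: all names here (sharpAllreduce, ringAllreduce, the 64 KiB threshold) are hypothetical stand-ins chosen for this sketch, not the actual Gloo+, Gloo, or SHARP APIs.

// Conceptual sketch: dispatching allreduce either to an in-network
// (SHARP-style) offload path or to a host-side software reduction.
// All types and functions are hypothetical stand-ins for illustration.
#include <cstddef>
#include <iostream>
#include <vector>

// Stub for a SHARP-style in-network reduction: in a real deployment the
// switches aggregate partial results from all ranks; here we only simulate
// success so the example is self-contained and runnable.
bool sharpAllreduce(std::vector<float>& buf) {
  (void)buf;      // the fabric would reduce the data and write it back
  return true;    // return false to emulate "offload unavailable"
}

// Stub for the host-side fallback (e.g., a software ring allreduce).
void ringAllreduce(std::vector<float>& buf) {
  (void)buf;      // host-side reduction over the network would happen here
}

// Gloo+-style dispatch: offload small messages to the network, where the
// paper reports the largest speedups; fall back to software otherwise.
void allreduce(std::vector<float>& buf, std::size_t offloadThresholdBytes) {
  const std::size_t bytes = buf.size() * sizeof(float);
  if (bytes <= offloadThresholdBytes && sharpAllreduce(buf)) {
    return;               // reduced in-network by the switches
  }
  ringAllreduce(buf);     // host-side path for large messages
}

int main() {
  std::vector<float> grads(1024, 1.0f);  // e.g., a small gradient tensor
  allreduce(grads, 64 * 1024);           // 64 KiB threshold (illustrative)
  std::cout << "grads[0] = " << grads[0] << "\n";
}

The threshold-based fallback reflects the abstract's observation that in-network offload pays off most for small messages; the actual criteria used by Gloo+ are described in the paper itself.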

Key words: distributed deep learning, collective communication, in-network computing, Gloo, SHARP