• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science, 2024, Vol. 46, Issue (01): 28-36.

• High Performance Computing •

Gloo+: Accelerating distributed training of deep learning using in-network computing

HUANG Ze-biao, DONG De-zun, QI Xing-yun

1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
• Received: 2022-10-27; Revised: 2023-03-20; Accepted: 2024-01-25; Online: 2024-01-25; Published: 2024-01-15

Abstract: Collective communication is the dominant communication pattern in distributed deep learning training, and work on optimizing it falls into software-level and hardware-level approaches. SHARP, a collective communication network offload protocol proposed by Mellanox, takes the hardware route: it offloads collective operations to the switches in the network, thereby shortening collective communication time. We integrated SHARP into Gloo and designed and implemented Gloo+, a collective communication library that accelerates distributed deep learning training using in-network computing. Our experimental evaluation of Gloo+ shows that, in benchmarks with small message sizes, Gloo+ achieves a speedup of up to 100× or more over Gloo, up to 50× or more over MPI in Ethernet mode, and within 10× over MPI in IB mode. In practical distributed deep learning training, Gloo+ achieves a maximum speedup of 1.1× over Gloo, 1.3× over MPI in Ethernet mode, and 0.5× over MPI in IB mode.
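To make the dispatch idea behind Gloo+ concrete, the sketch below shows, in C++ (the language Gloo is written in), how an allreduce might choose between a SHARP-style in-network offload path and a host-side software fallback. This is a minimal, self-contained illustration only: all names here (sharpAllreduce, ringAllreduce, the 64 KiB threshold) are hypothetical stand-ins chosen for this sketch, not the actual Gloo+, Gloo, or SHARP APIs.

// Conceptual sketch: dispatching allreduce either to an in-network
// (SHARP-style) offload path or to a host-side software reduction.
// All types and functions are hypothetical stand-ins for illustration.
#include <cstddef>
#include <iostream>
#include <vector>

// Stub for a SHARP-style in-network reduction: in a real deployment the
// switches aggregate partial results from all ranks; here we only simulate
// success so the example is self-contained and runnable.
bool sharpAllreduce(std::vector<float>& buf) {
  (void)buf;      // the fabric would reduce the data and write it back
  return true;    // return false to emulate "offload unavailable"
}

// Stub for the host-side fallback (e.g., a software ring allreduce).
void ringAllreduce(std::vector<float>& buf) {
  (void)buf;      // host-side reduction over the network would happen here
}

// Gloo+-style dispatch: offload small messages to the network, where the
// paper reports the largest speedups; fall back to software otherwise.
void allreduce(std::vector<float>& buf, std::size_t offloadThresholdBytes) {
  const std::size_t bytes = buf.size() * sizeof(float);
  if (bytes <= offloadThresholdBytes && sharpAllreduce(buf)) {
    return;               // reduced in-network by the switches
  }
  ringAllreduce(buf);     // host-side path for large messages
}

int main() {
  std::vector<float> grads(1024, 1.0f);  // e.g., a small gradient tensor
  allreduce(grads, 64 * 1024);           // 64 KiB threshold (illustrative)
  std::cout << "grads[0] = " << grads[0] << "\n";
}

The threshold-based fallback reflects the abstract's observation that in-network offload pays off most for small messages; the actual criteria used by Gloo+ are described in the paper itself.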

Key words: distributed deep learning, collective communication, in-network computing, Gloo, SHARP