
Computer Engineering & Science, 2024, Vol. 46, Issue 01: 28-36.

• High Performance Computing •

Gloo+: Accelerating distributed training of deep learning using in-network computing

HUANG Ze-biao, DONG De-zun, QI Xing-yun

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
  • Received: 2022-10-27  Revised: 2023-03-20  Accepted: 2024-01-25  Online: 2024-01-25  Published: 2024-01-15
  • Supported by:
    National Key R&D Program of China (2022YFB4501702)

Abstract: In distributed deep learning training, collective communication is the dominant communication pattern. Research on optimizing collective communication includes both software-level and hardware-level approaches. SHARP, a collective communication network offload protocol proposed by Mellanox, is a hardware-level optimization: it offloads collective operations to switches in the network, thereby shortening collective communication time. Building on Gloo, we integrated SHARP and designed and implemented Gloo+, a collective communication library that uses in-network computing to accelerate distributed deep learning training. We evaluated and compared the performance of collective operations in Gloo+, Gloo, and MPI, and applied Gloo+ to distributed deep learning training to test its practical effectiveness. Our experimental evaluation shows that, in benchmarks with small message sizes, Gloo+ achieves a speedup of over 100 relative to Gloo, over 50 relative to MPI in Ethernet mode, and within 10 relative to MPI in InfiniBand mode. In the practical application of distributed deep learning training, Gloo+ achieves a speedup of up to 1.1 over Gloo, up to 1.3 over MPI in Ethernet mode, and up to 0.5 over MPI in InfiniBand mode.

Key words: distributed deep learning, collective communication, in-network computing, Gloo, SHARP
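
For context, the sketch below shows the kind of collective operation Gloo+ targets: an allreduce over gradient-like tensors, run here with the stock Gloo backend through PyTorch's torch.distributed. This is a minimal illustration of the baseline under the assumption of a standard PyTorch installation, not the paper's implementation; Gloo+ and its SHARP offload path are the authors' work and are not assumed to be publicly available.

# Minimal, hedged sketch: the allreduce collective that dominates data-parallel
# deep learning training, run on the stock Gloo backend via torch.distributed.
# Gloo+ (the paper's library) integrates SHARP so that this aggregation happens
# in the network switches; that offload path is not shown here. Launch with:
#   torchrun --nproc_per_node=4 allreduce_demo.py
import torch
import torch.distributed as dist


def main():
    # torchrun supplies RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank contributes a gradient-like tensor; all_reduce sums the
    # tensors across all ranks in place, the hot loop of data-parallel SGD.
    grad = torch.full((1024,), float(rank))
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    # Every rank now holds the same aggregated values: 0 + 1 + ... + (n-1).
    expected = float(sum(range(world_size)))
    assert torch.allclose(grad, torch.full((1024,), expected))
    if rank == 0:
        print(f"allreduce ok across {world_size} ranks, sum = {expected}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

With SHARP-capable switches, the same all_reduce call pattern would be aggregated in the network fabric rather than on the hosts, which is where the large speedups reported for small messages come from.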