Computer Engineering & Science, 2021, Vol. 43, Issue (03): 416-425.

• High Performance Computing •

Performance analysis of distributed deep learning communication architecture

ZHANG Li-zhi, RAN Zhe-jiang, LAI Zhi-quan, LIU Feng

  1. (National Key Laboratory of Parallel and Distributed Processing, College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
  • Received: 2020-06-11  Revised: 2020-07-12  Accepted: 2021-03-25  Online: 2021-03-25  Published: 2021-03-26
  • Supported by: National Key Research and Development Program of China (2018YFB0204301); National Natural Science Foundation of China (61702533)

Abstract: In recent years, advances in deep learning have pushed artificial intelligence into a new period of development. However, massive training data and very large models pose increasingly severe challenges, and distributed deep learning has emerged as an effective way to meet them; an efficient parameter communication architecture is the key to its performance. Addressing the problem of parallel training with traditional model synchronization architectures on large numbers of nodes, this paper first analyzes the principles and performance of the two mainstream parameter communication architectures: the centralized Parameter Server and the decentralized Ring Allreduce. It then builds a comparative test environment for the two distributed training architectures on the Tianhe high-performance GPU cluster, based on TensorFlow. Finally, taking the Parameter Server architecture as the baseline, it evaluates the performance of the Ring Allreduce architecture when training AlexNet and ResNet-50 on the GPU cluster. The experimental results show that, with 32 GPUs, the scaling efficiency of the Ring Allreduce architecture reaches 97% and its distributed computing performance is 30% higher than that of the Parameter Server architecture, verifying that the Ring Allreduce architecture has better scalability.
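To make the decentralized communication pattern described in the abstract concrete, below is a minimal NumPy sketch of the Ring Allreduce algorithm (a scatter-reduce phase followed by an allgather phase) run over a few simulated workers. It is an illustration only, not the authors' TensorFlow test code from the Tianhe cluster; the worker count, gradient size, and the helper name ring_allreduce are assumptions made for the example.

import numpy as np

def ring_allreduce(grads):
    """Sum equally sized gradient vectors (one per simulated worker) using the
    two-phase ring algorithm: scatter-reduce, then allgather."""
    n = len(grads)                      # workers arranged in a logical ring
    chunks = [np.array_split(g, n) for g in grads]

    # Phase 1: scatter-reduce. In step s, worker i sends chunk (i - s) mod n to
    # worker (i + 1) mod n, which adds it to its local copy of that chunk.
    for s in range(n - 1):
        for i in range(n):
            c, dst = (i - s) % n, (i + 1) % n
            chunks[dst][c] = chunks[dst][c] + chunks[i][c]

    # After phase 1, worker i holds the fully reduced chunk (i + 1) mod n.
    # Phase 2: allgather. The reduced chunks are circulated around the ring so
    # that every worker ends up holding every reduced chunk.
    for s in range(n - 1):
        for i in range(n):
            c, dst = (i + 1 - s) % n, (i + 1) % n
            chunks[dst][c] = chunks[i][c]

    return [np.concatenate(ch) for ch in chunks]

if __name__ == "__main__":
    n_workers, grad_size = 4, 12        # illustrative sizes only
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(grad_size) for _ in range(n_workers)]

    result = ring_allreduce(grads)
    expected = np.sum(grads, axis=0)
    assert all(np.allclose(r, expected) for r in result)

    # Each worker transmits 2 * (n - 1) / n of its gradient per allreduce,
    # independent of n, instead of funneling whole gradients through
    # central parameter servers.
    print("per-worker traffic fraction:", 2 * (n_workers - 1) / n_workers)

Every simulated worker finishes with the full gradient sum while transmitting only 2(N-1)/N of its gradient per allreduce, independent of the number of workers N; avoiding the traffic concentration at central parameter servers is what underlies the near-linear scaling reported above (scaling efficiency is conventionally measured as throughput on N GPUs divided by N times the single-GPU throughput).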



Key words: Ring Allreduce, parameter server, distributed training, deep learning, deep neural network