• Journal of the China Computer Federation (CCF)
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (03): 416-425.


Performance analysis of distributed deep learning communication architecture

ZHANG Li-zhi, RAN Zhe-jiang, LAI Zhi-quan, LIU Feng

  1. (Parallel and Distributed Key Laboratory of National Defense Technology,
     College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)

  • Received: 2020-06-11  Revised: 2020-07-12  Accepted: 2021-03-25  Online: 2021-03-25  Published: 2021-03-26

Abstract: In recent years, advances in deep learning have pushed artificial intelligence into a new era of development. However, massive training data and large-scale models pose increasingly serious challenges to deep learning, and distributed deep learning is an effective way to meet them. An efficient synchronization algorithm is key to the performance of distributed deep learning. Addressing the poor scalability of traditional model-synchronization algorithms when training in parallel across large numbers of nodes, this paper first analyzes the principles and performance of two mainstream parameter communication architectures: the centralized Parameter Server and the decentralized Ring Allreduce. Second, a comparative test environment for the two distributed training frameworks is built with TensorFlow on a Tianhe high-performance GPU cluster. Finally, with the Parameter Server architecture as the baseline, the performance of the Ring Allreduce architecture is measured when training AlexNet and ResNet-50 in the GPU cluster environment. The experimental results show that, with 32 GPUs, the scaling efficiency of the Ring Allreduce architecture reaches 97%, and its distributed training performance is 30% higher than that of the Parameter Server architecture, verifying that Ring Allreduce has better scalability.
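To make the communication pattern concrete, the following is a minimal NumPy sketch, not taken from the paper, that simulates the two phases of Ring Allreduce (reduce-scatter followed by all-gather) across N logical workers. The function name ring_allreduce and the toy gradients are illustrative assumptions; a real deployment would use a framework such as Horovod or NCCL rather than this single-process simulation.

import numpy as np

def ring_allreduce(worker_grads):
    # Simulate Ring Allreduce over a list of per-worker gradient vectors.
    # Each gradient is split into n chunks (n = number of workers); the
    # algorithm takes 2*(n-1) communication steps in total.
    n = len(worker_grads)
    chunks = [list(np.array_split(g.astype(float), n)) for g in worker_grads]

    # Phase 1, reduce-scatter: in step s, worker r sends chunk (r - s) mod n
    # to its right neighbor, which adds it to its own copy. After n-1 steps,
    # worker r holds the fully summed chunk (r + 1) mod n.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            chunks[(r + 1) % n][c] = chunks[(r + 1) % n][c] + chunks[r][c]

    # Phase 2, all-gather: the completed chunks travel once around the ring.
    # In step s, worker r forwards chunk (r + 1 - s) mod n to its neighbor,
    # so after n-1 steps every worker holds all n summed chunks.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            chunks[(r + 1) % n][c] = chunks[r][c]

    return [np.concatenate(c) for c in chunks]

# Toy check: 4 workers, every worker should end with the elementwise sum.
grads = [np.arange(8.0) * (w + 1) for w in range(4)]
result = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in result)

The sketch also illustrates why the architecture scales: each worker sends and receives 2*(n-1)/n times the gradient size regardless of the number of workers, whereas in a Parameter Server architecture the aggregate traffic at the server grows linearly with the number of workers.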



Key words: Ring Allreduce, parameter server, distributed training, deep learning, deep neural network