Computer Engineering & Science, 2021, Vol. 43, Issue (03): 416-425.

• High Performance Computing •

Performance analysis of distributed deep learning communication architecture

ZHANG Li-zhi, RAN Zhe-jiang, LAI Zhi-quan, LIU Feng

  1. (National Key Laboratory of Parallel and Distributed Processing, College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
  • Received: 2020-06-11  Revised: 2020-07-12  Accepted: 2021-03-25  Online: 2021-03-25  Published: 2021-03-26
  • Supported by: National Key Research and Development Program of China (2018YFB0204301); National Natural Science Foundation of China (61702533)

Abstract: In recent years, advances in deep learning have pushed artificial intelligence into a new period of development. However, massive training data and very large models pose increasingly severe challenges, and distributed deep learning has emerged as an effective way to meet them; an efficient parameter communication architecture is the key to its performance. Addressing the problem of parallel training with traditional model synchronization architectures on large numbers of nodes, this paper first analyzes the principles and performance of the two mainstream parameter communication architectures: the centralized Parameter Server and the decentralized Ring Allreduce. It then builds a comparative test environment for the two distributed training architectures on the Tianhe high-performance GPU cluster, based on TensorFlow. Finally, taking the Parameter Server architecture as the baseline, it evaluates the performance of the Ring Allreduce architecture when training AlexNet and ResNet-50 on the GPU cluster. The experimental results show that, with 32 GPUs, the scaling efficiency of the Ring Allreduce architecture reaches 97% and its distributed computing performance is 30% higher than that of the Parameter Server architecture, verifying that the Ring Allreduce architecture has better scalability.
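To make the decentralized communication pattern described in the abstract concrete, below is a minimal NumPy sketch of the Ring Allreduce algorithm (a scatter-reduce phase followed by an allgather phase) run over a few simulated workers. It is an illustration only, not the authors' TensorFlow test code from the Tianhe cluster; the worker count, gradient size, and the helper name ring_allreduce are assumptions made for the example.

import numpy as np

def ring_allreduce(grads):
    """Sum equally sized gradient vectors (one per simulated worker) using the
    two-phase ring algorithm: scatter-reduce, then allgather."""
    n = len(grads)                      # workers arranged in a logical ring
    chunks = [np.array_split(g, n) for g in grads]

    # Phase 1: scatter-reduce. In step s, worker i sends chunk (i - s) mod n to
    # worker (i + 1) mod n, which adds it to its local copy of that chunk.
    for s in range(n - 1):
        for i in range(n):
            c, dst = (i - s) % n, (i + 1) % n
            chunks[dst][c] = chunks[dst][c] + chunks[i][c]

    # After phase 1, worker i holds the fully reduced chunk (i + 1) mod n.
    # Phase 2: allgather. The reduced chunks are circulated around the ring so
    # that every worker ends up holding every reduced chunk.
    for s in range(n - 1):
        for i in range(n):
            c, dst = (i + 1 - s) % n, (i + 1) % n
            chunks[dst][c] = chunks[i][c]

    return [np.concatenate(ch) for ch in chunks]

if __name__ == "__main__":
    n_workers, grad_size = 4, 12        # illustrative sizes only
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(grad_size) for _ in range(n_workers)]

    result = ring_allreduce(grads)
    expected = np.sum(grads, axis=0)
    assert all(np.allclose(r, expected) for r in result)

    # Each worker transmits 2 * (n - 1) / n of its gradient per allreduce,
    # independent of n, instead of funneling whole gradients through
    # central parameter servers.
    print("per-worker traffic fraction:", 2 * (n_workers - 1) / n_workers)

Every simulated worker finishes with the full gradient sum while transmitting only 2(N-1)/N of its gradient per allreduce, independent of the number of workers N; avoiding the traffic concentration at central parameter servers is what underlies the near-linear scaling reported above (scaling efficiency is conventionally measured as throughput on N GPUs divided by N times the single-GPU throughput).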



Key words: Ring Allreduce, parameter server, distributed training, deep learning, deep neural network