[1] Chen T Q, Li M, Li Y T, et al. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems[J]. arXiv:1512.01274, 2015.
[2] Abadi M, Barham P, Chen J, et al. TensorFlow: A system for large-scale machine learning[C]∥Proc of the 12th USENIX Symposium on Operating Systems Design and Implementation, 2016: 265-283.
[3] Paszke A, Gross S, Massa F, et al. PyTorch: An imperative style, high-performance deep learning library[C]∥Proc of the Annual Conference on Neural Information Processing Systems, 2019: 8024-8035.
[4] Peng Y H, Zhu Y B, Chen Y R, et al. A generic communication scheduler for distributed DNN training acceleration[C]∥Proc of the 27th ACM Symposium on Operating Systems Principles, 2019: 16-29.
[5] Hashemi S H, Jyothi S A, Godfrey B, et al. Caramel: Accelerating decentralized distributed deep learning with computation scheduling[J]. arXiv:2004.14020, 2020.
[6] Seide F, Fu H, Droppo J, et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs[C]∥Proc of the 15th Annual Conference of the International Speech Communication Association, 2014: 1058-1062.
[7] Alistarh D, Grubic D, Li J, et al. QSGD: Communication-efficient SGD via gradient quantization and encoding[J]. arXiv:1610.02132, 2016.
[8] Chen C Y, Choi J, Brand D, et al. AdaComp: Adaptive residual gradient compression for data-parallel distributed training[C]∥Proc of the 32nd AAAI Conference on Artificial Intelligence, 2018: 2827-2835.
[9] Li S G, Ben-Nun T, Girolamo S D, et al. Taming unbalanced training workloads in deep learning with partial collective operations[C]∥Proc of the 25th ACM Symposium on Principles and Practice of Parallel Programming, 2020: 45-61.
[10] Bao Y X, Peng Y H, Chen Y R, et al. Preemptive all-reduce scheduling for expediting distributed DNN training[C]∥Proc of the 39th IEEE Conference on Computer Communications, 2020: 626-635.
[11] Liu S, Wang Q L, Zhang J Y, et al. NetReduce: RDMA-compatible in-network reduction for distributed DNN training acceleration[J]. arXiv:2009.09736, 2020.
[12] Nguyen T T, Wahib M, Takano R. Topology-aware sparse allreduce for large-scale deep learning[C]∥Proc of the 38th International Performance Computing and Communications Conference, 2019: 1-8.
[13] Sapio A, Canini M, Ho C Y, et al. Scaling distributed machine learning with in-network aggregation[C]∥Proc of the 18th USENIX Symposium on Networked Systems Design and Implementation, 2021: 785-808.
[14] Li M F, Wen K, Lin H, et al. Improving the performance of distributed MXNet with RDMA[J]. International Journal of Parallel Programming, 2019, 47(3): 467-480.
[15] Jia C F, Liu J N, Jin X, et al. Improving the performance of distributed TensorFlow with RDMA[J]. International Journal of Parallel Programming, 2018, 46(4): 674-685.
[16] Wang S T, Li D, Cheng Y, et al. BML: A high-performance, low-cost gradient synchronization algorithm for DML training[C]∥Proc of the 32nd International Conference on Neural Information Processing Systems, 2018: 4243-4253.
[17] Sergeev A, Balso M D. Horovod: Fast and easy distributed deep learning in TensorFlow[J]. arXiv:1802.05799, 2018.
[18] Gloo[EB/OL]. [2021-08-12]. https://github.com/facebookincubator/gloo.
[19] Shamis P, Venkata M G, Lopez M G, et al. UCX: An open source framework for HPC network APIs and beyond[C]∥Proc of the 23rd IEEE Annual Symposium on High-Performance Interconnects, 2015: 40-43.