
Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (07): 1191-1198.

• High Performance Computing •

A lightweight collective communication library for distributed deep learning

WANG Xiao-yu, DONG De-zun

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
  • Received: 2021-12-20 Revised: 2022-03-03 Accepted: 2022-07-25 Online: 2022-07-25 Published: 2022-07-25

Abstract: Collective communication operations are widely used in distributed training, especially the AllReduce operation, which synchronizes model parameters across nodes. To obtain higher accuracy, datasets and neural network models keep growing in scale, so the communication overhead between nodes accounts for a large share of training time and becomes a bottleneck for accelerating training. Many optimizations of collective operations have been proposed for this scenario, such as communication scheduling and gradient quantization, but they typically focus on using the existing operations more efficiently rather than on the operations themselves. In fact, there are mismatches between collective operations and distributed training applications: for example, the latter does not require all nodes to synchronize gradients simultaneously, while the former does. This makes research on collective communication in distributed training necessary. However, the communication frameworks currently used in distributed training are ill-suited to such research because of their complex architectures and large code bases. To overcome this difficulty, a lightweight collective communication library is designed and implemented so that the collective operations used in distributed training can be analyzed and improved conveniently. It supports the mainstream deep learning frameworks and has a clean architecture, which enables researchers to implement custom communication operations efficiently and to apply these operations in mainstream experimental environments for wider impact. The library is evaluated with both pure collective operations and distributed deep learning applications under various network conditions. The experiments show that it achieves performance comparable to MPI and can serve as a collective communication library for analyzing and studying gradient synchronization in distributed training.
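As a concrete illustration of the AllReduce-based gradient synchronization discussed in the abstract, the following minimal sketch (not the library presented in this paper) performs an averaging AllReduce with PyTorch's torch.distributed API and the Gloo backend; the process-group environment (rank, world size, master address) is assumed to be provided by a launcher such as torchrun.

# Minimal sketch of gradient synchronization via AllReduce, assuming
# PyTorch's torch.distributed with the Gloo backend and a torchrun launcher.
import torch
import torch.distributed as dist

def main():
    # Join the process group; rank/world size come from the launcher's env vars.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Stand-in for a gradient tensor computed locally on this worker.
    grad = torch.full((4,), float(rank))

    # AllReduce sums the gradients from all workers in place ...
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    # ... and dividing by the number of workers yields the synchronized
    # (averaged) gradient used for the parameter update.
    grad /= world_size

    print(f"rank {rank}: synchronized gradient {grad.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with, for example, torchrun --nproc_per_node=2, every rank prints the same averaged vector, which is exactly the synchronization step that the optimizations surveyed above target.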

Key words: distributed deep learning, neural network, collective communication, Gloo, Unified Communication X (UCX)