
Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (07): 1191-1198.

• High Performance Computing •

A lightweight collective communication library for distributed deep learning

WANG Xiao-yu, DONG De-zun

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
  • Received: 2021-12-20 Revised: 2022-03-03 Accepted: 2022-07-25 Online: 2022-07-25 Published: 2022-07-25

Abstract: Collective communication operations are widely used in distributed training, especially the AllReduce operation, which synchronizes model parameters across nodes. To obtain higher accuracy, datasets and neural network models keep growing in scale, so the communication overhead between nodes accounts for a large share of training time and becomes a bottleneck for accelerating training. Many optimizations of collective operations have been proposed for this scenario, such as communication scheduling and gradient quantization, but they typically focus on using the existing operations more efficiently rather than on the operations themselves. In fact, there are mismatches between collective operations and distributed training applications: for example, the latter does not require all nodes to synchronize gradients simultaneously, while the former does. This makes research on collective communication in distributed training necessary. However, the communication frameworks currently used in distributed training are ill-suited to such research because of their complex architectures and large code bases. To overcome this difficulty, a lightweight collective communication library is designed and implemented so that the collective operations used in distributed training can be analyzed and improved conveniently. It supports the mainstream deep learning frameworks and has a clean architecture, which enables researchers to implement custom communication operations efficiently and to apply these operations in mainstream experimental environments for wider impact. The library is evaluated with both pure collective operations and distributed deep learning applications under various network conditions. The experiments show that it achieves performance comparable to MPI and can serve as a collective communication library for analyzing and studying gradient synchronization in distributed training.
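As a concrete illustration of the AllReduce-based gradient synchronization discussed in the abstract, the following minimal sketch (not the library presented in this paper) performs an averaging AllReduce with PyTorch's torch.distributed API and the Gloo backend; the process-group environment (rank, world size, master address) is assumed to be provided by a launcher such as torchrun.

# Minimal sketch of gradient synchronization via AllReduce, assuming
# PyTorch's torch.distributed with the Gloo backend and a torchrun launcher.
import torch
import torch.distributed as dist

def main():
    # Join the process group; rank/world size come from the launcher's env vars.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Stand-in for a gradient tensor computed locally on this worker.
    grad = torch.full((4,), float(rank))

    # AllReduce sums the gradients from all workers in place ...
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    # ... and dividing by the number of workers yields the synchronized
    # (averaged) gradient used for the parameter update.
    grad /= world_size

    print(f"rank {rank}: synchronized gradient {grad.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with, for example, torchrun --nproc_per_node=2, every rank prints the same averaged vector, which is exactly the synchronization step that the optimizations surveyed above target.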

Key words: distributed deep learning, neural network, collective communication, Gloo, Unified Communication X (UCX)