[1] Ouyang S, Dong D Z, Xu Y M, et al. Communication optimization strategies for distributed deep neural network training: A survey[J]. Journal of Parallel and Distributed Computing, 2021, 149: 52-65.
[2] Jia X Y, Song S T, He W, et al. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes[J]. arXiv:1807.11205, 2018.
[3] Sapio A, Canini M, Ho C Y, et al. Scaling distributed machine learning with in-network aggregation[C]∥Proc of the 18th USENIX Symposium on Networked Systems Design and Implementation, 2021: 785-808.
[4] Graham R L, Bureddy D, Lui P, et al. Scalable hierarchical aggregation protocol (SHArP): A hardware architecture for efficient data reduction[C]∥Proc of the 1st Workshop on Optimization of Communication in HPC, 2016: 1-10.
[5] Almási G, Dózsa G, Erway C C, et al. Efficient implementation of allreduce on BlueGene/L collective network[C]∥Proc of the 12th European PVM/MPI Users’ Group Conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2005: 57-66.
[6] Hemmert K S, Barrett B, Underwood K D. Using triggered operations to offload collective communication operations[C]∥Proc of the 17th European MPI Users’ Group Meeting Conference on Recent Advances in the Message Passing Interface, 2010: 249-256.
[7] Li Y J, Liu I J, Yuan Y F, et al. Accelerating distributed reinforcement learning with in-switch computing[C]∥Proc of the 46th ACM/IEEE International Symposium on Computer Architecture, 2019: 279-291.
[8] Wagner A, Jin H W, Panda D K, et al. NIC-based offload of dynamic user-defined modules for Myrinet clusters[C]∥Proc of the 2004 IEEE International Conference on Cluster Computing, 2004: 205-214.
[9] Facebook. Gloo[EB/OL]. [2017-12-13]. https://github.com/facebookincubator/gloo.
[10] MPI[EB/OL]. [1992-10-10]. https://www.mpi-forum.org/.
[11] NVIDIA NCCL[EB/OL]. [2018-12-13]. https://developer.nvidia.com/nccl.
[12] Open MPI[EB/OL]. [2004-12-13]. https://www.open-mpi.org/.
[13] Gibiansky A. Bringing HPC techniques to deep learning[EB/OL]. [2022-05-16]. https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/.
[14] Cho M, Finkler U, Serrano M, et al. BlueConnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy[J]. IBM Journal of Research and Development, 2019, 63(6): 1:1-1:11.
[15] Mikami H, Suganuma H, U-Chupala P, et al. Massively distributed SGD: ImageNet/ResNet-50 training in a flash[J]. arXiv:1811.05233, 2018.
[16] Ying C, Kumar S, Chen D, et al. Image classification at supercomputer scale[J]. arXiv:1811.06992, 2018.
[17] Li S G, Ben-Nun T, Di Girolamo S, et al. Taming unbalanced training workloads in deep learning with partial collective operations[C]∥Proc of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020: 45-61.
[18] Miao X P, Nie X N, Shao Y X, et al. Heterogeneity-aware distributed machine learning training via partial reduce[C]∥Proc of the 2021 International Conference on Management of Data, 2021: 2262-2270.
[19] MPICH[EB/OL]. [2022-05-16]. https://www.mpich.org/.
[20] Chen T, Li M, Li Y, et al. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems[J]. arXiv:1512.01274, 2015.
[21] Abadi M, Agarwal A, Barham P, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems[J]. arXiv:1603.04467, 2016.
[22] Paszke A, Gross S, Massa F, et al. PyTorch: An imperative style, high-performance deep learning library[C]∥Proc of the 33rd International Conference on Neural Information Processing Systems, 2019: 8026-8037.