[1] Chen T Q, Li M, Li Y T, et al. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems[J]. arXiv:1512.01274, 2015.
[2] Abadi M, Barham P, Chen J, et al. TensorFlow: A system for large-scale machine learning[C]∥Proc of the 12th USENIX Symposium on Operating Systems Design and Implementation, 2016: 265-283.
[3] Paszke A, Gross S, Massa F, et al. PyTorch: An imperative style, high-performance deep learning library[C]∥Proc of the Annual Conference on Neural Information Processing Systems, 2019: 8024-8035.
[4] Peng Y H, Zhu Y B, Chen Y R, et al. A generic communication scheduler for distributed DNN training acceleration[C]∥Proc of the 27th ACM Symposium on Operating Systems Principles, 2019: 16-29.
[5] Hashemi S H, Jyothi S A, Godfrey B, et al. Caramel: Accelerating decentralized distributed deep learning with computation scheduling[J]. arXiv:2004.14020, 2020.
[6] Seide F, Fu H, Droppo J, et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs[C]∥Proc of the 15th Annual Conference of the International Speech Communication Association, 2014: 1058-1062.
[7] Alistarh D, Grubic D, Li J, et al. QSGD: Communication-efficient SGD via gradient quantization and encoding[J]. arXiv:1610.02132, 2016.
[8] Chen C Y, Choi J, Brand D, et al. AdaComp: Adaptive residual gradient compression for data-parallel distributed training[C]∥Proc of the 32nd AAAI Conference on Artificial Intelligence, 2018: 2827-2835.
[9] Li S G, Ben-Nun T, Girolamo S D, et al. Taming unbalanced training workloads in deep learning with partial collective operations[C]∥Proc of the 25th ACM Symposium on Principles and Practice of Parallel Programming, 2020: 45-61.
[10] Bao Y X, Peng Y H, Chen Y R, et al. Preemptive all-reduce scheduling for expediting distributed DNN training[C]∥Proc of the 39th IEEE Conference on Computer Communications, 2020: 626-635.
[11] Liu S, Wang Q L, Zhang J Y, et al. NetReduce: RDMA-compatible in-network reduction for distributed DNN training acceleration[J]. arXiv:2009.09736, 2020.
[12] Nguyen T T, Wahib M, Takano R. Topology-aware sparse allreduce for large-scale deep learning[C]∥Proc of the 38th International Performance Computing and Communications Conference, 2019: 1-8.
[13] Sapio A, Canini M, Ho C Y, et al. Scaling distributed machine learning with in-network aggregation[C]∥Proc of the 18th USENIX Symposium on Networked Systems Design and Implementation, 2021: 785-808.
[14] Li M F, Wen K, Lin H, et al. Improving the performance of distributed MXNet with RDMA[J]. International Journal of Parallel Programming, 2019, 47(3): 467-480.
[15] Jia C F, Liu J N, Jin X, et al. Improving the performance of distributed TensorFlow with RDMA[J]. International Journal of Parallel Programming, 2018, 46(4): 674-685.
[16] Wang S T, Li D, Cheng Y, et al. BML: A high-performance, low-cost gradient synchronization algorithm for DML training[C]∥Proc of the 32nd International Conference on Neural Information Processing Systems, 2018: 4243-4253.
[17] Sergeev A, Balso M D. Horovod: Fast and easy distributed deep learning in TensorFlow[J]. arXiv:1802.05799, 2018.
[18] Gloo[EB/OL]. [2021-08-12]. https://github.com/facebookincubator/gloo.
[19] Shamis P, Venkata M G, Lopez M G, et al. UCX: An open source framework for HPC network APIs and beyond[C]∥Proc of the 23rd IEEE Annual Symposium on High-Performance Interconnects, 2015: 40-43.