Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (5): 782-791.
WEI Jia,ZHANG Xing-jun,JI Ze-yu,LI Jing-bo,YUE Ying-ying
Abstract: Deep neural networks (DNNs) are an important branch of artificial neural networks (ANNs) and the foundation of deep learning. In recent years, gains in computing power and advances in high-performance computing have made it feasible to increase network depth and model complexity, which improves feature extraction and data fitting. As a result, DNNs have shown clear advantages in natural language processing, autonomous driving, face recognition, and other tasks. However, large datasets and complex models have greatly increased the cost of training DNNs, so accelerating training has become a key task whose technical scope ranges from the design of the underlying circuits to the design of distributed algorithms. China's domestically developed Tianhe-3 supercomputer targets a peak speed of one quintillion (10^18) operations per second, and this computing power offers a potential opportunity for DNN training. Based on the ARM architecture of the Tianhe-3 prototype, and using the PyTorch framework with MPI, this paper conducts custom-designed CNN training experiments on a single FT-2000+ computing node, a single MT-2000+ computing node, and multi-node clusters built from them. The performance of these processors in distributed neural network training is optimized and evaluated, providing experimental data and a theoretical basis for further improving the performance of the Tianhe-3 prototype system in distributed neural network training.
Key words: Tianhe-3 prototype, deep learning, distributed training, performance evaluation, data parallelism
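For readers unfamiliar with the data parallelism named in the keywords, the following is a minimal sketch of synchronous data-parallel gradient descent. It is an illustration under stated assumptions, not the paper's implementation: the workers are simulated in a single Python process, the helper `allreduce_mean` is a hypothetical stand-in for an MPI all-reduce, and a one-parameter linear model replaces the paper's CNN. All function names are invented for this example.

```python
# Simulated synchronous data-parallel SGD on a 1-D linear model y = w * x.
# Hypothetical illustration: workers are plain Python loops and
# allreduce_mean stands in for an MPI all-reduce. Not taken from the paper.

def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)^2 over one worker's data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def split_evenly(data, num_workers):
    """Split data into equal-sized contiguous shards, one per worker."""
    assert len(data) % num_workers == 0, "data must divide evenly"
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]

def train(data, num_workers, lr=0.05, steps=200):
    """Each worker holds a model replica and a shard; gradients are averaged."""
    shards = split_evenly(data, num_workers)
    weights = [0.0] * num_workers  # identical starting replicas
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        synced = allreduce_mean(grads)  # synchronization point
        weights = [w - lr * g for w, g in zip(weights, synced)]
    return weights

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # targets follow y = 3x
replicas = train(data, num_workers=4)
single = train(data, num_workers=1)
```

Because the shards are equal in size, the average of the per-worker gradients equals the full-batch gradient, so the four replicas stay identical and follow the same trajectory as single-worker training. In a real cluster the averaging step would be a collective communication call, which is typically where the scaling cost appears.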
WEI Jia, ZHANG Xing-jun, JI Ze-yu, LI Jing-bo, YUE Ying-ying. Performance evaluation and optimization of distributed and parallel deep neural network on the Tianhe-3 prototype system[J]. Computer Engineering & Science, 2021, 43(5): 782-791.
URL: http://joces.nudt.edu.cn/EN/
http://joces.nudt.edu.cn/EN/Y2021/V43/I5/782