• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (05): 782-791.

• High Performance Computing •

  • Funding:
    National Key Research and Development Program of China (2016YFB0200902)

Performance evaluation and optimization of distributed and parallel deep neural network on the Tianhe-3 prototype system

WEI Jia, ZHANG Xing-jun, JI Ze-yu, LI Jing-bo, YUE Ying-ying

  1. (School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710127, China)

  • Received: 2020-12-08 Revised: 2021-02-04 Accepted: 2021-05-25 Online: 2021-05-25 Published: 2021-05-19

Abstract: The deep neural network (DNN) model is an important branch of the artificial neural network (ANN) model and the foundation of deep learning. In recent years, gains in computing power and advances in high-performance computing have made it possible to improve the feature extraction and data fitting capabilities of DNNs by increasing network depth and model complexity. As a result, DNNs have shown advantages in natural language processing, autonomous driving, face recognition, and other problems. However, massive datasets and complex models have greatly increased the training cost of deep neural networks, so accelerating training has become a key task, with techniques ranging from low-level circuit design to distributed algorithm design. The domestically developed Tianhe-3 is designed for a peak speed at the exascale level (one quintillion operations per second), and this computing power offers a potential opportunity for DNN training. Based on the characteristics of the ARM architecture of the Tianhe-3 prototype, and using the PyTorch framework and MPI, this paper designs CNN training strategies for a single FT-2000+ computing node, a single MT-2000+ computing node, and multi-node clusters built by scaling out from them. It evaluates and optimizes the performance of these processors in distributed neural network training, providing experimental data and a theoretical basis for further improving the Tianhe-3 prototype system in large-scale distributed neural network training.



Key words: Tianhe-3 prototype; deep learning; distributed training; performance evaluation; data parallelism
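
As a rough illustration of the data parallelism named in the keywords, the sketch below simulates synchronous data-parallel training in a single process. It is not the paper's method and uses neither PyTorch nor MPI: the "workers" are plain list shards, the model is a one-variable linear fit rather than a CNN, and `allreduce_mean` is a hypothetical stand-in for the gradient-averaging step that an MPI allreduce would perform across nodes. All function names and hyperparameters here are assumptions made for the example.

```python
# Illustrative only: single-process simulation of synchronous data-parallel SGD.
# Each "worker" holds one shard of the data and computes a local gradient;
# the gradients are then averaged, as an allreduce would do across real nodes.

def local_gradient(w, b, shard):
    """Gradient of mean squared error for y ~ w*x + b over one worker's shard."""
    n = len(shard)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in shard) / n
    return gw, gb

def allreduce_mean(grads):
    """Average per-worker gradients (hypothetical stand-in for an MPI allreduce)."""
    k = len(grads)
    return tuple(sum(g[i] for g in grads) / k for i in range(len(grads[0])))

def train(data, num_workers=4, lr=0.05, steps=2000):
    # Split the dataset into equal-sized shards, one per simulated worker.
    shards = [data[r::num_workers] for r in range(num_workers)]
    w, b = 0.0, 0.0  # every worker keeps an identical replica of the model
    for _ in range(steps):
        grads = [local_gradient(w, b, s) for s in shards]  # parallel in practice
        gw, gb = allreduce_mean(grads)                     # synchronize gradients
        w -= lr * gw                                       # identical update everywhere
        b -= lr * gb
    return w, b

# Synthetic data drawn from y = 3x + 1 (40 points, so 10 per worker).
data = [(x / 10.0, 3.0 * (x / 10.0) + 1.0) for x in range(40)]
w, b = train(data)
```

Because the shards are equal in size, the averaged shard gradients equal the full-batch gradient, so the simulated workers follow the same trajectory as single-node training. In a real multi-node run the averaging step is where communication cost appears.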