• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (04): 580-589.

• High Performance Computing •

  • Funding:
    National Key Laboratory of Parallel and Distributed Processing Foundation (2021-KJWPDL-11)

Convolutional neural network inference and training vectorization method for multicore vector accelerators

CHEN Jie,LI Cheng,LIU Zhong   

  1. (College of Computer Science and Technology,National University of Defense Technology,Changsha 410073,China)
  • Received:2023-01-04 Revised:2023-05-08 Accepted:2024-04-25 Online:2024-04-25 Published:2024-04-17


Abstract: With the widespread application of deep learning, represented by convolutional neural networks (CNNs), the computational requirements of neural network models have grown rapidly, driving the development of deep learning accelerators. How to accelerate and optimize neural network models for the architectural characteristics of accelerator hardware has become a research focus. For the inference and training algorithms of the VGG network model on the independently designed multicore vector accelerator FT-M7004, vectorized mapping methods for core operators such as convolution, pooling, and fully connected layers are proposed. Optimization strategies including SIMD vectorization, DMA double-buffered transfers, and weight sharing are employed to fully exploit the architectural advantages of the vector accelerator, achieving high computational efficiency. Experimental results show that, on the FT-M7004 platform, the average computational efficiency of convolution-layer inference and training reaches 86.62% and 69.63%, respectively; for fully connected layers, inference and training reach 93.17% and 81.98%, respectively. The inference computational efficiency of the VGG network model on FT-M7004 exceeds that on the GPU platform by over 20%.
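To make the "vectorized mapping" idea concrete, the sketch below shows the standard im2col lowering, which rewrites a convolution as a dense matrix multiply so that SIMD/vector units can process whole patch rows at once. This is an illustrative example under assumed semantics (single-channel, valid cross-correlation), not the paper's actual FT-M7004 kernels.

```python
# Illustrative sketch (assumed, not the paper's FT-M7004 code): lowering a
# 2-D convolution to a matrix multiply via im2col, the classic mapping that
# exposes convolution to SIMD/vector hardware as dense GEMM rows.

def im2col(x, kh, kw):
    """Unroll each kh x kw patch of a 2-D input into one row of a matrix."""
    h, w = len(x), len(x[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            patch = [x[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            rows.append(patch)
    return rows

def conv2d_via_gemm(x, k):
    """Valid convolution (cross-correlation): patch matrix times kernel vector."""
    kh, kw = len(k), len(k[0])
    kvec = [k[di][dj] for di in range(kh) for dj in range(kw)]
    cols = im2col(x, kh, kw)
    out_w = len(x[0]) - kw + 1
    # Each dot product below is what a vector unit would execute as one
    # SIMD row of the GEMM.
    flat = [sum(a * b for a, b in zip(row, kvec)) for row in cols]
    return [flat[r * out_w:(r + 1) * out_w] for r in range(len(flat) // out_w)]

x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
k = [[1, 0],
     [0, 1]]
print(conv2d_via_gemm(x, k))  # → [[6, 8], [12, 14]]
```

On a real vector accelerator the patch matrix would be tiled into on-chip memory and the dot products fused into vector FMA instructions; the lowering itself is the part this sketch illustrates.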

Key words: multicore vector accelerator, convolutional neural network, inference algorithm, training algorithm
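The DMA double-buffered transfer strategy named in the abstract follows a well-known pattern: while the compute units consume one buffer, the DMA engine fills the other, hiding transfer latency behind computation. A minimal software simulation of that pattern, with assumed `fetch` and `compute` callables standing in for the DMA transfer and the kernel (not the paper's implementation), is sketched below.

```python
# Illustrative sketch (assumed, not from the paper): the ping-pong
# double-buffering pattern behind DMA double-buffered transfers. On real
# hardware fetch() would be an asynchronous DMA into one buffer while
# compute() runs on the other; here the overlap is only simulated.

def double_buffered(tiles, fetch, compute):
    """Process tiles with two buffers: fetch tile i+1 while computing tile i."""
    if not tiles:
        return []
    results = []
    buf = fetch(tiles[0])          # prime the first buffer
    for nxt in tiles[1:]:
        nxt_buf = fetch(nxt)       # would overlap with compute() on hardware
        results.append(compute(buf))
        buf = nxt_buf              # swap ping/pong buffers
    results.append(compute(buf))   # drain the last buffer
    return results

tiles = [0, 1, 2, 3]
out = double_buffered(tiles, fetch=lambda t: t * 10, compute=lambda b: b + 1)
print(out)  # → [1, 11, 21, 31]
```

The point of the pattern is that, with asynchronous transfers, total time approaches max(transfer, compute) per tile rather than their sum.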