• Journal of the China Computer Federation
• Chinese Science and Technology Core Journal
• Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (04): 577-581.

• High Performance Computing •

A universal design on hardware acceleration of convolutional neural networks

WANG Yu-lei (王玉雷), XIE Kai-liang (谢凯亮), CHEN Si-yun (陈思贇), HU Jie (胡杰), CHANG Sheng (常胜)

  1. (School of Physics and Technology, Wuhan University, Wuhan 430072, China)
  • Received: 2022-01-19 Revised: 2022-05-09 Accepted: 2023-04-25 Online: 2023-04-25 Published: 2023-04-13

Abstract: With the rise of artificial intelligence, neural network algorithms for a wide range of application scenarios are developing rapidly. This makes general-purpose accelerated edge deployment of such algorithms, of which convolutional neural networks are representative, a major challenge. To address this, general and universal design rules based on the principle of data correlation and the Roofline model are proposed, and neural networks are parallelized for hardware acceleration according to these rules. The three most important parts, namely the convolution layer, the pooling layer and the fully connected layer, are optimized. From the optimized modules, various convolutional neural networks can be built according to the requirements of the application scenario, which achieves a universal design. With the LeNet-5 network as the verification object and the MNIST test set as the benchmark, verification was carried out on the XILINX ZC702 and XILINX ZC706 FPGA platforms. The interactive recognition system, built with high-level synthesis after each layer was optimized, achieves 95.09% accuracy and an inference speed of 4.1 ms per image on the XILINX ZC702 platform, and the same accuracy with an inference speed of 0.997 ms per image on the XILINX ZC706 platform. Both platforms therefore reach high processing speeds.

Key words: neural network, hardware acceleration, universal design, FPGA, high-level synthesis, Roofline, data correlation
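
The abstract names the Roofline model as one basis of the design rules but does not spell it out. As a rough illustration only, the sketch below applies the standard Roofline bound, attainable performance = min(peak compute, memory bandwidth × operational intensity), to a single convolution layer. The layer shape is that of the first convolution layer of the classic LeNet-5 (6 output maps of 28×28, one input map, 5×5 kernels), which is an assumption because this page gives no layer dimensions. The peak-compute and bandwidth figures are hypothetical placeholders, not ZC702 or ZC706 specifications, and the function names are invented for this example.

```python
def conv_ops(out_maps, in_maps, out_rows, out_cols, kernel):
    """Operation count of a convolution layer (one multiply + one add per MAC)."""
    return 2 * out_maps * in_maps * out_rows * out_cols * kernel * kernel


def conv_bytes(out_maps, in_maps, out_rows, out_cols, kernel, bytes_per_value=4):
    """Off-chip traffic, assuming every weight, input and output moves exactly once.

    Assumes stride 1 and no padding, so the input is (rows + kernel - 1) wide.
    """
    in_rows = out_rows + kernel - 1
    in_cols = out_cols + kernel - 1
    weights = out_maps * in_maps * kernel * kernel
    inputs = in_maps * in_rows * in_cols
    outputs = out_maps * out_rows * out_cols
    return (weights + inputs + outputs) * bytes_per_value


def attainable_performance(peak_ops_per_s, bandwidth_bytes_per_s, intensity):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * intensity)


if __name__ == "__main__":
    # Assumed shape: first convolution layer of the classic LeNet-5.
    ops = conv_ops(6, 1, 28, 28, 5)
    traffic = conv_bytes(6, 1, 28, 28, 5)
    intensity = ops / traffic  # operations per byte

    # Hypothetical platform numbers, chosen only to exercise the formula.
    peak = 100e9       # operations per second
    bandwidth = 4e9    # bytes per second

    bound = attainable_performance(peak, bandwidth, intensity)
    regime = "memory-bound" if bound < peak else "compute-bound"
    print(f"intensity = {intensity:.2f} ops/byte, bound = {bound:.3e} ops/s ({regime})")
```

Under this model, a layer whose operational intensity falls left of the ridge point (peak divided by bandwidth) gains more from data reuse than from additional parallel multipliers, which is the kind of trade-off a Roofline-based design rule is meant to expose.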