• Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (01): 12-20.

• High Performance Computing •

Design of convolutional neural network acceleration system based on heterogeneous platform

QIN Wen-qiang1, WU Zhong-cheng2,3, ZHANG Jun2,3, LI Fang2,3

  (1. Institute of Physical Science and Information Technology, Anhui University, Hefei 230601;
    2. Center for High Magnetic Field Science, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031;
    3. High Magnetic Field Laboratory of Anhui Province, Hefei 230031, China)
  • Received: 2022-09-27 Revised: 2023-02-22 Accepted: 2024-01-25 Online: 2024-01-25 Published: 2024-01-15
  • Funding:
    Key R&D Program of the Hefei Science Center, Chinese Academy of Sciences (2019HSC-KPRD003); Hefei Comprehensive National Science Center Project (QGCYY04)

Abstract: Deploying convolutional neural networks (CNNs) on embedded devices with limited computing and storage resources leads to slow execution, low computational efficiency, and high power consumption. This paper proposes a novel CNN acceleration architecture based on a heterogeneous platform, and designs and implements a lightweight CNN acceleration system based on MobileNet. First, to reduce hardware resource consumption and data transmission costs, dynamic fixed-point quantization and batch normalization fusion are applied to optimize the network model, which also reduces the hardware design complexity of the acceleration system. Second, convolution tiling (block partitioning), parallel convolution computation, and data flow optimization are implemented to improve the efficiency of the convolution operations and the system throughput. Experimental results on the PYNQ-Z2 platform show that the MobileNet inference scheme implemented by this acceleration system recognizes a single image in 0.18 s at a system power consumption of 2.62 W, a 128-fold speedup over a single-core ARM processor.
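The dynamic fixed-point quantization mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the word length or the rounding scheme, so the 8-bit width, round-to-nearest with saturation, and the helper names `choose_frac_bits`, `quantize`, and `dequantize` below are assumptions for illustration, not the authors' implementation. The idea shown is that each group of values (for example, one layer's weights) gets its own fractional length, chosen from the largest magnitude in that group.

```python
import math

def choose_frac_bits(values, total_bits=8):
    """Pick a fractional length for one group of values (assumed 8-bit words).

    One bit is reserved for the sign; the integer part gets just enough bits
    to cover the largest magnitude, and the remaining bits are fractional.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    int_bits = math.floor(math.log2(max_abs)) + 1  # may be negative for small values
    return total_bits - 1 - int_bits

def quantize(values, frac_bits, total_bits=8):
    """Round to the nearest fixed-point code and saturate to the signed range."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    scale = 2.0 ** frac_bits
    return [max(lo, min(hi, round(v * scale))) for v in values]

def dequantize(codes, frac_bits):
    """Map integer codes back to real values for error inspection."""
    scale = 2.0 ** frac_bits
    return [q / scale for q in codes]
```

Calling `choose_frac_bits` separately on weights and on activations gives each group a different scaling, which is what makes the format "dynamic"; a group with small values receives more fractional bits and therefore finer resolution.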


Key words: field programmable gate array (FPGA), Vivado high-level synthesis, convolutional neural network, heterogeneous platform, hardware acceleration
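The batch normalization fusion named in the abstract is commonly done by folding the normalization parameters into the preceding convolution's weights and biases, so that no separate normalization step is needed at inference time. The sketch below shows that standard folding under stated assumptions: the per-channel flat weight layout, the `eps` value, and the function names `fold_batchnorm`, `conv_dot`, and `batchnorm` are hypothetical and are not taken from the paper.

```python
import math

def fold_batchnorm(weights, biases, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel batch-norm parameters into conv weights and biases.

    weights: one flat list of kernel values per output channel (assumed layout).
    Returns (folded_weights, folded_biases).
    """
    folded_w, folded_b = [], []
    for w_c, b_c, g, bt, mu, v in zip(weights, biases, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)           # per-channel BN scale
        folded_w.append([w * scale for w in w_c])
        folded_b.append((b_c - mu) * scale + bt)
    return folded_w, folded_b

def conv_dot(w_c, b_c, patch):
    """One output value of a convolution: dot product of kernel and patch, plus bias."""
    return sum(w * x for w, x in zip(w_c, patch)) + b_c

def batchnorm(y, g, bt, mu, v, eps=1e-5):
    """Reference inference-time batch normalization of a single value."""
    return g * (y - mu) / math.sqrt(v + eps) + bt
```

For any input patch, `conv_dot` with the folded parameters should match `batchnorm` applied to the original `conv_dot` output up to floating-point rounding, which is why the fused form can replace the two-step computation in hardware.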