  • Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science, 2021, Vol. 43, Issue (03): 389-397.

• High Performance Computing •

A CNN accelerator based on 3D scalable PE array

SU Zi-pei, YANG Xin, CHEN Di-hu, SU Tao

  1. (School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou 510275, China)

  • Received: 2020-04-30 Revised: 2020-06-28 Accepted: 2021-03-25 Online: 2021-03-25 Published: 2021-03-26
  • Supported by:
    Guangdong Province Science and Technology Plan Major Project (2017B090909005, 2019B010140002)

Abstract: Convolutional neural networks (CNNs) have large parameter counts and heavy computational loads. When they are deployed on mobile devices, power consumption and chip area must be minimized while still meeting the required frame rate (speed). Taking into account compatibility with current mobile-oriented networks, performance, and area, this paper designs a CNN accelerator based on a 3D scalable PE array. The accelerator supports 3×3 convolution, 3×3 depthwise separable convolution, 1×1 convolution, and fully connected layers. The parallelism of its PE array can be set in each of the three dimensions according to the target network and the hardware constraints, so as to achieve better performance. With 512 PEs, the accelerator achieves 76.52 GOPS (74.72% performance efficiency) running yolo-v2 and 78.05 GOPS (76.22% performance efficiency) running mobile-net-v1. Finally, the accelerator is used to build a real-time object detection system: the yolo-lite network is deployed on a XILINX Zynq-7000 SoC ZC706 hardware development platform, where the CNN computation reaches 53.65 fps.

Key words: CNN accelerator, 3D PE array, object detection, SoC
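
The abstract's performance-efficiency figures and its idea of choosing a parallelism per array dimension can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the authors' method: the abstract does not state the clock frequency, so 100 MHz is a guess that happens to make the reported numbers roughly consistent; one multiply-accumulate is counted as 2 operations; and the three dimensions are hypothetically taken to be output channels, input channels, and output pixels per row. The function names (`peak_gops`, `performance_efficiency`, `tile_steps`, `best_parallelism`) are invented for this example.

```python
from math import ceil, prod


def peak_gops(num_pes, clock_mhz, ops_per_mac=2):
    """Peak throughput, assuming each PE finishes one MAC per cycle."""
    return num_pes * ops_per_mac * clock_mhz / 1000.0


def performance_efficiency(achieved_gops, num_pes, clock_mhz):
    """Ratio of achieved throughput to the assumed peak throughput."""
    return achieved_gops / peak_gops(num_pes, clock_mhz)


def tile_steps(layers, par):
    """Total tile iterations over all layers.

    Each layer is a hypothetical (out_channels, in_channels, out_pixels)
    triple; a dimension of size s mapped onto parallelism p needs
    ceil(s / p) passes, so padding waste shows up as extra steps.
    """
    return sum(prod(ceil(s / p) for s, p in zip(layer, par)) for layer in layers)


def factor_triples(n):
    """Yield every (a, b, c) with a * b * c == n."""
    for a in range(1, n + 1):
        if n % a:
            continue
        rest = n // a
        for b in range(1, rest + 1):
            if rest % b == 0:
                yield (a, b, rest // b)


def best_parallelism(layers, num_pes):
    """Brute-force search for the 3D split of num_pes with the fewest steps."""
    return min(factor_triples(num_pes), key=lambda par: tile_steps(layers, par))


if __name__ == "__main__":
    # 100 MHz is an assumed clock, not a value given in the abstract.
    for name, gops in [("yolo-v2", 76.52), ("mobile-net-v1", 78.05)]:
        eff = performance_efficiency(gops, num_pes=512, clock_mhz=100)
        print(f"{name}: {eff:.2%} of assumed peak")

    # Made-up layer shapes, only to exercise the search.
    layers = [(32, 3, 416), (64, 32, 208), (128, 64, 104)]
    print(best_parallelism(layers, 512))
```

Under these assumptions the computed efficiencies land close to the 74.72% and 76.22% reported in the abstract. The search is deliberately naive: a real design-space exploration would also have to account for memory bandwidth, on-chip buffer sizes, and the per-layer dataflow, none of which the abstract specifies.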