• Journal of the China Computer Federation
• Chinese Science and Technology Core Journal
• Chinese Core Journal

Computer Engineering & Science, 2021, Vol. 43, Issue 12: 2139-2149.

• High Performance Computing •

  • Funding:
    National Natural Science Foundation of China (61972282)

Design and FPGA implementation of YOLOv3-tiny hardware acceleration

CHEN Hao-min¹, YAO Sen-jing¹, XI Yu¹, ZHANG Fan¹, XIN Wen-cheng¹, WANG Long-hai², REN Chao²

  1. China Southern Power Grid Digital Grid Research Institute Limited Company, Guangzhou 510623, China
  2. School of Electrical Automation and Information Engineering, Tianjin University, Tianjin 300072, China
  • Received: 2020-11-10; Revised: 2021-01-04; Accepted: 2021-12-25; Online: 2021-12-25; Published: 2021-12-31

Abstract: YOLOv3-tiny offers excellent object detection capability, but the model still demands considerable computing power, which makes it difficult to deploy in embedded applications. This paper proposes a hardware acceleration method for YOLOv3-tiny and implements it on an FPGA platform. First, for the fixed-point design of the network, with data accuracy and resource consumption as the design criteria, different fixed-point strategies are proposed based on statistics of the data distribution in the model and a classification of the data types. Second, for the parallel design of the network, the computational characteristics of convolutional neural networks are analyzed, and loop reordering, loop tiling, loop unrolling, and array partitioning are used to design a scalable, general-purpose hardware computing unit architecture. Third, for the pipelined design of the network, both the inter-layer and intra-layer levels are studied, and a flexible pipelined computing architecture is designed based on the inter-layer data flow direction and the intra-layer task partitioning. Finally, experiments on the XILINX XC7Z020CLG400-1 platform show a speedup of up to 290.56 over a single-core ARM-A9 processor running at 667 MHz.


Key words: YOLOv3-tiny, convolutional neural network, field programmable gate array (FPGA), hardware acceleration
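
The fixed-point step summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a signed 16-bit word and a per-tensor choice of fractional bits driven by the largest magnitude observed in the data; the abstract states neither the word width nor the exact strategies. The function names `choose_frac_bits`, `quantize`, and `dequantize` are hypothetical.

```python
import math


def choose_frac_bits(values, total_bits=16):
    """Pick the most fractional bits that still let max|v| fit in a signed word.

    Assumption: one format per tensor, chosen from the observed value range.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    # Integer bits needed for the magnitude (the sign bit is counted separately).
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    return total_bits - 1 - int_bits


def quantize(values, frac_bits, total_bits=16):
    """Scale by 2**frac_bits, round to nearest, and saturate to the word range."""
    scale = 2.0 ** frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return [min(hi, max(lo, round(v * scale))) for v in values]


def dequantize(quantized, frac_bits):
    """Map fixed-point integers back to real values."""
    scale = 2.0 ** frac_bits
    return [q / scale for q in quantized]


# Two data types with different ranges end up with different formats.
weights = [0.75, -0.5, 0.125]      # small magnitudes: more fractional bits
activations = [3.7, -1.2, 0.05]    # larger magnitudes: more integer bits

w_frac = choose_frac_bits(weights)
a_frac = choose_frac_bits(activations)
w_q = quantize(weights, w_frac)
a_q = quantize(activations, a_frac)
```

Choosing the format per data type trades precision against range: values confined to (-1, 1) can use every non-sign bit for the fraction, while wider-range data gives up fractional bits to avoid overflow.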