• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

计算机工程与科学 (Computer Engineering & Science)

• High Performance Computing •

Design and Simulation of an SoC Architecture for Deep Learning

崔浩然,李涵,冯煜晶,吴萌,王超,陶冠良,张志敏   

  1. (Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100094)
  • Received: 2018-08-25 Revised: 2018-10-17 Online: 2019-01-25 Published: 2019-01-25

Design and simulation of a deep learning SoC architecture
 

CUI Haoran,LI Han,FENG Yujing,WU Meng,WANG  Chao,TAO Guanliang,ZHANG Zhimin   

  1. (Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100094, China)
  • Received: 2018-08-25 Revised: 2018-10-17 Online: 2019-01-25 Published: 2019-01-25

Abstract:

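The explosive growth of information in the Internet era and the spread of deep learning have left traditional general-purpose computing unable to keep up with large-scale, high-concurrency computing demands. Heterogeneous computing can unlock greater computing power for deep learning, meet higher performance requirements, and serve a wider range of computing scenarios. Targeting deep learning algorithms, this work designs and simulates a complete heterogeneous computing SoC architecture. First, the computational characteristics of commonly used deep learning algorithms such as GoogleNet, LSTM, and SSD are analyzed and distilled into a finite set of common operator classes, which are presented in charts and structural block diagrams; a pseudo-instruction stream at the minimum-operator level is generated at the same time. Second, based on the extracted algorithm characteristics, a hardware-accelerated AI IP core for deep learning is designed and a heterogeneous computing SoC architecture is built. Finally, experiments on a simulation modeling platform show that the SoC system achieves a performance-to-power ratio greater than 1.5 TOPS/W and can process 10 channels of 1080p 30 fps video frame by frame with the GoogleNet algorithm, with an end-to-end processing time of no more than 30 ms per frame.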

Keywords: heterogeneous computing, deep learning, acceleration unit, simulation modeling

Abstract:

The explosive growth of information in the Internet era and the popularization of deep learning have left traditional general-purpose computing unable to meet large-scale, high-concurrency computing requirements. Heterogeneous computing can release greater computing power for deep learning, satisfy higher performance requirements, and be applied to a wider range of computing scenarios. We design and simulate a complete heterogeneous SoC architecture for deep learning. First, we analyze the computational features of commonly used deep learning algorithms such as GoogleNet, VGG, and SSD, and summarize them into a limited number of common operator classes, which are presented in charts and structure diagrams; we also generate a pseudo-instruction stream at the minimum-operator level. Then, based on the extracted algorithm features, we design a hardware-accelerated AI IP core for deep learning and construct a heterogeneous computing SoC architecture. Finally, experimental verification on the simulation modeling platform shows that the performance-to-power ratio of the SoC system is greater than 1.5 TOPS/W. The system can process 10 channels of 1080p 30 fps video frame by frame with the GoogleNet algorithm, with an end-to-end processing time of no more than 30 ms per frame.
 

Key words: heterogeneous computing, deep learning, acceleration unit, simulation modeling
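
The throughput figures quoted in the abstract can be put in context with a little arithmetic. The sketch below is not taken from the paper: only the three constants (10 channels, 30 fps, 30 ms) come from the abstract, while the function names and the use of Little's law to estimate concurrency are illustrative assumptions. It treats every frame as taking the full 30 ms, so the in-flight count is an upper-bound estimate rather than a measured value.

```python
# Back-of-the-envelope check of the video workload quoted in the abstract.
# Only these three constants come from the abstract; the rest is derived.
STREAMS = 10            # video channels processed concurrently
FPS_PER_STREAM = 30     # frames per second per channel
MAX_LATENCY_MS = 30.0   # stated upper bound on end-to-end time per frame


def aggregate_fps(streams: int, fps: int) -> int:
    """Total frames per second arriving across all channels."""
    return streams * fps


def frame_interval_ms(fps: int) -> float:
    """Time between consecutive frames of one channel, in milliseconds."""
    return 1000.0 / fps


def frames_in_flight(total_fps: int, latency_ms: float) -> float:
    """Little's law (assumed model): arrival rate times time in system."""
    return total_fps * latency_ms / 1000.0


total = aggregate_fps(STREAMS, FPS_PER_STREAM)
interval = frame_interval_ms(FPS_PER_STREAM)
in_flight = frames_in_flight(total, MAX_LATENCY_MS)

print(f"aggregate load: {total} frames/s")
print(f"per-channel frame interval: {interval:.2f} ms")
print(f"frames in flight at {MAX_LATENCY_MS:.0f} ms latency: {in_flight:.1f}")
```

Under these assumptions the aggregate load is 300 frames/s, and the 30 ms bound is shorter than the 33.33 ms interval between frames of a single channel, so each frame can finish before the next frame of the same channel arrives. Sustaining all ten channels at that latency would keep roughly nine frames in processing at once on average.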