
Computer Engineering & Science

• High Performance Computing •


A configurable convolutional neural network accelerator based on tiling dataflow

LI Yihuang,MA Sheng,GUO Yang,CHEN Guilin,XU Rui   

1. (School of Computer, National University of Defense Technology, Changsha 410073, China)
  • Received: 2018-11-23  Revised: 2019-01-04  Online: 2019-06-25  Published: 2019-06-25
  • Supported by: the National Natural Science Foundation of China (61672526) and the University Pre-research Fund (ZK170306)


Abstract:

Convolutional neural networks (CNNs) are widely recognized as the best algorithms for deep learning and are widely used in image recognition, automatic translation and advertisement recommendation. As network structures grow in scale, the numbers of neurons and synapses grow accordingly, so using dedicated acceleration hardware to exploit the parallelism of neural networks has become a popular choice. In hardware design, the classic tiling dataflow achieves high performance, but the utilization of its processing elements (PEs) is very low. As deep learning applications place ever higher demands on hardware performance, accelerators face increasingly strict requirements on PE utilization. To raise utilization on the tiling dataflow, the order of parallel execution can be changed so that input feature maps and output channels are processed in parallel, increasing computational parallelism. However, as the performance demands of neural network computation grow, the PE array inevitably becomes larger and larger. Once the array exceeds a certain size, a single fixed parallel scheme causes utilization to decline gradually, so the hardware must exploit more dimensions of neural network parallelism to keep PEs from idling. At the same time, to accommodate different network structures, the hardware array must be configurable for neural network computation; yet configurable hardware can greatly increase hardware overhead and complicate data scheduling. We therefore propose a neural network accelerator with configurable parallelism based on the tiling dataflow. To reduce hardware complexity, we propose a partial configuration technique, which raises PE utilization for large arrays while adding as little hardware overhead as possible. For array sizes beyond 512 PEs, utilization is maintained at 82% to 90% on average, and accelerator performance scales almost linearly with the number of PEs.
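To make the utilization argument concrete, the following minimal Python sketch (our illustration, not the authors' hardware model) assumes a hypothetical layer with M output channels and N input feature maps, plus a few illustrative array sizes, and models the fraction of busy PEs when one versus two parallel dimensions are folded onto the array:

    import math

    def pe_utilization(num_pes: int, parallel_dims: list[int]) -> float:
        """Fraction of PEs doing useful work when the product of the exposed
        parallel dimensions is folded onto an array of num_pes elements.
        Idle slots appear in the final pass when the work does not divide evenly."""
        work = math.prod(parallel_dims)      # parallel iterations available per step
        passes = math.ceil(work / num_pes)   # array passes needed to cover them
        return work / (passes * num_pes)

    # Hypothetical layer shape: M output channels, N input feature maps.
    M, N = 112, 18

    for pes in (128, 256, 512, 1024, 2048):
        single = pe_utilization(pes, [M])    # parallel over output channels only
        dual = pe_utilization(pes, [M, N])   # output channels x input feature maps
        print(f"{pes:4d} PEs: single-dim {single:6.1%}, dual-dim {dual:6.1%}")

In this toy model, a single parallel dimension saturates once the array outgrows it (utilization falls toward M/num_pes), while folding a second dimension keeps the array nearly full. Configurable parallelism targets exactly this effect, and the partial configuration technique limits the overhead that full configurability would incur.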
 

Key words: CNN, tiling dataflow, configurable, PE utilization, parallelism