
Computer Engineering & Science (计算机工程与科学)

• High Performance Computing •

A GPU-based high-performance optimization method of sparse convolutional neural networks

FANG Cheng (方程), XING Zuocheng (邢座程), CHEN Xuhao (陈顼颢), ZHANG Yang (张洋)

  1. (College of Computer, National University of Defense Technology, Changsha 410073, China)
  • Received: 2018-06-21  Revised: 2018-08-15  Online: 2018-12-25  Published: 2018-12-25
  • Supported by: the National Natural Science Foundation of China (61170083)


Abstract:

As an important branch of neural networks, the convolutional neural network (CNN) is better suited to learning and expressing image features than other neural network methods. As CNNs continue to develop, they face growing challenges: their parameter scale keeps increasing, which makes their computational demand enormous. Many methods have therefore been proposed to compress CNN models; however, the compressed models usually contain a number of sparse data structures, which hurt CNN performance on the GPU. To solve this problem, we adopt the direct sparse convolution algorithm proposed in 2017 to accelerate the GPU's processing of sparse data. Following the characteristics of this algorithm, we transform the convolution operation into inner products between sparse vectors and dense vectors, and implement it on the GPU platform. Our optimization makes full use of data sparsity and the network structure to allocate threads for task scheduling, and exploits data locality to manage memory replacement, so that the GPU can still process convolutional layers efficiently in a sparse CNN (SCNN). Compared with the cuBLAS implementation, our proposal achieves speedups of 1.07×~1.23×, 1.17×~3.51× and 1.32×~5.00× on AlexNet, GoogleNet and ResNet, respectively. Compared with the cuSPARSE implementation, it achieves speedups of 1.31×~1.42×, 1.09×~2.00× and 1.07×~3.22× on AlexNet, GoogleNet and ResNet, respectively.
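
To make the mapping concrete, the sketch below shows one way a direct sparse convolution can be expressed as sparse-dense inner products on the GPU: each output channel keeps only its nonzero weights (CSR-style), each nonzero stores a precomputed flattened offset into the input tensor, and each thread accumulates one output pixel. This is a minimal illustration under assumed data layouts (stride-1, pre-padded C×H×W input, one thread per output pixel, one grid row per output channel), not the authors' implementation; all kernel and variable names (sparseConvKernel, rowPtr, offsets, etc.) are hypothetical.

    // Minimal direct-sparse-convolution sketch (CUDA), under the assumptions above.
    // Weights of output channel m: nonzeros in values[rowPtr[m] .. rowPtr[m+1]),
    // with offsets[j] = c*H*W + r*W + s, i.e. the flattened (channel, row, col)
    // position of that nonzero inside the padded input tensor.
    #include <cuda_runtime.h>

    __global__ void sparseConvKernel(const float* __restrict__ input,   // padded input, C*H*W
                                     const float* __restrict__ values,  // nonzero weights
                                     const int*   __restrict__ offsets, // flattened (c,r,s) offsets
                                     const int*   __restrict__ rowPtr,  // per-output-channel ranges, size M+1
                                     float* __restrict__ output,        // M*Hout*Wout
                                     int W, int Hout, int Wout)
    {
        int m   = blockIdx.y;                               // output channel handled by this grid row
        int pix = blockIdx.x * blockDim.x + threadIdx.x;    // output pixel handled by this thread
        if (pix >= Hout * Wout) return;

        int y = pix / Wout, x = pix % Wout;
        const float* in0 = input + y * W + x;               // top-left corner of the receptive field

        // Inner product of the sparse weight row with the dense input window.
        float acc = 0.0f;
        for (int j = rowPtr[m]; j < rowPtr[m + 1]; ++j)
            acc += values[j] * in0[offsets[j]];

        output[m * Hout * Wout + pix] = acc;
    }

    // Illustrative launch: one thread per output pixel, one grid row per output channel.
    // dim3 grid((Hout * Wout + 255) / 256, M);
    // sparseConvKernel<<<grid, 256>>>(d_in, d_vals, d_offs, d_rowPtr, d_out, W, Hout, Wout);

Mapping output channels to grid rows and output pixels to threads reflects the idea of using the network structure for task scheduling, and adjacent threads read neighboring input elements, which is the kind of data locality the memory-management strategy above relies on.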

Key words: convolutional neural network, sparse, parallel, optimization, GPU