• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)

• High Performance Computing

3D-MMA: Matrix multiplication accelerator architecture based on 3D integrated circuits

WANG Ji-jun (王吉军), HAO Zi-yu (郝子宇), LI Hong-liang (李宏亮)

  1. (Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214083, China)
  • Received: 2019-07-06 Revised: 2019-09-16 Online: 2019-12-25 Published: 2019-12-25
  • Supported by:

    National Science and Technology Major Project (2018ZX01028-102)

Abstract:

Systolic arrays have a regular structure and high throughput, which suits matrix multiplication, so they are widely used in high-performance convolution and matrix multiplication accelerators. In deep submicron processes, however, raising chip compute performance by enlarging the array lowers the clock frequency and sharply increases power consumption. We therefore use 3D integrated circuit technology and propose 3D-MMA, a double-precision floating-point matrix multiplication accelerator that maps planar systolic arrays onto a 3D integrated circuit. First, we design a blocked mapping and scheduling algorithm for this structure to improve matrix multiplication efficiency. Second, we present an acceleration system based on 3D-MMA, build an analytical performance model of it, and use the model to explore the design space quantitatively. Finally, we evaluate the implementation cost of 3D-MMA and compare it with existing advanced accelerators. Experimental results show that, with a memory bandwidth of 160 GB/s, a stack of four layers of 16×16 systolic arrays reaches a peak performance of 3 TFLOPS at 99% efficiency, and its implementation cost is lower than that of a two-dimensional implementation. Under the same process, 3D-MMA delivers 1.36 times the performance of a linear array accelerator and 1.92 times that of a K40 GPU, while occupying far less area than either. This work explores the advantages of 3D integrated circuits for high-performance matrix multiplication accelerators and offers a reference for further improving the performance of high-performance computing platforms.
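The headline figures in the abstract can be related to each other with a little arithmetic. The sketch below is only a back-of-envelope check, not the paper's performance model: it counts the processing elements (PEs) in a 4-layer 16×16 stack, derives the clock rate that a 3 TFLOPS peak would imply, and computes the arithmetic intensity needed to sustain that peak from 160 GB/s of memory bandwidth. The assumption that each PE completes one fused multiply-add (2 FLOPs) per cycle is ours; the abstract does not state the clock frequency or the PE design.

```python
# Back-of-envelope check of the abstract's figures (illustrative only).

LAYERS = 4                      # stacked layers, from the abstract
ROWS = COLS = 16                # systolic array size per layer, from the abstract
PEAK_FLOPS = 3 * 10**12         # 3 TFLOPS peak, from the abstract
BANDWIDTH_BYTES = 160 * 10**9   # 160 GB/s memory bandwidth, from the abstract
FLOPS_PER_PE_PER_CYCLE = 2      # ASSUMPTION: one multiply-add per PE per cycle


def pe_count(layers: int, rows: int, cols: int) -> int:
    """Total number of processing elements in the stacked arrays."""
    return layers * rows * cols


def implied_clock_hz(peak_flops: float, pes: int, flops_per_pe: int) -> float:
    """Clock rate at which `pes` PEs would reach `peak_flops`."""
    return peak_flops / (pes * flops_per_pe)


def required_intensity(peak_flops: float, bandwidth_bytes: float) -> float:
    """FLOPs that must be performed per byte fetched to stay compute-bound."""
    return peak_flops / bandwidth_bytes


if __name__ == "__main__":
    pes = pe_count(LAYERS, ROWS, COLS)
    clock = implied_clock_hz(PEAK_FLOPS, pes, FLOPS_PER_PE_PER_CYCLE)
    intensity = required_intensity(PEAK_FLOPS, BANDWIDTH_BYTES)
    print(f"PEs: {pes}")
    print(f"Implied clock: {clock / 1e9:.2f} GHz")
    print(f"Required intensity: {intensity:.2f} FLOP/byte")
```

Under these assumptions the stack holds 1024 PEs, and the required intensity of 18.75 FLOP/byte shows why data reuse through blocking matters: without it, a 160 GB/s memory interface could not keep the arrays busy.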

 

Key words: 3D integrated circuits, matrix multiplication accelerator, blocking algorithm, performance model
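The abstract does not describe the blocked mapping and scheduling algorithm itself, so the following is only a generic sketch of blocked matrix multiplication, written to illustrate the general idea. The function name `blocked_matmul`, the `tile` and `layers` parameters, and the round-robin assignment of output tiles to layers are all hypothetical stand-ins, not the authors' scheme.

```python
# Generic blocked matrix multiplication (illustrative; not the paper's algorithm).

def blocked_matmul(a, b, tile=16, layers=4):
    """Compute c = a @ b one output tile at a time.

    a is m x k and b is k x n, both given as lists of rows.
    Each output tile is assigned round-robin to one of `layers`
    hypothetical array layers; the per-layer tile counts are returned
    to show how evenly the work would be spread.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")

    c = [[0.0] * n for _ in range(m)]
    tiles_per_layer = [0] * layers
    tile_index = 0

    for i0 in range(0, m, tile):            # tile rows of c
        for j0 in range(0, n, tile):        # tile columns of c
            # ASSUMPTION: simple round-robin placement of output tiles.
            tiles_per_layer[tile_index % layers] += 1
            tile_index += 1
            for k0 in range(0, k, tile):    # accumulate over the inner dimension
                for i in range(i0, min(i0 + tile, m)):
                    for p in range(k0, min(k0 + tile, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i][j] += a_ip * b[p][j]
    return c, tiles_per_layer


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c, load = blocked_matmul(a, b, tile=1, layers=4)
    print(c)
    print(load)
```

Each output tile is accumulated independently of the others, which is what would let separate layers work on different tiles at once. A real design would also have to schedule the data movement between memory and the layers, which this sketch ignores.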