3D-MMA: Matrix multiplication accelerator architecture based on 3D integrated circuits

Computer Engineering & Science

WANG Ji-jun, HAO Zi-yu, LI Hong-liang

(Jiangnan Institute of Computing Technology, Wuxi 214083, Jiangsu, China)

Received: 2019-07-06 Revised: 2019-09-16 Online: 2019-12-25 Published: 2019-12-25
Funding:
National Science and Technology Major Project (2018ZX01028-102)

Abstract

Systolic arrays have a regular structure and high throughput, which suits matrix multiplication, and they are widely used in high-performance convolution and matrix multiplication accelerators. In deep submicron processes, enlarging the array to raise chip compute performance lowers the clock frequency and sharply increases power consumption. To address this, we combine systolic arrays with 3D integrated circuit technology and propose 3D-MMA, a double-precision floating-point matrix multiplication accelerator that maps planar systolic arrays onto a 3D integrated circuit. First, we design a blocked mapping and scheduling algorithm for this architecture to improve matrix multiplication efficiency. Second, we present an acceleration system based on 3D-MMA, build an analytical performance model for it, and use the model to explore the design space. Finally, we evaluate the implementation cost of the architecture and compare it with existing advanced accelerators. Experimental results show that, with a memory bandwidth of 160 GB/s, a stack of four 16×16 systolic array layers reaches a peak performance of 3 TFLOPS at 99% efficiency, with an implementation cost lower than that of a planar (2D) implementation. Under the same process, 3D-MMA delivers 1.36 times the performance of a linear array accelerator and 1.92 times that of the K40 GPU, with a much smaller area than either. This work explores the advantages of 3D integrated circuits in the design of high-performance matrix multiplication accelerators and offers a useful reference for further improving the performance of future high-performance computing platforms.

Key words: 3D integrated circuits, matrix multiplication accelerator, blocking algorithm, performance model
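To make the two ideas named in the abstract more concrete (blocked scheduling across stacked arrays, and a bandwidth-aware performance model), here is a minimal Python sketch. It is not the paper's algorithm or model: the function names `blocked_matmul` and `attainable_flops`, the round-robin assignment of output tiles to layers, and the roofline-style bound are all assumptions chosen for illustration. Only the tile size (16), the layer count (4), the 3 TFLOPS peak, and the 160 GB/s bandwidth come from the abstract.

```python
# Illustrative sketch only. The round-robin tile-to-layer assignment and the
# roofline-style bound are assumptions, not the scheduling algorithm or the
# performance model described in the paper.

def blocked_matmul(a, b, tile=16, layers=4):
    """Multiply a (n x k) by b (k x m) one output tile at a time.

    Each output tile is assigned to one of `layers` hypothetical array
    layers in round-robin order. Returns (c, tiles_per_layer).
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    tiles_per_layer = [0] * layers
    tile_id = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            tiles_per_layer[tile_id % layers] += 1
            tile_id += 1
            # Accumulate partial products over the shared dimension,
            # one tile-deep slab at a time.
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ip * row_b[j]
    return c, tiles_per_layer


def attainable_flops(peak_flops, bandwidth_bytes, flops_per_byte):
    """Roofline-style bound: the lower of compute peak and memory-fed rate."""
    return min(peak_flops, bandwidth_bytes * flops_per_byte)
```

Under this simple bound, and taking 1 GB as 10^9 bytes, a 3 TFLOPS peak fed at 160 GB/s is reached only when the computation performs at least 3×10^12 / 160×10^9 = 18.75 floating-point operations per byte fetched from memory. That is the kind of data reuse a blocking scheme has to provide; how 3D-MMA actually achieves it is described in the paper itself, not in this sketch.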

WANG Ji-jun, HAO Zi-yu, LI Hong-liang. 3D-MMA: Matrix multiplication accelerator architecture based on 3D integrated circuits[J]. Computer Engineering & Science.
