A systolic array optimization strategy for switching matrix blocks in advance

Computer Engineering & Science, 2023, Vol. 45, Issue (01): 1-9.

JU Xin, CAO Ya-song, WEN Mei, WANG Zhi, FENG Jing

(College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)

Received: 2022-10-13 Revised: 2022-11-15 Accepted: 2023-01-25 Online: 2023-01-25 Published: 2023-01-25

Funding: National Natural Science Foundation of China (62002366)

Abstract

The demand of AI applications for hardware computing power grows year by year, driving AI accelerators toward ever higher performance. Studies show that the dominant computation in AI applications can be transformed into matrix multiplication, and systolic arrays have become one of the mainstream matrix multiplication acceleration techniques because of their unique advantages for this operation. However, feeding a matrix into a systolic array and draining it out incurs pipeline fill and drain overhead. This is especially costly for floating-point systolic arrays that support training, whose multiply-accumulate (MAC) latency is often greater than 1: if switching between matrix blocks is not timely, processing element (PE) utilization drops sharply. To address this problem, a theoretical analysis based on typical application scenarios is carried out, and an early switching strategy between matrix blocks is proposed that can precisely compute the optimal switching time between matrix blocks in each case. An RTL design is also implemented. Experimental comparison shows that the optimized systolic array adds negligible hardware overhead while improving performance in all scenarios.

Key words: systolic array, AI, matrix multiplication (GEMM), accelerator, processing element (PE) utilization
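The utilization drop described in the abstract can be illustrated with a toy cycle-count model. The sketch below is not the paper's analysis or its switching-time formula. It assumes an output-stationary array of `rows` x `cols` PEs with skewed inputs, where each block keeps every PE busy for `k` cycles and the fill and drain overhead per block is `(rows - 1) + (cols - 1) + (mac_latency - 1)` cycles. The function names and the overhead expression are hypothetical and chosen only for illustration.

```python
def fill_drain_cycles(rows: int, cols: int, mac_latency: int = 1) -> int:
    """Assumed per-block pipeline fill/drain overhead in cycles (toy model)."""
    return (rows - 1) + (cols - 1) + (mac_latency - 1)


def pe_utilization(rows: int, cols: int, k: int, n_blocks: int,
                   mac_latency: int = 1, early_switch: bool = False) -> float:
    """Fraction of cycles in which a PE does useful MAC work.

    early_switch=False: each block waits for the previous one to drain,
    so the overhead is paid once per block.
    early_switch=True: idealized case where the next block is injected as
    soon as PEs become free, so the overhead is paid only once in total.
    """
    overhead = fill_drain_cycles(rows, cols, mac_latency)
    busy = n_blocks * k  # useful cycles per PE
    if early_switch:
        total = n_blocks * k + overhead
    else:
        total = n_blocks * (k + overhead)
    return busy / total


if __name__ == "__main__":
    for latency in (1, 4):
        late = pe_utilization(4, 4, 8, 10, mac_latency=latency)
        early = pe_utilization(4, 4, 8, 10, mac_latency=latency,
                               early_switch=True)
        print(f"MAC latency {latency}: late={late:.3f}, early={early:.3f}")
```

Under these assumptions, a larger MAC latency increases the per-block overhead, so the gap between late and early switching widens, which is consistent with the abstract's qualitative claim about floating-point arrays.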

JU Xin, CAO Ya-song, WEN Mei, WANG Zhi, FENG Jing. A systolic array optimization strategy for switching matrix blocks in advance[J]. Computer Engineering & Science, 2023, 45(01): 1-9.

