Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (01): 1-9.

• High Performance Computing •

A systolic array optimization strategy for switching matrix blocks in advance

JU Xin, CAO Ya-song, WEN Mei, WANG Zhi, FENG Jing

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
  • Received: 2022-10-13 Revised: 2022-11-15 Accepted: 2023-01-25 Online: 2023-01-25 Published: 2023-01-25
  • Supported by:
    National Natural Science Foundation of China (62002366)

Abstract: The demand of AI applications for hardware computing power grows year by year, driving AI accelerators toward ever higher performance. Research shows that the main computations in AI applications can be transformed into matrix multiplication, and the systolic array has become one of the mainstream matrix multiplication acceleration technologies because of its unique advantages in this operation. However, injecting a matrix into a systolic array and draining it out incurs pipeline fill and drain overhead. This is especially costly for floating-point systolic arrays that support training, whose multiply-accumulate (MAC) latency is often greater than 1 cycle: if switching between matrix blocks is not timely, processing element (PE) utilization drops sharply. To address this problem, a theoretical analysis based on typical application scenarios is conducted, and an early switching strategy between matrix blocks is proposed that can precisely calculate the optimal switching time between matrix blocks in each case. An RTL design is also implemented. Experimental comparison shows that the optimized systolic array adds negligible hardware overhead while improving performance in all scenarios.

Key words: systolic array, AI, matrix multiplication (GEMM), accelerator, processing element (PE) utilization
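
To make the fill and drain overhead described in the abstract concrete, the following is a minimal toy cycle-count sketch. It is not the paper's analysis or its switching-time computation. Every function name and the overhead formula are assumptions chosen for illustration: partial sums are assumed to cross each row of PEs in `mac_latency` cycles, with one extra cycle of skew per column, and "early switching" is idealized as paying that overhead only once across all blocks.

```python
"""Toy model of systolic-array PE utilization (illustrative assumptions only)."""


def fill_drain_cycles(rows: int, cols: int, mac_latency: int) -> int:
    # Assumed per-block pipeline overhead: (rows - 1) hops at mac_latency
    # cycles each, plus (cols - 1) cycles of column skew.
    return (rows - 1) * mac_latency + (cols - 1)


def serial_cycles(blocks: int, block_len: int, rows: int, cols: int,
                  mac_latency: int) -> int:
    # Late switching: each block waits until the previous block has fully
    # drained, so the overhead is paid once per block.
    return blocks * (block_len + fill_drain_cycles(rows, cols, mac_latency))


def overlapped_cycles(blocks: int, block_len: int, rows: int, cols: int,
                      mac_latency: int) -> int:
    # Idealized early switching: the next block enters while the previous
    # one drains, so the overhead is paid only once in total.
    return blocks * block_len + fill_drain_cycles(rows, cols, mac_latency)


def pe_utilization(blocks: int, block_len: int, total_cycles: int) -> float:
    # Fraction of cycles in which the array is streaming useful data.
    return blocks * block_len / total_cycles


if __name__ == "__main__":
    # Hypothetical configuration, not taken from the paper.
    cfg = dict(blocks=8, block_len=32, rows=16, cols=16, mac_latency=4)
    for name, fn in (("serial", serial_cycles), ("overlapped", overlapped_cycles)):
        total = fn(**cfg)
        util = pe_utilization(cfg["blocks"], cfg["block_len"], total)
        print(f"{name}: {total} cycles, utilization {util:.3f}")
```

Under these assumptions, the gap between the two totals widens as `mac_latency` grows or as blocks get shorter, which matches the abstract's observation that floating-point arrays with multi-cycle MACs suffer most from late switching.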