• A journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (11): 1912-1921.

• High Performance Computing •

  • Supported by: Shenzhen Peacock Plan (KQTD20200820113105004)

A hybrid matrix-vector processor with dynamically reconfigurable dataflow

AI Chenyang, ZHAO Lechuan, HUA Tao, WANG Xin’an, WANG Ying

  1. (1. School of Electronic and Computer Engineering, Peking University, Shenzhen 518000, China;
    2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China)
  • Received: 2024-11-14  Revised: 2025-01-04  Online: 2025-11-25  Published: 2025-12-04


Abstract: Systolic arrays, as energy-efficient accelerators for general matrix multiplication (GEMM), have garnered widespread attention from both academia and industry. However, they often occupy substantial area and typically must be paired with a vector processing unit (VPU), a combination frequently seen in neural network accelerators. They also suffer from low temporal and spatial utilization and limited performance in end-to-end scenarios. To address these challenges, a hybrid vector systolic array (HVSA) is proposed that integrates a systolic array with a vector processor. By reusing the storage, broadcast, and inter-channel communication units within the VPU, HVSA can be reconfigured in both array shape and dataflow, supporting GEMM and vector operations more efficiently at an acceptable hardware area overhead. Furthermore, an end-to-end compilation framework tailored for HVSA is introduced, comprising an MLIR-based compilation frontend, dataflow scheduling, and a programming model compatible with the RISC-V vector extension. Experimental results show that HVSA achieves a 30.30-fold speedup over a systolic array of equivalent area. In end-to-end applications, compared with a "VPU + systolic array" design of the same area, HVSA reduces average runtime to about 4.7% of the baseline and cuts energy consumption by approximately 58.7%.
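The systolic-array dataflow for GEMM that the abstract refers to can be illustrated with a small cycle-level simulation. This is a generic sketch of the textbook output-stationary dataflow, not the paper's reconfigurable HVSA microarchitecture: rows of A stream in from the left (skewed by row index), columns of B from the top (skewed by column index), and each processing element (PE) accumulates exactly one element of C.

```python
# Hedged illustration: a cycle-level model of an output-stationary
# systolic array computing C = A @ B. This is the generic dataflow,
# not the HVSA design described in the paper.

def systolic_gemm(A, B):
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]          # one accumulator per PE
    a_reg = [[0] * N for _ in range(M)]      # A operand latched in PE (i, j)
    b_reg = [[0] * N for _ in range(M)]      # B operand latched in PE (i, j)
    # Row i of A enters skewed by i cycles; column j of B by j cycles,
    # so A[i][k] and B[k][j] meet in PE (i, j) at cycle i + k + j.
    for t in range(M + N + K - 2):
        new_a = [[0] * N for _ in range(M)]
        new_b = [[0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                # Edge PEs read the skewed input streams; interior PEs read
                # the neighbouring register written in the previous cycle.
                a_in = a_reg[i][j - 1] if j > 0 else (A[i][t - i] if 0 <= t - i < K else 0)
                b_in = b_reg[i - 1][j] if i > 0 else (B[t - j][j] if 0 <= t - j < K else 0)
                C[i][j] += a_in * b_in       # multiply-accumulate in place
                new_a[i][j], new_b[i][j] = a_in, b_in
        a_reg, b_reg = new_a, new_b
    return C
```

For any shapes, the result matches an ordinary triple-loop matrix product; the M + N + K - 2 cycle count is the latency needed for the skewed operand schedule to drain through the array, which is one source of the low temporal utilization the paper cites for small or skinny matrices.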


Key words: general matrix multiplication (GEMM), vector operation, systolic array, vector processing unit (VPU), dataflow scheduling, compiler