• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (11): 1912-1921.

• High Performance Computing •

A hybrid matrix-vector processor with dynamically reconfigurable dataflow

AI Chenyang¹, ZHAO Lechuan, HUA Tao, WANG Xin’an, WANG Ying

  1. (1. School of Electronic and Computer Engineering, Peking University, Shenzhen 518000;
    2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China)
  • Received: 2024-11-14 Revised: 2025-01-04 Online: 2025-11-25 Published: 2025-12-04

Abstract: Systolic arrays are energy-efficient accelerators for general matrix multiplication (GEMM) operators and have attracted wide attention from both academia and industry. However, they occupy substantial area and typically have to work alongside a vector processing unit (VPU), a combination frequently seen in neural network accelerators. They also suffer from low temporal and spatial utilization and limited performance in end-to-end scenarios. To address these challenges, a hybrid vector systolic array (HVSA) is proposed that integrates a systolic array with a vector processor. By reusing the storage, broadcast, and inter-channel communication units within the VPU, the architecture makes both the array shape and the dataflow reconfigurable, supporting GEMM and vector operations more efficiently at an acceptable hardware area overhead. Furthermore, an end-to-end compilation framework tailored to HVSA is introduced, comprising an MLIR-based compilation frontend, dataflow scheduling, and a programming model compatible with the RISC-V vector extension. Experimental results show that HVSA achieves a 30.30-fold increase in computational speed over a systolic array of equivalent area. In end-to-end applications, the average execution time of HVSA is about 4.7% of that of a "VPU+SA" design of the same area, and energy consumption is reduced by approximately 58.7%.


Key words: general matrix multiplication (GEMM), vector operation, systolic array, vector processing unit (VPU), dataflow scheduling, compiler