• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

计算机工程与科学 ›› 2024, Vol. 46 ›› Issue (08): 1340-1348.

• 高性能计算 • 上一篇    下一篇

基于FPGA和行折叠的稀疏矩阵向量乘优化

周智,高建花,计卫星   

  1. (北京理工大学计算机学院,北京 100081)
  • 收稿日期:2023-11-07 修回日期:2023-12-29 接受日期:2024-08-25 出版日期:2024-08-25 发布日期:2024-09-02

Optimization of sparse matrix-vector multiplication based on FPGA and row folding

ZHOU Zhi,GAO Jian-hua,JI Wei-xing   

  1. (School of Computer Science & Technology,Beijing Institute of Technology,Beijing 100081,China)
  • Received:2023-11-07 Revised:2023-12-29 Accepted:2024-08-25 Online:2024-08-25 Published:2024-09-02

摘要: 稀疏矩阵向量乘(SpMV)是科学与工程计算中的一个关键内核。由于稀疏矩阵中不规则的数据分布和SpMV计算中不规则的访存操作,SpMV在多核CPU和GPU等设备上的性能与这些设备的理论峰值还具有较大差距。现有的CPU和GPU由于在架构上受到限制,导致它们无法很好地利用稀疏矩阵的特殊结构来加速SpMV计算,而现场可编程门阵列(FPGA)可以通过自定义电路实现高效的并行运算,能够更好地处理稀疏矩阵的计算和存储问题。基于FPGA提出了一种SpMV优化方法,该优化方法基于高级综合的流式处理引擎,采用了一种自适应多行折叠的SpMV优化策略。该方法通过行折叠减少了处理引擎中零元的无效存储和计算,从而提升了基于FPGA的SpMV计算性能。实验结果表明,相比于现有的FPGA实现方案,设计的基于行折叠优化的数据流引擎实现了最高1.78倍和平均1.15倍的加速。

关键词: 稀疏矩阵向量乘, 现场可编程门阵列, 高级综合, 行折叠

Abstract: Sparse matrix-vector multiplication (SpMV) is a key kernel in scientific and engineering computing. Due to the irregular data distribution in sparse matrices and the irregular memory access operations in SpMV calculations, the performance of SpMV on multicore CPUs and GPUs still lags significantly behind the theoretical peak performance of these devices. Existing CPUs and GPUs are limited in their architectures, making them unable to effectively utilize the special structure of sparse matrices to accelerate SpMV calculations. However, Field-Programmable gate arrays (FPGA) can achieve efficient parallel computing through customized circuits, which better handle the computation and storage issues of sparse matrices. An SpMV optimization method based on FPGA is proposed, which utilizes a high-level synthesis streaming processing engine and employs an adaptive multi-row folding SpMV optimization strategy. This method reduces the ineffective storage and computation of zero elements in the processing engine through row folding, thereby enhancing the performance of FPGA-based SpMV calculations. Experimental results show that compared to existing FPGA implementations, the proposed row folding-based dataflow engine achieves a maximum speedup of 1.78 times and an average speedup of 1.15 times.


Key words: sparse matrix-vector multiplication, field-programmable gate array, high-level synthesis, row folding