Optimization of sparse matrix-vector multiplication based on FPGA and row folding

Computer Engineering & Science, 2024, Vol. 46, Issue (08): 1340-1348.

ZHOU Zhi, GAO Jian-hua, JI Wei-xing

(School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China)

Received: 2023-11-07 Revised: 2023-12-29 Accepted: 2024-08-25 Online: 2024-08-25 Published: 2024-09-02

Abstract

Sparse matrix-vector multiplication (SpMV) is a key kernel in scientific and engineering computing. Because the nonzero elements of sparse matrices are irregularly distributed and SpMV performs irregular memory accesses, SpMV performance on multicore CPUs and GPUs still falls well short of the theoretical peak of these devices. The architectures of existing CPUs and GPUs limit how well they can exploit the special structure of sparse matrices to accelerate SpMV, whereas field-programmable gate arrays (FPGAs) can implement efficient parallel computation through custom circuits and are therefore better suited to the computation and storage problems of sparse matrices. This paper proposes an FPGA-based SpMV optimization method that builds on a streaming processing engine implemented with high-level synthesis and adopts an adaptive multi-row folding strategy. Row folding reduces the wasted storage and computation on zero elements in the processing engine, which improves the performance of FPGA-based SpMV. Experimental results show that, compared with existing FPGA implementations, the proposed row-folding dataflow engine achieves a maximum speedup of 1.78 times and an average speedup of 1.15 times.

Key words: sparse matrix-vector multiplication, field-programmable gate array, high-level synthesis, row folding
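
To make the abstract's two central ideas concrete, the following Python sketch first shows a baseline SpMV over the common CSR storage format (where the gather from `x` is the irregular access mentioned above), and then a toy accounting of zero padding in a fixed-width processing lane. The padding model is only an assumption for illustration: the abstract does not specify the paper's folding scheme, so the packet width `width`, the greedy packing rule, and the function names `padded_slots_unfolded` and `padded_slots_folded` are hypothetical and are not the authors' implementation.

```python
from math import ceil


def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # col_idx[k] is data dependent, so reads of x are irregular.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def padded_slots_unfolded(row_lengths, width):
    """Slots consumed if every row is zero-padded to a multiple of `width`.

    Assumed model: the engine consumes `width` elements per packet and a
    packet never holds elements from more than one row.
    """
    return sum(ceil(n / width) * width for n in row_lengths)


def padded_slots_folded(row_lengths, width):
    """Slots consumed if consecutive short rows may share one packet.

    Hypothetical greedy rule: keep adding whole rows to the open packet
    while they fit; rows longer than `width` get their own packets.
    """
    packets = 0
    fill = 0  # occupied slots in the currently open packet
    for n in row_lengths:
        if n > width:
            if fill:
                packets += 1  # close the partially filled packet
                fill = 0
            packets += ceil(n / width)
        elif fill + n <= width:
            fill += n  # fold this row into the open packet
        else:
            packets += 1  # close the open packet, start a new one
            fill = n
    if fill:
        packets += 1
    return packets * width


# Small 4x4 example with 7 nonzeros and row lengths 2, 1, 3, 1.
row_ptr = [0, 2, 3, 6, 7]
col_idx = [0, 3, 1, 0, 2, 3, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
x = [1.0, 2.0, 3.0, 4.0]

y = spmv_csr(row_ptr, col_idx, values, x)
row_lengths = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
unfolded = padded_slots_unfolded(row_lengths, width=4)
folded = padded_slots_folded(row_lengths, width=4)
```

In this toy case the unfolded layout occupies 16 slots for 7 nonzeros, while the folded layout occupies 8, which is the kind of reduction in wasted storage and computation on zero elements that the abstract attributes to row folding. A real design would also need per-element row tags so that partial sums from folded rows are accumulated separately; that bookkeeping is omitted here.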

ZHOU Zhi, GAO Jian-hua, JI Wei-xing. Optimization of sparse matrix-vector multiplication based on FPGA and row folding[J]. Computer Engineering & Science, 2024, 46(08): 1340-1348.
