NM-SpMM: A semi-structured sparse matrix multiplication algorithm for domestic heterogeneous vector processors

Computer Engineering & Science, 2024, Vol. 46, Issue (7): 1141-1150.

JIANG Jing-fei (姜晶菲), HE Yuan-hong (何源宏), XU Jin-wei (许金伟), XU Shi-yao (许诗瑶), QIAN Xi-fu (钱希福)

(National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)

Received: 2023-11-07 Revised: 2023-12-15 Online: 2024-07-25 Published: 2024-07-18

Abstract

Deep neural networks have achieved excellent results in natural language processing, computer vision, and other fields. As the volume of data handled by intelligent applications grows and large models develop rapidly, the demands on deep neural network inference performance keep rising. N:M semi-structured sparsification has become one of the most active techniques for balancing computing-power demand against application effectiveness. The domestic heterogeneous vector processor FT-M7032 offers considerable room for exploiting data parallelism and instruction parallelism in intelligent model processing. To handle the variety of sparse patterns that arise in N:M semi-structured sparse model computation, a flexibly configurable sparse matrix multiplication algorithm for FT-M7032, NM-SpMM, is proposed. NM-SpMM designs an efficient compressed offset address sparse encoding format, COA, which keeps the choice of semi-structured parameters from affecting sparse data memory access and computation. Based on the COA encoding, NM-SpMM applies fine-grained optimizations to sparse matrix computations of different dimensions. Experimental results on a single FT-M7032 core show that NM-SpMM achieves a speedup of 1.73x to 21.00x over dense matrix multiplication, and a speedup of 0.04x to 1.04x relative to an NVIDIA V100 GPU using the cuSPARSE sparse computation library.

Key words: deep neural network, graphics processing unit, vector processor, sparse matrix multiplication, pipeline
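To make the N:M pattern named in the abstract concrete, the following is a minimal Python sketch of generic N:M compression followed by a sparse-times-dense product. It is an illustration under stated assumptions, not the paper's method: the abstract does not describe the layout of COA, so the per-group offset list used here is only a common way to index N:M data, and the names `prune_n_m` and `nm_spmm` are hypothetical.

```python
def prune_n_m(row, n, m):
    """Keep the n largest-magnitude entries in every group of m consecutive values.

    Returns the kept values and, for each one, its offset (0..m-1) inside its group.
    This is generic N:M compression, assumed for illustration; it is not COA.
    """
    assert len(row) % m == 0, "row length must be a multiple of m"
    values, offsets = [], []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Pick the n largest |x| positions, then restore ascending order.
        keep = sorted(sorted(range(m), key=lambda i: -abs(group[i]))[:n])
        values.extend(group[i] for i in keep)
        offsets.extend(keep)
    return values, offsets


def nm_spmm(values, offsets, n, m, dense_b):
    """Multiply a row-wise compressed N:M matrix by a dense matrix (list of rows)."""
    cols = len(dense_b[0])
    out = []
    for row_vals, row_offs in zip(values, offsets):
        acc = [0.0] * cols
        for k, (v, off) in enumerate(zip(row_vals, row_offs)):
            # Group index is k // n; the original column is group base + offset.
            b_row = dense_b[(k // n) * m + off]
            for j in range(cols):
                acc[j] += v * b_row[j]
        out.append(acc)
    return out


# Usage: one row of 8 weights pruned to 2:4, multiplied by an 8 x 2 dense matrix.
vals, offs = prune_n_m([1.0, -4.0, 0.5, 3.0, 0.0, 2.0, -1.0, 0.1], n=2, m=4)
b = [[float(i), 1.0] for i in range(8)]
result = nm_spmm([vals], [offs], 2, 4, b)
```

Because every group keeps exactly N values, the inner loop runs over N/M of the original columns and the group index can be recovered from the position `k` alone, which is what makes the pattern attractive for vectorized hardware.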

JIANG Jing-fei, HE Yuan-hong, XU Jin-wei, XU Shi-yao, QIAN Xi-fu. NM-SpMM: A semi-structured sparse matrix multiplication algorithm for domestic heterogeneous vector processors[J]. Computer Engineering & Science, 2024, 46(7): 1141-1150.
