• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

计算机工程与科学 ›› 2024, Vol. 46 ›› Issue (07): 1175-1184.

• 高性能计算 • 上一篇    下一篇

一种不规则稀疏矩阵的SpMV方法

施禹1,董攀1,张利军2   

  1. (1.国防科技大学计算机学院,湖南 长沙 410073;2.中国人民解放军32228部队24分队,福建 福州 350101)
  • 收稿日期:2023-11-07 修回日期:2023-12-01 接受日期:2024-07-25 出版日期:2024-07-25 发布日期:2024-07-18
  • 基金资助:
    国防科技重点实验室稳定支持基金(WDZC20235250111);国家自然科学基金(62002371);国防科技大学基金(ZK21-17)

An irregular sparse matrix SpMV method

SHI Yu1,DONG Pan1,ZHANG Li-jun2   

  1. (1.College of Computer Science and Technology,National University of Defense Technology,Changsha 410073;
    2.24 Brigade,32228 Troop,Chinese Peoples Liberation Army,Fuzhou 350101,China)
  • Received:2023-11-07 Revised:2023-12-01 Accepted:2024-07-25 Online:2024-07-25 Published:2024-07-18

摘要: 稀疏矩阵-向量乘法SpMV是高性能计算领域的关键算子之一,在新兴的深度学习领域中有着重要应用。现有SpMV算子通常采用行列相等的稀疏矩阵,而对于不规则形状稀疏矩阵(行数与列数不等)的研究仍存在空缺,值得进一步深入探讨。相比于行列相等的稀疏矩阵,不规则形状稀疏矩阵凭借其行数与列数不对等的稀疏特点具有进一步优化的空间。因此,针对这种行数与列数不对等的不规则形状稀疏矩阵建立SpMV性能模型,分析得到其出现性能瓶颈的原因在于缓存和内存之间数据交互的带宽不足。同时做了以下2个方面的优化工作:(1)基于常用稀疏矩阵CSR存储格式,提出新型RCSR存储格式,其针对CSR存储格式中一个制约性能的数组进行了变换和压缩,使得SpMV更加高效;(2)结合国产处理器的SIMD指令扩展设计了基于RCSR格式的SpMV优化算法。在国产飞腾处理器上分别使用规则和不规则稀疏矩阵进行测试,在规则稀疏矩阵的情况下,通过采用RCSR存储格式和SIMD加速指令集,以GFLOPS为性能指标,实现了平均83.35%的性能提升;在不规则稀疏矩阵的情况下,性能提升与行列比相关,在行列不对等加剧时,具有更为明显的优化效果。

关键词: 稀疏矩阵, 不规则矩阵, 向量乘法, 多核性能, 性能优化

Abstract: Sparse matrix-vector multiplication (SpMV) is one of the key operators in the field of high performance computing and also has significant applications in emerging deep learning domains. Existing  research on SpMV often focuses on square sparse matrices, while there is still a lack of in-depth exploration for irregularly shaped sparse matrices (with unequal numbers of rows and columns). The characteristic of unequal numbers of rows and columns results in different storage features for these sparse matrices compared to square sparse matrices, leaving room for further optimization. Therefore, this paper establishes an SpMV performance model for irregularly shaped sparse matrices with unequal rows and columns, and analyzes that the performance bottleneck is caused by insufficient bandwidth for data exchange between cache and memory. At the same time, this paper carried out the following two optimization tasks: (1) Based on the commonly used CSR storage format for sparse matrices, a new RCSR storage format is proposed, which transforms and compresses a performance-limiting array in the CSR storage format, making SpMV more efficient; (2) An optimized SpMV algorithm based on the RCSR format is designed in conjunction with the SIMD instruction set extension of domestic processors. This paper tests regular and irregular sparse matrices on domestic Phytium processors. For regular sparse matrices, the comprehensive application of RCSR storage format, SIMD instructions, and OpenMP parallelization technology increases GFLOPS by 83.35% on average. For irregular sparse matrices, the performance improvement is related to the row-to-column ratio, and when the row-to- column ratio is not equal, the optimization effect is more obvious.

Key words: sparse matrix, irregular matrix, vector multiplication, multicore performance, performance optimization