• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (09): 1521-1528.

• High Performance Computing •

Performance evaluation and analysis of vectorized SpMV algorithm based on scratchpad memory

ZHANG Zong-mao, DONG De-zun, WANG Zi-cong, CHANG Jun-sheng, ZHANG Xiao-yun, WANG Shao-cong

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
  • Received: 2023-10-23 Revised: 2023-11-22 Accepted: 2024-09-25 Online: 2024-09-25 Published: 2024-09-19
  • Supported by:
    Science Fund for Distinguished Young Scholars of Hunan Province (2021JJ10050); Scientific Research Program of National University of Defense Technology (ZK22-23)

Abstract: Scratchpad memory (SPM) is an on-chip high-speed memory with a simple structure, fixed access latency, and direct software control, and it is widely used in modern processor designs. Sparse matrix-vector multiplication (SpMV) is an important computational kernel in high-performance computing, artificial intelligence, and other application domains. On traditional processors with a multi-level cache hierarchy, the irregular accesses to the dense input vector during SpMV cause a large number of cache misses, which reduces execution efficiency. To evaluate how SPM affects the performance of vectorized SpMV, this paper vectorizes the SpMV algorithm based on the compressed sparse row (CSR) format using ARM's Scalable Vector Extension (SVE) instructions, stores the algorithm's hot data, namely the dense input vector, in the SPM, and analyzes the performance of the vectorized algorithm on an ARM-architecture processor integrated with SPM. Experiments were run in the gem5 simulator on 2,562 sparse matrices from real-world applications. The results show that, compared with a traditional multi-level cache processor, the processor integrated with SPM achieves a maximum speedup of 7.45 and an average speedup of 1.11 for the vectorized SpMV algorithm.

Key words: sparse matrix-vector multiplication, scratchpad memory, compressed sparse row (CSR), ARM Scalable Vector Extension (SVE)
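
For readers unfamiliar with the kernel, the CSR-based SpMV described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `spmv_csr`, the index and value types (64-bit row pointers, 32-bit column indices, double-precision values), and the loop structure are all assumptions. The SVE path is compiled only when the compiler defines `__ARM_FEATURE_SVE`; otherwise a scalar loop is used. The abstract does not say how the dense input vector is placed in the scratchpad memory, so `x` is an ordinary pointer here.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/*
 * Computes y = A * x for a sparse matrix A stored in CSR form.
 *   row_ptr[i] .. row_ptr[i+1]-1  index the nonzeros of row i
 *   col_idx[j]                    column of the j-th nonzero
 *   val[j]                        value of the j-th nonzero
 * Types and names are illustrative assumptions, not taken from the paper.
 */
void spmv_csr(int64_t n_rows, const int64_t *row_ptr, const int32_t *col_idx,
              const double *val, const double *x, double *y)
{
    assert(row_ptr && col_idx && val && x && y);

    for (int64_t i = 0; i < n_rows; ++i) {
        const int64_t begin = row_ptr[i];
        const int64_t end = row_ptr[i + 1];
        double sum = 0.0;

#if defined(__ARM_FEATURE_SVE)
        /* Vector-length-agnostic loop: the predicate masks the row tail. */
        svfloat64_t acc = svdup_f64(0.0);
        for (int64_t j = begin; j < end; j += (int64_t)svcntd()) {
            svbool_t pg = svwhilelt_b64(j, end);
            svfloat64_t a = svld1_f64(pg, &val[j]);
            /* Load 32-bit column indices, sign-extended to 64 bits. */
            svint64_t idx = svld1sw_s64(pg, &col_idx[j]);
            /* Irregular gather from x: the access pattern that misses in
               cache and that the paper serves from scratchpad memory. */
            svfloat64_t xv = svld1_gather_s64index_f64(pg, x, idx);
            /* acc += a * xv on active lanes; inactive lanes keep acc. */
            acc = svmla_f64_m(pg, acc, a, xv);
        }
        sum = svaddv_f64(svptrue_b64(), acc);
#else
        /* Scalar fallback with the same irregular access to x. */
        for (int64_t j = begin; j < end; ++j) {
            sum += val[j] * x[col_idx[j]];
        }
#endif
        y[i] = sum;
    }
}
```

The matrix arrays `val` and `col_idx` are streamed once with unit stride, while `x` is read through `col_idx`, so its reuse and locality depend on the sparsity pattern. That is why the dense input vector is the natural candidate for software-managed on-chip storage.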