
J4 ›› 2012, Vol. 34 ›› Issue (7): 78-83.



Optimization of Sparse Diagonal Matrix-Vector Multiplication Based on the CUDA Programming Model

QIN Jin, GONG Chunye, HU Qingfeng, LIU Jie

  1. (School of Computer Science, National University of Defense Technology, Changsha 410073, China)
  • Received: 2010-05-26  Revised: 2010-08-20  Online: 2012-07-25  Published: 2012-07-25
  • Supported by:
    The National Natural Science Foundation of China (60673150, 60970033); the National 863 Program of China (2008AA01Z137)


Abstract:

Sparse matrixvector multiplication is often an important computational kernel in many scientific applications. This paper faces the ndiagonal sparse matrix, uses the CUDA program model and describes a new compress format of sparse matrix based on the DIA compress format (CDIA), and gives each thread finegrained task distribution. In order to fulfill the characteristics of the align access of memory in CUDA, we transpose the compress matrix and design a  finegrained algorithm and program and do  some optimization to the program. In the data experiment, our best implementation achieves up to 39.6Gflop/s in singleprecision and 19.6Gflop/s in doubleprecision, and enhances the performance by about 7.6% and 17.4% that of Nathan Bell’s and Michael Garland’s respectively.

Key words: GPU; CDIA; CUDA; sparse matrix-vector multiplication
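
The paper's actual CDIA data layout and kernel are not reproduced on this abstract page, so the sketch below is only a minimal CUDA illustration of the idea the abstract describes, assuming a DIA-style format whose diagonal data are stored transposed (element (row, d) at data[d * num_rows + row]) and one compute thread per matrix row. The kernel name and parameters (spmv_cdia_kernel, num_diags, offsets, data) are illustrative assumptions, not taken from the paper.

// One thread computes one row of y = A*x (the fine-grained task
// distribution mentioned in the abstract). Because the diagonal data
// are stored transposed, threads with consecutive row indices read
// consecutive addresses of `data`, so the loads can be coalesced.
__global__ void spmv_cdia_kernel(int num_rows, int num_cols, int num_diags,
                                 const int   *offsets,  // offset of each stored diagonal
                                 const float *data,     // transposed diagonal data, num_diags * num_rows
                                 const float *x,        // input vector
                                 float       *y)        // output vector
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int d = 0; d < num_diags; d++) {
            int col = row + offsets[d];                    // column hit by diagonal d in this row
            if (col >= 0 && col < num_cols)
                dot += data[d * num_rows + row] * x[col];  // contiguous across a warp
        }
        y[row] = dot;
    }
}

A launch such as spmv_cdia_kernel<<<(num_rows + 255) / 256, 256>>>(...) gives the one-row-per-thread assignment; if the diagonal data were instead kept in the untransposed row-major DIA layout (data[row * num_diags + d]), neighbouring threads would stride through memory and lose coalescing, which is the motivation the abstract gives for transposing the compressed matrix.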