• A journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (7): 1170-1180.

• High Performance Computing •

High performance Cholesky factorization on emerging GPU architectures using Tensor Cores

SHI Lu, ZOU Gaoyuan, WU Siqi, ZHANG Shaoshuai

  1. (School of Computer Science and Engineering (School of Cyberspace Science and Technology),
    University of Electronic Science and Technology of China, Chengdu 611731, China)
  • Received: 2024-11-04  Revised: 2024-12-03  Online: 2025-07-25  Published: 2025-08-25

Abstract: General matrix-matrix multiplications (GEMMs) can be highly optimized on Tensor Cores. However, existing implementations of Cholesky factorization fail to reach most of the peak performance of Tensor Cores because of the factorization's limited parallelism. This paper studies a recursive Cholesky factorization algorithm that recursively subdivides the diagonal blocks, converting the internal symmetric rank-K update (SYRK) and triangular solve (TRSM) operations into a large number of GEMMs between off-diagonal blocks and thereby extracting a higher proportion of the peak performance of Tensor Cores. Experimental results show that the proposed recursive Cholesky factorization algorithm achieves speedups of 1.72× and 1.62× over the MAGMA/cuSOLVER implementations in FP32 and FP16, respectively.
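To make the recursion concrete: partition the symmetric positive-definite matrix A into blocks [[A11, A21^T], [A21, A22]], factor the diagonal block A11 recursively, solve the panel L21 = A21 · L11^{-T}, update A22 − L21 · L21^T, and recurse on the result; the panel solve and trailing update are then expressed as plain matrix-matrix products. The sketch below is a minimal NumPy illustration of this recursive formulation, not the authors' Tensor Core implementation; the leaf_size cutoff and the use of a general solve in place of a true TRSM are assumptions made for readability.

```python
import numpy as np

def recursive_cholesky(A, leaf_size=128):
    """Lower-triangular Cholesky factor of an SPD matrix A.

    Minimal sketch of the recursive formulation: the diagonal block is
    subdivided recursively, so most of the floating-point work ends up in
    plain matrix-matrix products (the GEMM-friendly shape that Tensor
    Cores favor). `leaf_size` is an assumed cutoff, not a value from the
    paper.
    """
    n = A.shape[0]
    if n <= leaf_size:
        return np.linalg.cholesky(A)            # small POTRF at the leaf

    k = n // 2
    A11, A21, A22 = A[:k, :k], A[k:, :k], A[k:, k:]

    L11 = recursive_cholesky(A11, leaf_size)    # recurse on the diagonal block
    # Panel solve (TRSM): L21 * L11^T = A21  =>  L21 = A21 * L11^{-T}
    L21 = np.linalg.solve(L11, A21.T).T
    # Trailing update (SYRK written as a GEMM): A22 <- A22 - L21 * L21^T
    S = A22 - L21 @ L21.T
    L22 = recursive_cholesky(S, leaf_size)      # recurse on the updated block

    L = np.zeros_like(A)
    L[:k, :k], L[k:, :k], L[k:, k:] = L11, L21, L22
    return L
```

As a quick sanity check, building a random SPD matrix A = B @ B.T + n * np.eye(n) and verifying np.allclose(L @ L.T, A) confirms the recursion reproduces the standard factorization.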


Key words: Cholesky factorization, high performance computing, numerical linear algebra, general-purpose computing on graphics processing units (GPGPU)