High performance Cholesky factorization on emerging GPU architectures using Tensor Cores

Computer Engineering & Science, 2025, Vol. 47, Issue (7): 1170-1180.

High Performance Computing


SHI Lu, ZOU Gaoyuan, WU Siqi, ZHANG Shaoshuai

(School of Computer Science and Engineering (School of Cyberspace Science and Technology),
University of Electronic Science and Technology of China, Chengdu 611731, China)

Received: 2024-11-04  Revised: 2024-12-03  Online: 2025-07-25  Published: 2025-08-25

Abstract

General matrix-matrix multiplications (GEMMs) can be executed with highly optimized performance on Tensor Cores. However, because of its limited parallelism, existing implementations of Cholesky factorization fall well short of the peak performance of Tensor Cores. This paper studies a recursive Cholesky factorization algorithm that recursively subdivides the diagonal blocks, generating a large number of GEMM operations between off-diagonal blocks. As a result, the internal symmetric rank-k update (SYRK) and triangular solve with multiple right-hand sides (TRSM) operations attain a higher fraction of the Tensor Core peak performance. Experimental results show that the proposed recursive Cholesky factorization algorithm achieves speedups of 1.72× and 1.62× over the MAGMA/cuSOLVER implementations in FP32 and FP16, respectively.
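The recursion described above can be sketched as follows. This is a minimal pure-Python illustration written for this page, not the authors' GPU implementation: the function names potrf, trsm, syrk and gemm and the leaf size BASE are hypothetical, and the plain loops stand in for the Tensor Core GEMM kernels (such as those in cuBLAS) that a real implementation would call. Splitting a diagonal block as A = [[A11, A21^T], [A21, A22]] gives L11 = chol(A11), L21 = A21 L11^-T (TRSM), A22 ← A22 - L21 L21^T (SYRK) and L22 = chol(A22). Recursing inside TRSM and SYRK as well turns most of the floating-point work into GEMMs on off-diagonal blocks.

```python
import math

BASE = 2  # hypothetical leaf size; a GPU code would use a much larger tile


def gemm(M, ci, cj, m, n, ai, aj, bi, bj, k):
    """C -= A @ B^T on sub-blocks of M (the step a GPU code would hand to a GEMM kernel).

    C = M[ci:ci+m, cj:cj+n], A = M[ai:ai+m, aj:aj+k], B = M[bi:bi+n, bj:bj+k].
    """
    for i in range(m):
        for j in range(n):
            s = 0.0
            for p in range(k):
                s += M[ai + i][aj + p] * M[bi + j][bj + p]
            M[ci + i][cj + j] -= s


def syrk(M, r, n, c, k):
    """Lower triangle of C = M[r:r+n, r:r+n] -= P @ P^T, where P = M[r:r+n, c:c+k]."""
    if n <= BASE:
        for i in range(n):
            for j in range(i + 1):
                M[r + i][r + j] -= sum(M[r + i][c + p] * M[r + j][c + p] for p in range(k))
        return
    n1 = n // 2
    n2 = n - n1
    syrk(M, r, n1, c, k)                            # C11 -= P1 P1^T
    gemm(M, r + n1, r, n2, n1, r + n1, c, r, c, k)  # C21 -= P2 P1^T (off-diagonal GEMM)
    syrk(M, r + n1, n2, c, k)                       # C22 -= P2 P2^T


def trsm(M, r, m, d, n):
    """Overwrite B = M[r:r+m, d:d+n] with X solving X L^T = B, L = lower M[d:d+n, d:d+n]."""
    if n <= BASE:
        for i in range(m):
            for j in range(n):
                s = M[r + i][d + j]
                for p in range(j):
                    s -= M[r + i][d + p] * M[d + j][d + p]
                M[r + i][d + j] = s / M[d + j][d + j]
        return
    n1 = n // 2
    n2 = n - n1
    trsm(M, r, m, d, n1)                            # X1 L11^T = B1
    gemm(M, r, d + n1, m, n2, r, d, d + n1, d, n1)  # B2 -= X1 L21^T (GEMM)
    trsm(M, r, m, d + n1, n2)                       # X2 L22^T = B2


def potrf(M, off, n):
    """Recursive lower Cholesky of the diagonal block M[off:off+n, off:off+n], in place."""
    if n <= BASE:
        for j in range(n):
            s = M[off + j][off + j] - sum(M[off + j][off + p] ** 2 for p in range(j))
            M[off + j][off + j] = math.sqrt(s)
            for i in range(j + 1, n):
                t = M[off + i][off + j]
                t -= sum(M[off + i][off + p] * M[off + j][off + p] for p in range(j))
                M[off + i][off + j] = t / M[off + j][off + j]
        return
    n1 = n // 2
    n2 = n - n1
    potrf(M, off, n1)               # L11 = chol(A11)
    trsm(M, off + n1, n2, off, n1)  # L21 = A21 L11^-T
    syrk(M, off + n1, n2, off, n1)  # A22 -= L21 L21^T
    potrf(M, off + n1, n2)          # L22 = chol(A22)


def cholesky(A):
    """Return lower-triangular L with A = L L^T for a symmetric positive definite A."""
    n = len(A)
    M = [[float(x) for x in row] for row in A]
    potrf(M, 0, n)
    for i in range(n):  # only the lower triangle was referenced; clear the rest
        for j in range(i + 1, n):
            M[i][j] = 0.0
    return M


print(cholesky([[4, 12, -16], [12, 37, -43], [-16, -43, 98]]))
# prints [[2.0, 0.0, 0.0], [6.0, 1.0, 0.0], [-8.0, 5.0, 3.0]]
```

In this sketch the largest GEMMs occur at the top levels of the recursion, where the off-diagonal blocks are biggest; the leaf size trades recursion overhead against the cost of the small unblocked kernels on the diagonal.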

Key words: Cholesky factorization, high performance computing, numerical linear algebra, general-purpose computing on graphics processing units (GPGPU)

SHI Lu, ZOU Gaoyuan, WU Siqi, ZHANG Shaoshuai. High performance Cholesky factorization on emerging GPU architectures using Tensor Cores[J]. Computer Engineering & Science, 2025, 47(7): 1170-1180.
