• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (06): 984-992.

• High Performance Computing •

Dense linear solver on many-core CPUs: Characterization and optimization

FU Xiao1, SU Xing1, DONG De-zun1, QIAN Cheng-dong2

  (1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073;
  2. Phytium Technology Co., Ltd., Tianjin 300459, China)
  • Received: 2023-09-25 Revised: 2023-11-22 Accepted: 2024-06-25 Online: 2024-06-25 Published: 2024-06-17

Abstract: The dense linear solver plays a vital role in high-performance computing and machine learning. Typical parallel implementations are built on either the fork-join or the task-based programming model. Mainstream dense linear algebra libraries that adopt the fork-join paradigm can shift most of the computation to well-tuned, high-performance BLAS 3 routines, but they fail to exploit many-core CPUs efficiently because of the rigid execution stream of fork-join. Open-source implementations that employ the task-based paradigm deliver more promising performance thanks to the model's malleability and better load balance, yet they still leave considerable room for optimization on many-core platforms, especially for medium-sized matrices. In this paper, a quantitative characterization of the dense linear solver is carried out to locate performance bottlenecks, and a series of optimizations is proposed to deliver higher performance. Specifically, LU factorization is merged with the subsequent lower triangular solve, which improves parallelism and reduces the number of idle threads. Moreover, duplicated matrix packing operations are eliminated to lower memory overhead. Performance is evaluated on two modern many-core platforms, Intel Xeon Gold 6252N (48 cores) and HiSilicon Kunpeng 920 (64 cores). The results show that the optimized solver outperforms the state-of-the-art open-source implementation by up to 10.05% on the Xeon and up to 13.63% on the Kunpeng 920.

Key words: dense linear solver, LU factorization, fork-join model, task-based model, many-core CPU
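
Illustrative note: the abstract describes merging LU factorization with the lower triangular solve that follows it. The sketch below shows the same idea at the scalar level only: the forward substitution (L y = b) is folded into the elimination loop instead of running as a separate pass. It is a sequential, unblocked teaching example and not the authors' task-based implementation; the function name `solve_fused` and the use of partial pivoting are assumptions made for this illustration.

```python
def solve_fused(a, b):
    """Solve a x = b by LU with partial pivoting, updating b during elimination.

    Hypothetical helper for illustration: folding the forward substitution
    into the factorization loop removes the separate lower triangular pass.
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on copies, leave the inputs untouched
    y = list(b)
    for k in range(n):
        # Partial pivoting: choose the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            y[k], y[p] = y[p], y[k]
        for i in range(k + 1, n):
            m = lu[i][k] / lu[k][k]
            lu[i][k] = m  # store the multiplier (entry of L)
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]  # update the trailing submatrix
            y[i] -= m * y[k]  # fused forward-substitution step
    # Back substitution: solve U x = y.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = y[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / lu[i][i]
    return x


if __name__ == "__main__":
    a = [[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]]
    b = [5.0, -2.0, 9.0]
    print(solve_fused(a, b))
```

In a blocked, task-based solver the analogous step would apply to tiles rather than scalars, so that triangular-solve tasks can start as soon as the panels they depend on are factored; how the paper schedules those tasks is not specified in the abstract.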