• A journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (06): 984-992.

• High Performance Computing •

• Funding:
  National Key R&D Program of China (2022YFB4501702); Hunan Provincial Natural Science Foundation for Distinguished Young Scholars (2021JJ10050)

Dense linear solver on many-core CPUs: Characterization and optimization

FU Xiao1, SU Xing1, DONG De-zun1, QIAN Cheng-dong2

  (1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
    2. Phytium Technology Co., Ltd., Tianjin 300459, China)
  • Received: 2023-09-25  Revised: 2023-11-22  Accepted: 2024-06-25  Online: 2024-06-25  Published: 2024-06-17


Abstract: The dense linear solver plays a vital role in high-performance computing and machine learning. Typical parallel implementations are built upon the well-known fork-join or task-based programming models. Although mainstream dense linear algebra libraries adopting the fork-join paradigm can shift most of the computation to well-tuned, high-performance BLAS 3 routines, they still fail to exploit many-core CPUs efficiently due to the rigid execution flow of fork-join. While open-source implementations employing the task-based paradigm deliver more promising performance thanks to the model's flexibility and better load balance, they still leave considerable room for optimization on many-core platforms, especially for medium-sized matrices. In this paper, a quantitative characterization of the dense linear solver is carried out to locate performance bottlenecks, and two optimizations are proposed to deliver higher performance. Specifically, thread idle time caused by synchronization is reduced by overlapping LU factorization with the subsequent lower-triangular solve, which improves parallelism; furthermore, redundant matrix packing operations are eliminated to lower memory-access overhead. Performance evaluation is conducted on two modern many-core platforms, Intel Xeon Gold 6252N (48 cores) and HiSilicon Kunpeng 920 (64 cores). Results show that the optimized solver outperforms the best open-source implementation by up to 10.05% (Xeon) and 13.63% (Kunpeng 920), respectively.
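The first optimization described above can be illustrated serially: instead of running the full LU factorization and only then starting the forward (lower-triangular) solve, the right-hand side can be carried along as an extra column of the matrix, so each elimination step applies the corresponding piece of the forward solve immediately. The sketch below is an illustrative NumPy reconstruction of this fusion idea under our own assumptions (the function name `solve_fused` is ours), not the paper's task-parallel implementation:

```python
import numpy as np

def solve_fused(A, b):
    """Solve A x = b by fusing the forward (lower-triangular) solve into
    LU factorization: eliminating the augmented matrix [A | b] applies
    the row operations of L^{-1} P to b as each pivot column is produced,
    so only the backward solve with U remains after factorization.
    Illustrative serial sketch of the overlap idea, not the paper's code."""
    n = A.shape[0]
    M = np.hstack([A.astype(float), np.asarray(b, dtype=float).reshape(-1, 1)])
    for k in range(n):
        p = k + np.argmax(np.abs(M[k:, k]))   # partial pivoting
        M[[k, p]] = M[[p, k]]                 # swap pivot row into place
        M[k+1:, k] /= M[k, k]                 # multipliers (column of L)
        # trailing update also touches the b column, i.e. the forward solve
        M[k+1:, k+1:] -= np.outer(M[k+1:, k], M[k, k+1:])
    # backward substitution with U (upper triangle of M[:, :n])
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (M[i, n] - M[i, i+1:n] @ x[i+1:]) / M[i, i]
    return x
```

In the paper's task-based setting, the same dependency structure lets solve tasks on the right-hand side run concurrently with factorization tasks on later panels, rather than leaving threads idle at the fork-join barrier between the two phases.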

Key words: dense linear solver, LU factorization, fork-join model, task-based model, many-core CPU