MIMD众核架构ILU分解并行算法优化研究

计算机工程与科学 ›› 2025, Vol. 47 ›› Issue (9): 1544-1554.

MIMD众核架构ILU分解并行算法优化研究

石永振1,2，莫淏天1,2，胡星宇1,2，刘杰1,2，王庆林1,2

(1.国防科技大学高端装备数字化软件湖南省重点实验室,湖南长沙 410073；
2.国防科技大学并行与分布计算全国重点实验室,湖南长沙 410073)

收稿日期:2024-05-21 修回日期:2024-09-15 出版日期:2025-09-25 发布日期:2025-09-22
基金资助:
国家重点研发计划（2023YFA1011704，2021YFBO300101)

Optimization of ILU decomposition parallel algorithm on MIMD many-core architecture

SHI Yongzhen1,2,MO Haotian1,2,HU Xingyu1,2,LIU Jie1,2,WANG Qinglin1,2

(1.Laboratory of Digitizing Software for Frontier Equipment,National University of Defense Technology,Changsha 410073;
2.National Key Laboratory of Parallel and Distributed Computing,
National University of Defense Technology,Changsha 410073,China)

Received:2024-05-21 Revised:2024-09-15 Online:2025-09-25 Published:2025-09-22

摘要/Abstract

摘要： ILU分解被广泛应用于求解大规模稀疏线性系统，能够有效减少迭代次数、提高求解效率，但限于线性系统的数据依赖性和分解过程中计算访存的不规则，较难进行高效的并行优化。多指令多数据(MIMD)众核架构中众多并行计算线程可以执行不同的指令，对于控制流不规则的算法具有天然的适应性。基于MIMD众核架构PEZY-SC3s处理器开展ILU分解并行算法优化研究，提出了一种面向MIMD架构的ILU并行算法，并采用基于图着色的并行性优化、基于向量单元的访存优化、基于线程分组的负载平衡优化以及基于片上局部存储的数据局部性优化等措施来优化算法性能。实验结果表明，所提ILU并行分解算法与Intel Xeon 4314 CPU上MKL实现和NVIDIA A30 GPU上cuSPARSE实现相比，分别获得了16.70与1.39的平均加速比。

关键词: ILU分解, MIMD众核架构, 并行计算

Abstract: ILU (Incomplete LU) factorization is widely used in solving large-scale sparse linear systems. It can effectively reduce the number of iterations and improve solving efficiency. However, due to the data dependence of linear systems and the irregularity of computation and memory access during the decomposition process, it is difficult to perform efficient parallel optimization. In the multiple instruction multiple data (MIMD) many-core architecture, numerous parallel computing threads can execute different instructions, which has a natural adaptability to algorithms with irregular control flow. This paper conducts research on the parallel algorithm optimization of ILU factorization on the MIMD many-core architecture PEZY-SC3s processor, proposes an ILU parallel algorithm for the MIMD architecture, and adopts measures such as graph coloring-based parallelism optimization, vector unit-based memory access optimization, thread grouping-based load balancing optimization, and on-chip local storage-based data locality optimization to optimize the algorithm performance. Experimental results show that the proposed ILU parallel factorization algorithm achieves an average speedup of 16.70 and 1.39 compared with the MKL implementation on Intel Xeon 4314 CPU and the cuSPARSE implementation on NVIDIA A30 GPU, respectively.

Key words: incomplete LU factorization, MIMD many-core architecture, parallel computing

石永振1, 2, 莫淏天1, 2, 胡星宇1, 2, 刘杰1, 2, 王庆林1, 2. MIMD众核架构ILU分解并行算法优化研究[J]. 计算机工程与科学, 2025, 47(9): 1544-1554.

SHI Yongzhen1, 2, MO Haotian1, 2, HU Xingyu1, 2, LIU Jie1, 2, WANG Qinglin1, 2. Optimization of ILU decomposition parallel algorithm on MIMD many-core architecture[J]. Computer Engineering & Science, 2025, 47(9): 1544-1554.