Optimization of MPI_Barrier based on the offloading characteristics of Tianhe-2

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (3): 400-411.

ZHU Qi1,2,3, DAI Yi1, PENG Jintao1,2,3, XIE Min1, LIANG Chongshan1, LIU Peng1,2,3, YANG Bo1, LIU Jie1,2,3

(1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073;
2. Hunan Key Laboratory of Digitizing Software for Frontier Equipment, National University of Defense Technology, Changsha 410073;
3. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China)

Received: 2023-12-26 Revised: 2024-03-22 Online: 2025-03-25 Published: 2025-04-01
Funding:
National Natural Science Foundation of China (62272476); National Key R&D Program of China (2021YFBO300101); Key Program of the National Natural Science Foundation of China (U22B2005); Fund of the State Key Laboratory of Parallel and Distributed Processing (2021-KJWPDL-08)

Abstract

Abstract: Barrier is a fundamental operation in message passing interface (MPI) programs and one of the key mechanisms for ensuring correct program execution. Existing Barrier implementations suffer from two main defects: first, inter-node synchronization incurs substantial redundant data-path transmission overhead; second, intra-node synchronization causes a large number of cache misses. To address these performance limitations, this paper proposes two optimization techniques that exploit the collective communication offload feature of TH-Express, the customized network of Tianhe-2: Barrier acceleration based on the GLEX NIC, and rearrangement of shared-memory flags. These techniques reduce inter-node synchronization overhead and improve the efficiency of intra-node synchronization based on shared memory. Building on these optimizations, the paper redesigns the MPI_Barrier algorithm and integrates it into the MPI communication library. The proposed optimizations are evaluated at the National Supercomputing Center in Changsha with micro-benchmarks and real applications, at a scale of up to 7168 nodes. Experimental results show that the optimized MPI_Barrier collective operation achieves a speedup of 1.3 to 14.5 times, and that performance in application-level evaluations with real workloads improves by up to 54%.

Key words: message passing interface (MPI), Barrier, massively parallel applications, NIC collective communication offloading
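
The abstract attributes intra-node cache misses to the way shared-memory flags are laid out, but this page does not describe the rearrangement itself. As a rough illustration of the general idea only, the following sketch compares two hypothetical flag layouts and reports how many ranks end up sharing a cache line. The 64-byte line size, the 4-byte flag size, and the helper names `cache_line_of` and `lines_shared` are all assumptions made for this example and are not taken from the paper.

```python
# Illustrative sketch only: the layouts below are assumptions,
# not the flag arrangement used in the paper.
CACHE_LINE = 64   # bytes; assumed cache-line size
FLAG_SIZE = 4     # bytes; assumed size of one per-rank flag


def cache_line_of(rank, stride):
    """Index of the cache line holding the flag of `rank`
    when consecutive flags are `stride` bytes apart."""
    return (rank * stride) // CACHE_LINE


def lines_shared(num_ranks, stride):
    """Map each cache line index to the ranks whose flags live on it."""
    layout = {}
    for rank in range(num_ranks):
        layout.setdefault(cache_line_of(rank, stride), []).append(rank)
    return layout


packed = lines_shared(16, FLAG_SIZE)    # flags stored back to back
padded = lines_shared(16, CACHE_LINE)   # one flag per cache line

# Number of cache lines used, and the most ranks sharing one line.
print(len(packed), max(len(r) for r in packed.values()))  # 1 16
print(len(padded), max(len(r) for r in padded.values()))  # 16 1
```

In the packed layout, all 16 flags sit on a single cache line, so a write by any rank invalidates the line for every other rank polling it. The padded layout trades memory for isolation: each rank's flag has its own line.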

ZHU Qi, DAI Yi, PENG Jintao, XIE Min, LIANG Chongshan, LIU Peng, YANG Bo, LIU Jie. Optimization of MPI_Barrier based on the offloading characteristics of Tianhe-2[J]. Computer Engineering & Science, 2025, 47(3): 400-411.
