• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (3): 400-411.

• High Performance Computing •

Optimization of MPI_Barrier based on the offloading characteristics of Tianhe-2

ZHU Qi1,2,3, DAI Yi1, PENG Jintao1,2,3, XIE Min1, LIANG Chongshan1, LIU Peng1,2,3, YANG Bo1, LIU Jie1,2,3

  (1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073;
   2. Hunan Key Laboratory of Digitizing Software for Frontier Equipment, National University of Defense Technology, Changsha 410073;
   3. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China)

  • Received: 2023-12-26 Revised: 2024-03-22 Online: 2025-03-25 Published: 2025-04-01
  • Funding:
    National Natural Science Foundation of China (62272476); National Key Research and Development Program of China (2021YFB0300101); Key Program of the National Natural Science Foundation of China (U22B2005); Fund of the State Key Laboratory of Parallel and Distributed Processing (2021-KJWPDL-08)

Abstract: Barrier is a fundamental operation in message passing interface (MPI) programs and one of the key mechanisms for ensuring correct program execution. Existing Barrier implementations have two main shortcomings: inter-node synchronization incurs substantial redundant data-path transmission overhead, and intra-node synchronization suffers from frequent cache misses. To address these limitations, this paper exploits the collective communication offload features of TH-Express, the customized network of Tianhe-2, and proposes two optimization techniques: Barrier acceleration based on the GLEX NIC, and rearrangement of shared-memory flags. Together, these techniques reduce inter-node synchronization overhead and improve the efficiency of shared-memory-based intra-node synchronization. Building on them, the paper redesigns the MPI_Barrier algorithm and integrates it into the MPI communication library. The optimized implementation is evaluated with micro-benchmarks and real applications at the National Supercomputing Center in Changsha, at scales of up to 7168 nodes. The results show that the optimized MPI_Barrier collective operation achieves a speedup of 1.3 to 14.5 times, and that performance under real application-level workloads improves by up to 54%.

Key words: message passing interface (MPI), Barrier, massively parallel applications, NIC collective communication offloading
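
The abstract attributes part of the intra-node cost to cache misses on shared-memory flags, but it does not describe the rearrangement itself. As a rough illustration of why flag layout can matter, the Python sketch below counts how many per-process flags fall on the same cache line under two generic layouts: tightly packed flags and flags spaced one cache line apart. The 64-byte cache line, the 4-byte flag, the process count of 24, and all function names are assumptions made for this example; this is not the layout used in the paper.

```python
from typing import Dict, List

# Illustrative assumptions only; none of these values come from the paper.
CACHE_LINE = 64  # assumed cache-line size in bytes
FLAG_SIZE = 4    # assumed size of one per-process flag in bytes


def cache_line_of(rank: int, stride: int) -> int:
    """Index of the cache line holding the flag of `rank`
    when consecutive flags are `stride` bytes apart."""
    return (rank * stride) // CACHE_LINE


def lines_shared(num_procs: int, stride: int) -> Dict[int, List[int]]:
    """Map each cache-line index to the ranks whose flags live on it."""
    layout: Dict[int, List[int]] = {}
    for rank in range(num_procs):
        layout.setdefault(cache_line_of(rank, stride), []).append(rank)
    return layout


def max_sharers(num_procs: int, stride: int) -> int:
    """Largest number of flags that share a single cache line."""
    return max(len(ranks) for ranks in lines_shared(num_procs, stride).values())


if __name__ == "__main__":
    procs = 24  # hypothetical number of processes on one node
    print("packed layout, max flags per line:", max_sharers(procs, FLAG_SIZE))
    print("padded layout, max flags per line:", max_sharers(procs, CACHE_LINE))
```

Under these assumed sizes, a packed layout places up to 64 / 4 = 16 flags on one line, so a write to any of them invalidates that line for every other process polling it. Spacing the flags one cache line apart gives each flag its own line, at the cost of more memory.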