• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (3): 400-411.

• High Performance Computing •

Optimization of MPI_Barrier based on the offloading characteristics of Tianhe-2

ZHU Qi [1,2,3], DAI Yi [1], PENG Jintao [1,2,3], XIE Min [1], LIANG Chongshan [1], LIU Peng [1,2,3], YANG Bo [1], LIU Jie [1,2,3]

  1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
  2. Hunan Key Laboratory of Digitizing Software for Frontier Equipment, National University of Defense Technology, Changsha 410073, China
  3. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
  • Received: 2023-12-26   Revised: 2024-03-22   Online: 2025-03-25   Published: 2025-04-01

Abstract: Barrier is a fundamental operation in Message Passing Interface (MPI) programs and one of the critical mechanisms for ensuring correct program execution. Existing Barrier implementations suffer from two main defects: first, inter-node synchronization incurs significant overhead from redundant data-path transmission; second, intra-node synchronization incurs numerous cache misses. To address these limitations, this paper proposes two optimization techniques tailored to the collective communication offload features of TH-Express, the customized Tianhe-2 network: Barrier acceleration based on the GLEX NIC, and rearrangement of shared-memory flag bits. Together, these techniques reduce inter-node synchronization overhead and improve the efficiency of shared-memory synchronization within nodes. Building on these optimizations, the paper redesigns the MPI_Barrier algorithm and integrates it into the MPI communication library. The scheme is evaluated with micro-benchmarks and real applications on the system at the National Supercomputing Center in Changsha, at scales of up to 7168 nodes. Experimental results show that the optimized MPI_Barrier collective operation achieves a speedup of 1.3 to 14.5 times, and that application-level performance under real workloads improves by up to 54%.

Key words: message passing interface (MPI), Barrier, massively parallel applications, NIC collective communication offloading
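
Illustrative sketch: the abstract does not describe the paper's actual flag layout, so the following is only a generic example of why flag placement matters for intra-node synchronization. It assumes a 64-byte cache line and gives each rank its own cache-line-aligned arrival flag, so that one rank's write does not invalidate the line its neighbours are polling. All names (`padded_flag_t`, `shm_barrier_t`, `shm_barrier_wait`, `MAX_RANKS`) are hypothetical, and the gather-then-release scheme is a textbook flag barrier, not the algorithm proposed in the paper.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache-line size in bytes */
#define MAX_RANKS  64 /* hypothetical upper bound on ranks per node */

/* One flag per cache line: alignment forces the struct size up to 64 bytes. */
typedef struct {
    alignas(CACHE_LINE) atomic_uint epoch;
} padded_flag_t;

static_assert(sizeof(padded_flag_t) == CACHE_LINE, "one flag per cache line");

/* In a real MPI library this struct would live in a shared-memory segment. */
typedef struct {
    padded_flag_t arrive[MAX_RANKS]; /* written by one rank, read by the leader */
    padded_flag_t release;           /* written by the leader, read by the rest */
    int nranks;
} shm_barrier_t;

/* Index of the cache line (relative to the struct start) holding a rank's flag. */
size_t flag_line_index(int rank)
{
    return (offsetof(shm_barrier_t, arrive)
            + (size_t)rank * sizeof(padded_flag_t)) / CACHE_LINE;
}

void shm_barrier_init(shm_barrier_t *b, int nranks)
{
    b->nranks = nranks;
    for (int i = 0; i < nranks; i++)
        atomic_init(&b->arrive[i].epoch, 0u);
    atomic_init(&b->release.epoch, 0u);
}

/* Blocks until all nranks callers have arrived; returns the completed epoch. */
unsigned shm_barrier_wait(shm_barrier_t *b, int rank)
{
    /* Only this rank writes its own flag, so a relaxed read is sufficient. */
    unsigned next =
        atomic_load_explicit(&b->arrive[rank].epoch, memory_order_relaxed) + 1u;

    /* Announce arrival on this rank's private cache line. */
    atomic_store_explicit(&b->arrive[rank].epoch, next, memory_order_release);

    if (rank == 0) {
        /* Leader gathers every arrival flag, then publishes the release. */
        for (int i = 1; i < b->nranks; i++)
            while (atomic_load_explicit(&b->arrive[i].epoch,
                                        memory_order_acquire) != next)
                ; /* spin */
        atomic_store_explicit(&b->release.epoch, next, memory_order_release);
    } else {
        /* Other ranks spin on the single release flag. */
        while (atomic_load_explicit(&b->release.epoch,
                                    memory_order_acquire) != next)
            ; /* spin */
    }
    return next;
}
```

To use the sketch, each participant calls `shm_barrier_wait(&b, rank)` with its own rank after one caller has run `shm_barrier_init`. Packing the flags into adjacent bytes instead would put up to 64 of them on one line, and every arrival would then invalidate the line for all other pollers.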