Optimization of MPI_Barrier based on the offloading characteristics of Tianhe-2

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (3): 400-411.

• High Performance Computing •

ZHU Qi (1,2,3), DAI Yi (1), PENG Jintao (1,2,3), XIE Min (1), LIANG Chongshan (1), LIU Peng (1,2,3), YANG Bo (1), LIU Jie (1,2,3)

(1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073;
2. Hunan Key Laboratory of Digitizing Software for Frontier Equipment, National University of Defense Technology, Changsha 410073;
3. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China)

Received: 2023-12-26  Revised: 2024-03-22  Online: 2025-03-25  Published: 2025-04-01

Abstract

Barrier is a fundamental operation in message passing interface (MPI) programs and one of the critical mechanisms for ensuring correct program execution. Existing Barrier implementations suffer from two main defects: first, inter-node synchronization incurs significant overhead from redundant data-path transmission; second, intra-node synchronization causes numerous cache misses. To address these limitations, this paper proposes two optimization techniques tailored to the collective communication offload features of TH-Express, the customized network of Tianhe-2: Barrier acceleration based on the GLEX NIC, and rearrangement of shared-memory flag bits. The first reduces synchronization overhead between nodes; the second improves the efficiency of shared-memory synchronization within a node. Building on these techniques, the paper redesigns the MPI_Barrier algorithm and integrates it into the MPI communication library. The proposed scheme is evaluated with micro-benchmarks and real applications at the National Supercomputing Center in Changsha, at a scale of up to 7168 nodes. Experimental results show that the optimized MPI_Barrier collective operation achieves a speedup of 1.3x to 14.5x, and that application-level evaluations under real workloads show a performance improvement of up to 54%.
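The abstract does not describe the exact flag layout, so the following is only a minimal sketch of the general idea behind cache-aware flag placement for an intra-node barrier, not the authors' scheme. It assumes a 64-byte cache line, uses threads in place of MPI processes attached to a shared-memory segment, and all names (`padded_flag_t`, `flag_barrier_t`, `flag_barrier_wait`, `run_barrier_demo`) are hypothetical. Each rank owns one arrival flag placed alone on its own cache line, so a rank updating its flag does not invalidate the line holding another rank's flag.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdalign.h>
#include <stdatomic.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */
#define MAX_RANKS  16   /* hypothetical upper bound on ranks per node */

/* One flag per cache line: alignment pads the struct to CACHE_LINE bytes. */
typedef struct { alignas(CACHE_LINE) atomic_int value; } padded_flag_t;
static_assert(sizeof(padded_flag_t) == CACHE_LINE, "flag must fill one line");

typedef struct {
    padded_flag_t arrive[MAX_RANKS]; /* arrive[i]: written by rank i only */
    padded_flag_t release;           /* written by rank 0, read by the rest */
    int nranks;
} flag_barrier_t;

/* Epoch-based barrier: rank 0 gathers all arrivals, then releases everyone. */
static void flag_barrier_wait(flag_barrier_t *b, int rank, int epoch)
{
    if (rank == 0) {
        for (int i = 1; i < b->nranks; i++)
            while (atomic_load_explicit(&b->arrive[i].value,
                                        memory_order_acquire) < epoch)
                sched_yield();
        atomic_store_explicit(&b->release.value, epoch, memory_order_release);
    } else {
        atomic_store_explicit(&b->arrive[rank].value, epoch,
                              memory_order_release);
        while (atomic_load_explicit(&b->release.value,
                                    memory_order_acquire) < epoch)
            sched_yield();
    }
}

typedef struct {
    flag_barrier_t *barrier;
    atomic_int *counter;
    atomic_int *violations;
    int rank, rounds;
} worker_arg_t;

/* Each round: bump a shared counter, synchronize, check everyone arrived. */
static void *worker(void *p)
{
    worker_arg_t *a = p;
    int epoch = 0;
    for (int r = 0; r < a->rounds; r++) {
        atomic_fetch_add(a->counter, 1);
        flag_barrier_wait(a->barrier, a->rank, ++epoch);
        if (atomic_load(a->counter) != a->barrier->nranks * (r + 1))
            atomic_fetch_add(a->violations, 1);
        /* Second barrier keeps fast ranks from starting the next round early. */
        flag_barrier_wait(a->barrier, a->rank, ++epoch);
    }
    return NULL;
}

/* Returns the number of observed barrier violations, or -1 on bad input. */
int run_barrier_demo(int nranks, int rounds)
{
    if (nranks < 1 || nranks > MAX_RANKS) return -1;
    flag_barrier_t barrier;
    barrier.nranks = nranks;
    for (int i = 0; i < MAX_RANKS; i++)
        atomic_init(&barrier.arrive[i].value, 0);
    atomic_init(&barrier.release.value, 0);

    atomic_int counter, violations;
    atomic_init(&counter, 0);
    atomic_init(&violations, 0);

    pthread_t tid[MAX_RANKS];
    worker_arg_t args[MAX_RANKS];
    for (int i = 0; i < nranks; i++) {
        args[i] = (worker_arg_t){ &barrier, &counter, &violations, i, rounds };
        pthread_create(&tid[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < nranks; i++)
        pthread_join(tid[i], NULL);
    return atomic_load(&violations);
}
```

Calling `run_barrier_demo(4, 200)` runs four simulated ranks through 200 rounds and should report zero violations if the barrier holds. Packing all flags into adjacent bytes of a single line would be functionally equivalent but would make every arrival contend for the same cache line, which is the kind of intra-node cost the abstract attributes to existing implementations.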

Key words: message passing interface (MPI), Barrier, massively parallel applications, NIC collective communication offloading

ZHU Qi, DAI Yi, PENG Jintao, XIE Min, LIANG Chongshan, LIU Peng, YANG Bo, LIU Jie. Optimization of MPI_Barrier based on the offloading characteristics of Tianhe-2[J]. Computer Engineering & Science, 2025, 47(3): 400-411.
