• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science, 2021, Vol. 43, Issue (08): 1366-1375.

Monitoring subsystem for exascale HPC systems: Challenges and design

YUAN Yuan, LI Shi-jie, XING Jian-ying, JIANG Ju-ping

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)

  • Received: 2020-06-17 Revised: 2020-09-18 Accepted: 2021-08-25 Online: 2021-08-25 Published: 2021-08-24

Abstract: High-performance computer (HPC) systems built for future exascale computing require a severalfold increase in assembly density, along with a large expansion in node scale. This poses major challenges for the HPC monitoring subsystem in terms of scalability, reliability, serviceability, and maintenance. In response to these challenges, this paper presents design ideas for the monitoring subsystem from four aspects: architecture, network, functionality, and maintenance. It verifies the feasibility and advantages of some of these designs on a prototype system, and the results can significantly benefit the construction of future exascale HPC systems.


Key words: exascale high-performance computer system, monitoring subsystem, scalability, reliability