  • Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (08): 1366-1375.

• High Performance Computing •

Monitoring subsystem for exascale HPC systems: Challenges and design

YUAN Yuan (袁远), LI Shi-jie (李世杰), XING Jian-ying (邢建英), JIANG Ju-ping (蒋句平)

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)

  • Received: 2020-06-17 Revised: 2020-09-18 Accepted: 2021-08-25 Online: 2021-08-25 Published: 2021-08-24
  • Funding:
    National Key Research and Development Program of China (2018YFB0204301)

Abstract: As the assembly density of exascale high-performance computing (HPC) systems multiplies and node counts keep growing, the monitoring subsystem faces major challenges in scalability, reliability, serviceability, and efficient operation and maintenance. In response to these challenges, this paper presents the design of the monitoring subsystem from four aspects: architecture, network, functionality, and operation and maintenance. A prototype system verifies the feasibility and advantages of some of these designs, which can substantially support the construction of future exascale HPC systems.

Key words: exascale high-performance computer system, monitoring subsystem, scalability, reliability