• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2020, Vol. 42 ›› Issue (10, High-Performance Computing special issue): 1801-1806.

An automated monitoring system for large-scale supercomputers

YANG Jie, ZENG Ling-bo, PENG Yun-yong, JIANG Qian-qian, DU Liang

  1. (National Supercomputing Center in Guangzhou, Sun Yat-Sen University, Guangzhou 510000, China)

  • Received: 2020-06-09 Revised: 2020-07-15 Accepted: 2020-10-25 Online: 2020-10-25 Published: 2020-10-23

Abstract: The number of nodes in large-scale cluster systems keeps growing and their internal structure is becoming increasingly complex, which puts greater pressure on cluster availability and stability. To address these availability and stability problems, along with the difficulty of system management, operation, and maintenance, an automated monitoring system for large-scale clusters is implemented. The system is deployed on a large-scale cluster system. By collecting monitoring data from each cluster component and processing that data with microservices, it achieves real-time monitoring of the cluster components.



Key words: large-scale, cluster, monitor, microservices