• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2020, Vol. 42 ›› Issue (10, High Performance Computing Special Issue): 1801-1806.

• High Performance Computer System Software •

An automated monitoring system for large-scale supercomputers

YANG Jie, ZENG Ling-bo, PENG Yun-yong, JIANG Qian-qian, DU Liang

  1. (National Supercomputing Center in Guangzhou, Sun Yat-sen University, Guangzhou 510000, China)

  • Received: 2020-06-09 Revised: 2020-07-15 Accepted: 2020-10-25 Online: 2020-10-25 Published: 2020-10-23

Abstract: The number of nodes in large-scale cluster systems keeps growing and their internal structure is becoming increasingly complex, which puts growing pressure on cluster availability and stability. To address the availability and stability problems of large-scale clusters, as well as the difficulty of system management, operation, and maintenance, an automated monitoring system for large-scale clusters is implemented. The system is deployed on a large-scale cluster system. It collects monitoring data from each cluster component and processes that data with microservices, thereby achieving real-time monitoring of the cluster components.

Key words: large-scale, cluster, monitoring, microservices