• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2013, Vol. 35 ›› Issue (11): 54-61.

• Articles •

Fault monitoring and management system for multiple computing clusters

ZHANG Yi (张毅), CHEN Liang (陈良), PANG Jian (庞剑)

  1. (Computational Aerodynamics Institute, China Aerodynamics Research & Development Center, Mianyang 621000, Sichuan, China)
  • Received: 2012-09-12 Revised: 2012-11-21 Online: 2013-11-25 Published: 2013-11-25

Abstract:

As high-performance computing clusters grow in number and in node count, operating and maintaining them becomes harder and more labor-intensive. The software system presented in this paper runs across multiple Linux clusters with different hardware and software environments. It uses command-line scripts and programs to automatically monitor the important operating states and indicators of each cluster, and uses socket communication to promptly send detected fault information to a central point, the system administrator's Windows terminal. The system improves the efficiency of operation and maintenance work and shortens the response time for fault handling. It also records and manages fault event data in a database, which standardizes the fault-handling process.

Key words: cluster; fault; monitoring; management; database
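
The abstract gives no implementation details, so the following is only a minimal sketch of the general pattern it describes: a check running on a cluster node, a fault message sent over a socket to a central collector, and the event recorded in a database. Everything specific here is an assumption made for illustration and is not taken from the paper: the Python language (the paper's monitors are command-line scripts), the JSON-lines message format, SQLite as the database, and the names `check_disk`, `send_fault`, `FaultHandler`, `make_collector`, and the `faults` table.

```python
import json
import shutil
import socket
import socketserver
import sqlite3
from datetime import datetime, timezone


def check_disk(path="/", max_used_fraction=0.9):
    """Hypothetical node-side check: return a fault dict if `path` is too full."""
    usage = shutil.disk_usage(path)
    used = usage.used / usage.total
    if used > max_used_fraction:
        return {"check": "disk", "detail": f"{path} is {used:.0%} full"}
    return None


def send_fault(host, port, cluster, node, fault):
    """Send one fault event to the collector as a single JSON line over TCP."""
    event = dict(fault, cluster=cluster, node=node,
                 time=datetime.now(timezone.utc).isoformat())
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall((json.dumps(event) + "\n").encode("utf-8"))


class FaultHandler(socketserver.StreamRequestHandler):
    """Collector side: read JSON lines from one connection and store each event."""

    def handle(self):
        for line in self.rfile:
            event = json.loads(line)
            self.server.db.execute(
                "INSERT INTO faults (time, cluster, node, check_name, detail)"
                " VALUES (?, ?, ?, ?, ?)",
                (event["time"], event["cluster"], event["node"],
                 event["check"], event["detail"]))
            self.server.db.commit()


def make_collector(host="127.0.0.1", port=0, db_path=":memory:"):
    """Create a single-threaded TCP collector backed by an SQLite fault table."""
    server = socketserver.TCPServer((host, port), FaultHandler)
    server.db = sqlite3.connect(db_path)
    # Assumed schema; `handled` hints at tracking the fault-handling workflow.
    server.db.execute(
        "CREATE TABLE IF NOT EXISTS faults ("
        " id INTEGER PRIMARY KEY, time TEXT, cluster TEXT, node TEXT,"
        " check_name TEXT, detail TEXT, handled INTEGER DEFAULT 0)")
    server.db.commit()
    return server
```

In this sketch a node would call `check_disk()` periodically and pass any non-`None` result to `send_fault()`, while the collector processes connections with `server.handle_request()` or `server.serve_forever()`. The single-threaded server keeps database access simple; a deployment covering many clusters would likely need concurrent handling, authentication, and retry logic, none of which are shown here.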