• Official journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2013, Vol. 35 ›› Issue (11): 54-61.

• Articles •

Fault monitoring and management system for multiple computing clusters

ZHANG Yi, CHEN Liang, PANG Jian

  1. (Computational Aerodynamics Institute, China Aerodynamics Research & Development Center, Mianyang 621000, China)
  • Received: 2012-09-12 Revised: 2012-11-21 Online: 2013-11-25 Published: 2013-11-25

Abstract:

As high-performance computing clusters grow in number and scale, system maintenance becomes more difficult and the workload heavier. The software system introduced in this paper runs on multiple Linux clusters with different hardware and software environments, automatically monitors their important operating states and indicators using command-line scripts and programs, and promptly sends fault messages to system administrators' Windows terminals via socket communication. Results demonstrate that the system improves the efficiency of system maintenance and shortens the response time for fault handling. It also uses a database to record and manage fault event data, thereby standardizing the fault-handling process.

Key words: cluster; fault; monitor; manage; database
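
The abstract describes a monitor-then-notify flow: a check runs on a cluster node, and any fault is pushed to the administrator's terminal over a socket. The following is only a minimal sketch of that idea, not the authors' implementation. The disk-usage check, the JSON line format, and the names `check_disk_usage`, `encode_fault`, and `send_fault` are all assumptions made for illustration; the paper's actual checks and message protocol are not given here.

```python
import json
import shutil
import socket
import time


def check_disk_usage(path="/", threshold=0.9):
    """Hypothetical check: return a fault dict if `path` is fuller than `threshold`, else None."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return None
    fraction = usage.used / usage.total
    if fraction > threshold:
        return {
            "cluster": socket.gethostname(),
            "check": "disk_usage",
            "detail": f"{path} is {fraction:.0%} full",
            "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        }
    return None


def encode_fault(fault):
    """Serialize one fault event as a newline-terminated JSON line (assumed wire format)."""
    return (json.dumps(fault, sort_keys=True) + "\n").encode("utf-8")


def send_fault(fault, host, port, timeout=5.0):
    """Open a TCP connection to the administrator's terminal and send one fault event."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_fault(fault))
```

A periodic job on each cluster could call `check_disk_usage()` and pass any non-`None` result to `send_fault()` with the address of a listener on the administrator's machine; that listener, and the database that would store the events, are outside this sketch.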