• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (05): 772-781.

• High Performance Computing •

Inference rule learning in autonomous fault management systems

ZHANG Li-li, WANG Rui-bo, WANG Xiao-dong, ZHANG Wen-zhe

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
  • Received:2022-04-21 Revised:2022-06-24 Accepted:2023-05-25 Online:2023-05-25 Published:2023-05-16

Abstract: As high-performance computer systems grow rapidly in scale, the inherent reliability of the whole system gradually decreases, giving rise to the “reliability wall” problem. To address this challenge, an autonomous fault management system has been designed for the Tianhe high-performance computer system. It monitors, analyzes, and manages alarms, faults, and errors across the whole system in real time. The fault messages it collects cover all logical layers of the system vertically and all functional modules horizontally. As a result, fault messages are causally related: a single fault source can trigger a series of subsequent fault events. This paper proposes EMRL, an algorithm for learning inference rules from fault information. EMRL represents fault information inference rules as a probabilistic model, uses this model to mine inference rules automatically from fault information, and generates the minimum fault inference graph from the mining results. The validity of EMRL is verified on partial operational data from the Tianhe system. The results show that EMRL can effectively mine the inference relations among fault messages.

Key words: inference rule learning, fault management, autonomous management
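
Illustrative sketch: the abstract does not specify how EMRL works internally, so the following is not the authors' algorithm. It is a toy stand-in, assuming a much simpler technique (windowed co-occurrence counting to estimate a rule confidence, followed by transitive reduction of the resulting graph), meant only to show what "mining inference rules" and "generating a minimum inference graph" could look like on a fault log. The function names `mine_rules` and `minimal_graph`, the parameters `window` and `min_conf`, and the fault type labels are all hypothetical.

```python
from collections import defaultdict


def mine_rules(events, window, min_conf):
    """Estimate rule confidences from a time-sorted list of (timestamp, fault_type).

    The confidence of a -> b is the fraction of occurrences of a that are
    followed by at least one b within `window` time units (an assumed,
    simplified stand-in for a learned probabilistic model).
    """
    count = defaultdict(int)    # occurrences of each fault type
    follows = defaultdict(int)  # occurrences of a followed by b within the window
    for i, (t_a, a) in enumerate(events):
        count[a] += 1
        seen = set()
        for t_b, b in events[i + 1:]:
            if t_b - t_a > window:
                break
            if b != a and b not in seen:
                seen.add(b)
                follows[(a, b)] += 1
    return {
        (a, b): n / count[a]
        for (a, b), n in follows.items()
        if n / count[a] >= min_conf
    }


def minimal_graph(rules):
    """Transitive reduction: drop a -> c when c is reachable from a by another path.

    Assumes the mined rules form an acyclic graph.
    """
    succ = defaultdict(set)
    for a, b in rules:
        succ[a].add(b)

    def reachable(src, dst, skip_edge):
        stack, visited = [src], set()
        while stack:
            node = stack.pop()
            for nxt in succ[node]:
                if (node, nxt) == skip_edge or nxt in visited:
                    continue
                if nxt == dst:
                    return True
                visited.add(nxt)
                stack.append(nxt)
        return False

    return {(a, b) for (a, b) in rules if not reachable(a, b, (a, b))}


# Hypothetical log: a power fault is followed by a link fault, then a node fault.
log = [(0, "power"), (1, "link"), (2, "node"),
       (100, "power"), (101, "link"), (102, "node")]
rules = mine_rules(log, window=5, min_conf=0.8)
graph = minimal_graph(rules)
```

In this toy log the direct rule from "power" to "node" is dropped because it is already implied by the chain through "link", which is the sense in which the reduced graph is minimal here.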