• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (05): 772-781.

• High Performance Computing •

Inference rule learning in autonomous fault management systems

ZHANG Li-li, WANG Rui-bo, WANG Xiao-dong, ZHANG Wen-zhe

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
  • Received: 2022-04-21 Revised: 2022-06-24 Accepted: 2023-05-25 Online: 2023-05-25 Published: 2023-05-16

Abstract: As the scale of high-performance computer systems grows rapidly, the inherent reliability of the system as a whole gradually decreases, giving rise to the “reliability wall” problem. To address this challenge, the Tianhe high-performance computer system includes an autonomous fault management system that monitors, analyzes, and manages alarms, faults, and errors across the whole system in real time. The fault messages it collects vertically span every logical layer of the system and horizontally cover all of its functional modules. As a result, there are logical causal relationships among fault messages: a single fault source leads to a series of subsequent fault events. This paper proposes EMRL, an inference rule learning algorithm for fault information. EMRL models the inference rules of fault information as a probabilistic model, uses that model to mine fault inference rules automatically from fault information, and automatically generates the minimal fault inference graph from the mining results. EMRL is validated on part of the operational data of the Tianhe system, and the results show that it can effectively mine the inference relations among fault information.

Key words: inference rule learning, fault management, autonomous management
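
The abstract does not give the details of EMRL, so the following is only an illustrative sketch of the general idea of mining probabilistic cause-effect rules from a fault log and reducing them to a minimal inference graph. It is not the EMRL algorithm: it swaps in a simple frequency estimate of P(effect follows cause within a time window), followed by a transitive reduction. The function names, the fault type names, the `window` and `min_conf` parameters, and the sample log are all hypothetical, and the reduction step assumes the mined rules form an acyclic graph.

```python
from collections import defaultdict


def mine_rules(events, window, min_conf):
    """Estimate P(effect follows cause within `window`) from a time-sorted log.

    `events` is a list of (timestamp, fault_type) pairs (hypothetical format).
    Returns {(cause, effect): confidence} for pairs with confidence >= min_conf.
    """
    cause_count = defaultdict(int)
    pair_count = defaultdict(int)
    for i, (t_i, cause) in enumerate(events):
        cause_count[cause] += 1
        seen = set()  # count each follower type once per cause occurrence
        for t_j, effect in events[i + 1:]:
            if t_j - t_i > window:
                break
            if effect != cause and effect not in seen:
                seen.add(effect)
                pair_count[(cause, effect)] += 1
    rules = {}
    for (cause, effect), n in pair_count.items():
        conf = n / cause_count[cause]
        if conf >= min_conf:
            rules[(cause, effect)] = conf
    return rules


def minimal_graph(rules):
    """Transitive reduction: drop an edge if another path already implies it.

    Assumes the rule graph is acyclic; with cycles the result may lose
    reachability.
    """
    succ = defaultdict(set)
    for cause, effect in rules:
        succ[cause].add(effect)

    def reachable_without(src, dst, skipped_edge):
        stack, visited = [src], set()
        while stack:
            u = stack.pop()
            for v in succ.get(u, ()):
                if (u, v) == skipped_edge:
                    continue
                if v == dst:
                    return True
                if v not in visited:
                    visited.add(v)
                    stack.append(v)
        return False

    return {e for e in rules if not reachable_without(e[0], e[1], e)}


# Hypothetical fault log: (timestamp in seconds, fault type).
log = [
    (0, "PSU_FAIL"), (1, "NODE_DOWN"), (2, "JOB_ABORT"),
    (100, "PSU_FAIL"), (101, "NODE_DOWN"), (102, "JOB_ABORT"),
    (200, "LINK_ERR"), (201, "JOB_ABORT"),
]
rules = mine_rules(log, window=5, min_conf=0.8)
graph = minimal_graph(rules)
```

In this toy log the direct rule from `PSU_FAIL` to `JOB_ABORT` is mined but then dropped from the graph, because it is already implied by the chain through `NODE_DOWN`. A real system would need a more robust probabilistic model than raw co-occurrence counts, which is presumably what the paper's model provides.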