• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (01): 33-41.

• High Performance Computing •

FD-LSTM:基于大规模系统日志的故障分析模型

方姣丽,左克,黄春,刘杰,李胜国,卢凯   

  1. (国防科技大学计算机学院,湖南 长沙 410073)
  • 收稿日期:2020-06-11 修回日期:2020-07-17 接受日期:2021-01-25 出版日期:2021-01-25 发布日期:2021-01-22
  • Funding:
    National Numerical Windtunnel Project (NNW2019ZT6-B20, NNW2019ZT5-A10, NNW2019ZT6-B21)

FD-LSTM: A fault analysis model based on large-scale system logs

FANG Jiao-li, ZUO Ke, HUANG Chun, LIU Jie, LI Sheng-guo, LU Kai

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)

  • Received: 2020-06-11 Revised: 2020-07-17 Accepted: 2021-01-25 Online: 2021-01-25 Published: 2021-01-22

摘要: 可靠性研究是高性能计算领域的经典问题,随着制程技术与集成工艺的不断发展,当前全系统规模呈指数级快速增长,给可靠性研究尤其是故障分析带来巨大挑战。收集了自主高性能计算系统投产后工作故障日志信息203 510 247条,时间自2016年1月28日至2016年12月6日。首先使用K-Means聚类方法对故障进行分类,并分析故障分布特征。接着基于聚类结果设计基于时序的故障分析模型FD-LSTM,使用结构化日志训练后,预测不同故障类型的发生时间和空间,结果表明所提出的FD-LSTM 预测模型准确率可达80.56%。本文研究表明,基于日志信息的时序模型FD-LSTM在时间预测和空间预测方面,较之前传统的故障分析模型,在提高故障分析准确度、加强机器运维高效性,乃至增进全系统协同设计合理化等方面都具有现实的指导意义。


关键词: 系统日志, LSTM, K-Means, 故障分析

Abstract: Reliability is a classic research problem in high-performance computing. As process technology and integration techniques continue to advance, full-system scale is growing exponentially, which poses great challenges for reliability research, and for fault analysis in particular. This paper collects 203,510,247 fault log entries recorded after a domestically developed high-performance computing system entered production, covering January 28, 2016 to December 6, 2016. First, the K-Means clustering method is used to classify the faults, and the distribution characteristics of the faults are analyzed. Then, based on the clustering results, a time-series fault analysis model, FD-LSTM, is designed. After training on structured logs, the model predicts when and where different fault types will occur; the results show that the prediction accuracy of FD-LSTM can reach 80.56%. This study shows that, compared with traditional fault analysis models, the log-based time-series model FD-LSTM, with its temporal and spatial prediction, offers practical guidance for improving the accuracy of fault analysis, making system operation and maintenance more efficient, and making whole-system co-design more rational.




Key words: system log, long short-term memory, K-Means, fault analysis
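
The abstract names K-Means as the method used to group faults before the FD-LSTM model is trained, but it does not say which log features are clustered. The following is a minimal sketch of plain K-Means (Lloyd's algorithm) only. It is not the authors' implementation, and the two features per log entry (hour of day, severity level), the toy data, and the function name `kmeans` are all hypothetical choices made for illustration.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k input points
    labels = None
    for _ in range(iters):
        # Assignment step: index of the nearest centroid (squared Euclidean).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical toy features per log entry: (hour of day, severity level).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 9.0), (8.2, 9.1), (7.9, 8.8)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

On this toy input the first three entries fall into one cluster and the last three into the other. In a real pipeline, the resulting cluster labels would be one plausible way to tag log entries with a fault type before building time-ordered sequences for a sequence model.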