• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (01): 33-41.

FD-LSTM: A fault analysis model based on large-scale system logs

FANG Jiao-li, ZUO Ke, HUANG Chun, LIU Jie, LI Sheng-guo, LU Kai

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)

  • Received:2020-06-11 Revised:2020-07-17 Accepted:2021-01-25 Online:2021-01-25 Published:2021-01-22

Abstract: Reliability is a classic problem in high-performance computing. With continuing advances in process and integration technology, system scale has grown exponentially, which poses great challenges for reliability research, and for failure analysis in particular. This paper collects 203,510,247 job failure log entries produced by an independently developed high-performance computing system during its operation from January 28, 2016 to December 6, 2016. First, K-Means clustering is used to classify the faults and analyze their distribution characteristics. Second, based on the clustering results, a time-series fault analysis model, FD-LSTM, is designed; after training on structured logs, it predicts when and where different fault types occur. The results show that the prediction accuracy of FD-LSTM can reach 80.56%. Compared with traditional fault analysis models, the log-based time-series model FD-LSTM offers practical guidance, in both temporal and spatial prediction, for improving the accuracy of fault analysis, increasing the efficiency of machine operation and maintenance, and making coordinated whole-system design more rational.
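
The clustering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which features, distance measure, or number of clusters the authors used, so everything here is an assumption made for illustration: the log lines are hypothetical, the features are simple bag-of-words counts, and `vectorize`, `sq_dist`, and `kmeans` are illustrative helper names rather than anything from the paper. The FD-LSTM model itself is not reproduced.

```python
from collections import Counter

# Hypothetical log messages (not taken from the paper's data set).
LOGS = [
    "node cn101 memory ecc error",
    "node cn102 memory ecc error",
    "node cn205 network link timeout",
    "node cn206 network link timeout",
    "node cn330 memory ecc error",
    "node cn331 network link timeout",
]


def vectorize(messages):
    """Turn each message into a bag-of-words count vector (assumed features)."""
    vocab = sorted({tok for m in messages for tok in m.lower().split()})
    vectors = []
    for m in messages:
        counts = Counter(m.lower().split())
        vectors.append([float(counts[t]) for t in vocab])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Lloyd's K-Means with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's seed
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels


labels = kmeans(vectorize(LOGS), k=2)
for label, message in zip(labels, LOGS):
    print(label, message)
```

On this toy input the memory-error messages and the network-timeout messages fall into separate clusters. For a log volume on the scale the abstract reports, a vectorized library implementation with a more careful feature design would be the realistic choice.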

Key words: system log, long short-term memory, K-Means, fault analysis