Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (1): 33-41.
FANG Jiao-li,ZUO Ke,HUANG Chun,LIU Jie,LI Sheng-guo,LU Kai
Abstract: Reliability is a classic research problem in the field of high-performance computing. With the continuous development of process and integration technology, system scale has grown exponentially, which poses great challenges to reliability research, especially failure analysis. This paper collects 203,510,247 job failure log records produced by an independent high-performance computing system in operation from January 28, 2016 to December 6, 2016. First, the K-Means clustering method is used to classify the faults and analyze their distribution characteristics. Second, based on the clustering results, a time-based fault analysis model, FD-LSTM, is designed. After training on structured logs, the model predicts when and where different fault types occur. The results show that the prediction accuracy of the proposed FD-LSTM model can reach 80.56%. The study shows that, compared with traditional fault analysis models, the log-based time series model FD-LSTM offers practical guidance, in both temporal and spatial prediction, for improving the accuracy of fault analysis, enhancing the efficiency of machine operation and maintenance, and making collaborative whole-system design more rational.
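The clustering step mentioned in the abstract can be illustrated with a minimal K-Means sketch. This is not the authors' implementation: the two features per failure record (hour of day, job runtime in hours), the toy data, and the farthest-point initialization are all assumptions made purely for illustration, and the paper's actual feature extraction from logs is not described here.

```python
from math import dist

def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    # Deterministic farthest-point initialization (an assumption of this sketch):
    # start from the first point, then repeatedly add the point that is
    # farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy features per failure record: (hour of day, job runtime in hours).
records = [(1.0, 0.2), (2.0, 0.1), (1.5, 0.3), (14.0, 5.0), (15.0, 6.0), (13.5, 5.5)]
centroids, labels = kmeans(records, k=2)
```

On this toy input the night-time short jobs and the afternoon long jobs fall into separate clusters. In a setting like the one the abstract describes, the resulting cluster labels would presumably serve as fault types for the downstream sequence model.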
Key words: system log, long short-term memory, K-Means, fault analysis
FANG Jiao-li, ZUO Ke, HUANG Chun, LIU Jie, LI Sheng-guo, LU Kai. FD-LSTM: A fault analysis model based on large-scale system logs[J]. Computer Engineering & Science, 2021, 43(1): 33-41.
URL: http://joces.nudt.edu.cn/EN/
http://joces.nudt.edu.cn/EN/Y2021/V43/I1/33