Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (9): 1535-1543.
• High Performance Computing •
HU He1, ZHAO Yi1, GU Beibei1,2, ZHAO Yunqing1
Abstract: This paper presents a method for detecting job anomalies on large-scale, distributed, heterogeneous HPC platforms. Analyzing job runtime logs is vital for detecting anomalies, but the sheer volume of logs makes manual inspection impractical. To address this, we introduce a multi-source log semantic analysis approach that uses latent Dirichlet allocation (LDA) to analyze logs from various sources. By modeling how topics evolve over time and matching that evolution against the patterns of historical faulty jobs, the method predicts anomalies. Experiments on a domestic HPC platform show 95.2% precision. The method improves predictive capability and helps users and administrators diagnose issues quickly, thereby improving the availability and efficiency of the HPC environment.
Key words: data processing, fault identification, hybrid heterogeneity, semantic analysis, latent Dirichlet allocation (LDA)
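The matching step described in the abstract can be illustrated with a minimal sketch. It assumes an LDA model has already turned each time window of a job's logs into a topic distribution, and it compares the resulting sequence with stored sequences from historical faulty jobs by mean cosine similarity. The function names, the fault labels, the sample numbers, and the 0.9 threshold are all hypothetical, and the abstract does not state which similarity measure the authors actually use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length topic distributions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def trajectory_similarity(job, pattern):
    """Mean cosine similarity over the time windows both sequences share."""
    n = min(len(job), len(pattern))
    if n == 0:
        return 0.0
    return sum(cosine(job[i], pattern[i]) for i in range(n)) / n


def match_faulty_pattern(job, patterns, threshold=0.9):
    """Return (label, score) of the closest historical fault pattern.

    The label is None when no pattern reaches the threshold.
    The threshold value is an illustrative assumption, not from the paper.
    """
    best_label, best_score = None, 0.0
    for label, pattern in patterns.items():
        score = trajectory_similarity(job, pattern)
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return best_label, best_score
    return None, best_score


# Hypothetical data: 3 topics, 3 time windows per job.
# In practice these rows would come from an LDA model fitted on the logs.
historical_patterns = {
    "oom": [[0.8, 0.1, 0.1], [0.5, 0.4, 0.1], [0.1, 0.8, 0.1]],
    "io_timeout": [[0.1, 0.1, 0.8], [0.1, 0.1, 0.8], [0.1, 0.1, 0.8]],
}
running_job = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.1, 0.7, 0.2]]

label, score = match_faulty_pattern(running_job, historical_patterns)
print(label, round(score, 3))
```

Comparing window by window keeps the time ordering of topics, so two jobs with the same overall topic mix but a different order of events are not treated as the same pattern.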
HU He, ZHAO Yi, GU Beibei, ZHAO Yunqing. A job failure identification method of heterogeneous supercomputing platforms based on semantic analysis of multi-source logs[J]. Computer Engineering & Science, 2025, 47(9): 1535-1543.
URL: http://joces.nudt.edu.cn/EN/
http://joces.nudt.edu.cn/EN/Y2025/V47/I9/1535