• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (9): 1535-1543.

• High Performance Computing •

A job failure identification method of heterogeneous supercomputing platforms based on semantic analysis of multi-source logs

HU He¹, ZHAO Yi¹, GU Beibei¹,², ZHAO Yunqing¹

  (1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083;
    2. University of Chinese Academy of Sciences, Beijing 100190, China)

  • Received: 2024-12-18  Revised: 2025-02-15  Online: 2025-09-25  Published: 2025-09-22

Abstract: This paper presents a method for detecting job anomalies on large-scale, distributed, heterogeneous HPC platforms. Analyzing job runtime logs is vital for detecting anomalies, but the sheer volume of logs makes manual inspection impractical. To address this, we introduce a multi-source log semantic analysis approach that uses latent Dirichlet allocation (LDA) to analyze logs from various sources. By modeling how topics evolve over time and matching them against the patterns of historical faulty jobs, the method predicts anomalies. Experiments on a domestic HPC platform show a precision of 95.2%. The method enhances predictive capability and helps users and administrators diagnose issues quickly, thereby improving the availability and efficiency of the HPC environment.

Key words: data processing, fault identification, hybrid heterogeneity, semantic analysis, latent Dirichlet allocation (LDA)
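
The matching step described in the abstract (comparing a job's topic evolution against historical faulty-job patterns) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that an LDA model has already turned each time window of a job's logs into a topic distribution, and it uses mean per-window cosine similarity with a fixed threshold as the matching rule; the abstract specifies neither. All function names, fault labels, the threshold, and the numbers are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

# A trajectory holds one topic distribution per time window (assumed LDA output).
Trajectory = Sequence[Sequence[float]]


def cosine(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def trajectory_similarity(job: Trajectory, pattern: Trajectory) -> float:
    """Mean per-window cosine similarity over the windows both trajectories share."""
    n = min(len(job), len(pattern))
    if n == 0:
        return 0.0
    return sum(cosine(job[i], pattern[i]) for i in range(n)) / n


def match_fault_pattern(
    job: Trajectory,
    patterns: Dict[str, Trajectory],
    threshold: float = 0.9,  # hypothetical cut-off, not taken from the paper
) -> Tuple[Optional[str], float]:
    """Return the best-matching fault label, or None if no pattern reaches the threshold."""
    best_label, best_score = None, 0.0
    for label, pattern in patterns.items():
        score = trajectory_similarity(job, pattern)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Hypothetical historical patterns: 3 topics tracked over 3 time windows.
HISTORICAL_PATTERNS = {
    "out_of_memory": [[0.8, 0.1, 0.1], [0.5, 0.4, 0.1], [0.1, 0.8, 0.1]],
    "network_timeout": [[0.8, 0.1, 0.1], [0.6, 0.1, 0.3], [0.1, 0.1, 0.8]],
}

# A job whose topics drift the same way as the "out_of_memory" pattern.
suspect_job = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.2, 0.7, 0.1]]
# A job whose topic evolution resembles neither pattern.
other_job = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1]]

suspect_label, suspect_score = match_fault_pattern(suspect_job, HISTORICAL_PATTERNS)
other_label, other_score = match_fault_pattern(other_job, HISTORICAL_PATTERNS)
print(suspect_label, other_label)
```

With these made-up numbers the first job is matched to the hypothetical `out_of_memory` pattern and the second is left unmatched. A production system would likely need to handle trajectories of different lengths and time offsets, which this sketch only truncates to the shared windows.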