• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (10): 1753-1761.

• High Performance Computing •

Job failure prediction based on user behavior on supercomputers

TANG Yang-kun1,2, XIAN Gang1,2, YANG Wen-xiang2,3, YU Jie2, ZHANG Xiao-rong1, WANG Yao-bin1

  (1. School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang 621010;
  2. Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Mianyang 621050;
  3. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
  • Received: 2021-09-02 Revised: 2022-01-10 Accepted: 2022-10-25 Online: 2022-10-25 Published: 2022-10-28

Abstract: As supercomputers grow in scale and scientific applications grow in complexity, many jobs fail on these systems. Failed jobs waste resources and prolong the waiting time of queued jobs, which seriously affects system reliability. If job failures can be predicted in advance, measures can be taken to improve resource utilization and execution efficiency, which is important for future exascale supercomputers. This paper therefore attempts to predict job failures from known traditional features and constructed features, and identifies features and processing methods that reflect users' work behavior patterns and submission behavior patterns. By combining behavior features with traditional features, a comprehensive prediction framework based on tree-structured models is proposed. Experimental results show that the comprehensive framework outperforms single-model prediction, and comparative experiments show that it predicts better than other related methods.

Key words: system execution efficiency, job log analysis, user behavior, job failure prediction, machine learning
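
Illustrative sketch (not from the paper): the abstract describes combining traditional job features with features built from user work and submission behavior. The Python below shows one way such behavior features might be derived from a job log, under stated assumptions. The record layout, the feature names (`submit_hour`, `prior_failure_rate`, `minutes_since_last_submit`), and the sample data are all hypothetical; the paper's actual features and processing methods are not given on this page. As a simplification, the sketch also assumes every earlier job's outcome is already known when the next job is submitted.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical job-log records: (user, submit_time, requested_nodes, failed).
JOBS = [
    ("alice", "2021-03-01 09:15", 4, False),
    ("alice", "2021-03-01 09:40", 4, True),
    ("bob", "2021-03-01 23:05", 64, True),
    ("alice", "2021-03-02 10:00", 8, True),
    ("bob", "2021-03-02 23:30", 64, True),
]


def build_features(jobs):
    """Return one feature dict per job, built only from earlier jobs of the same user."""
    history = defaultdict(lambda: [0, 0])  # user -> [jobs seen, failures seen]
    last_submit = {}                       # user -> previous submission time
    rows = []
    # Timestamps in this format sort chronologically as plain strings.
    for user, ts, nodes, failed in sorted(jobs, key=lambda job: job[1]):
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M")
        seen, fails = history[user]
        prev = last_submit.get(user)
        rows.append({
            "user": user,
            "requested_nodes": nodes,  # traditional feature
            "submit_hour": t.hour,     # assumed work-behavior feature
            # Assumed submission-behavior features:
            "prior_failure_rate": fails / seen if seen else 0.0,
            "minutes_since_last_submit": (
                (t - prev).total_seconds() / 60 if prev else None
            ),
            "failed": failed,          # label to be predicted
        })
        # Update the user's history only after the row is built,
        # so a job never sees its own outcome.
        history[user][0] += 1
        history[user][1] += int(failed)
        last_submit[user] = t
    return rows


if __name__ == "__main__":
    for row in build_features(JOBS):
        print(row)
```

Rows of this shape could then be passed to any tree-based classifier. Which models the framework actually combines is not stated in the abstract.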