• Official journal of the China Computer Federation (CCF)
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学), 2022, Vol. 44, Issue 10: 1753-1761.

• High Performance Computing •

Job failure prediction based on user behavior on supercomputers
(基于用户行为的超级计算机作业失败预测方法)

TANG Yang-kun (唐阳坤)1,2, XIAN Gang (鲜港)1,2, YANG Wen-xiang (杨文祥)2,3, YU Jie (喻杰)2, ZHANG Xiao-rong (张晓蓉)1, WANG Yao-bin (王耀彬)1

  1. School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang 621010, Sichuan, China
  2. Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Mianyang 621050, Sichuan, China
  3. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China

  • Received: 2021-09-02  Revised: 2022-01-10  Accepted: 2022-10-25  Online: 2022-10-25  Published: 2022-10-28
  • Funding: National Natural Science Foundation of China (61872304, 61802320); State Key Laboratory of Aerodynamics Fund (SKLA20200203)

Abstract: As supercomputers continue to grow in scale and scientific applications grow in complexity, many jobs on supercomputers fail. Failed jobs waste resources and prolong the waiting time of queued jobs, which seriously reduces the execution efficiency of the system. If job failures can be predicted in advance, measures can be taken to improve system resource utilization and execution efficiency, which is very important for future exascale supercomputers. This paper therefore studies how to predict job failures from known traditional features and constructed features, and identifies features, together with ways of processing them, that reflect users' work behavior patterns and submission behavior patterns. By combining behavior features with traditional features, a comprehensive framework based on tree-structured models is proposed to predict job failure. The prediction experiments show that the comprehensive framework outperforms single-model prediction, and the comparative experiments show that it predicts better than other related methods.

Key words: system execution efficiency, job log analysis, user behavior, job failure prediction, machine learning
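
The abstract does not list the specific features or models used, so the following is only a minimal sketch of the general idea: deriving per-user behavior features from a job log and feeding them to a tree-structured predictor. Everything in it is an assumption made for illustration, not the authors' method: the `Job` fields, the two behavior features (`past_fail_rate`, `gap`), and the one-split decision stump that stands in for a full tree-based framework. It also assumes that each job has finished before the same user's next submission, which a real log would not guarantee.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job-log record; field names are illustrative only.
    user: str
    submit_time: int  # seconds since an arbitrary epoch
    nodes: int        # requested nodes, a "traditional" feature
    failed: bool      # label taken from the job log


def add_behavior_features(jobs):
    """Return one feature row per job, in submission order.

    Behavior features use only jobs the same user submitted earlier.
    Simplification: earlier jobs are assumed to have finished already.
    """
    history = {}  # user -> (jobs seen, failures seen, last submit time)
    rows = []
    for job in sorted(jobs, key=lambda j: j.submit_time):
        seen, fails, last = history.get(job.user, (0, 0, None))
        rows.append({
            "nodes": job.nodes,
            # Share of this user's earlier jobs that failed (0.0 if none).
            "past_fail_rate": fails / seen if seen else 0.0,
            # Seconds since the user's previous submission (-1 if none).
            "gap": job.submit_time - last if last is not None else -1,
            "failed": job.failed,
        })
        history[job.user] = (seen + 1, fails + int(job.failed), job.submit_time)
    return rows


def fit_stump(rows, feature):
    """Fit a one-split tree: predict failure when feature > threshold.

    Returns the threshold with the highest training accuracy.
    """
    best_threshold, best_correct = None, -1
    for t in sorted({r[feature] for r in rows}):
        correct = sum((r[feature] > t) == r["failed"] for r in rows)
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold


# Tiny made-up log: user "a" keeps failing, user "b" does not.
jobs = [
    Job("a", 0, 4, True),
    Job("a", 10, 4, True),
    Job("b", 5, 2, False),
    Job("a", 30, 4, True),
    Job("b", 50, 2, False),
]
rows = add_behavior_features(jobs)
threshold = fit_stump(rows, "past_fail_rate")
```

In practice, a gradient-boosted or random-forest model from an established library would replace the stump, and the feature rows would combine such behavior features with the traditional ones recorded by the scheduler.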