• Official journal of the China Computer Federation (CCF)
  • Core journal of Chinese science and technology
  • Chinese core journal

Computer Engineering & Science, 2026, Vol. 48, Issue (1): 28-39.

• High Performance Computing •

超算集群作业特征分析与运行时间预测

杨泓桢,程伟,杜量,黄聃,曾楚轩,肖侬   

  (1.中山大学计算机学院,广东 广州 510006;
   2.中国联合网络通信有限公司广东省分公司算网研究运营基地,广东 广州 510630)

  • 收稿日期:2024-11-26 修回日期:2024-12-28 出版日期:2026-01-25 发布日期:2026-01-25

Characteristics analysis and runtime prediction of jobs in supercomputer

YANG Hongzhen, CHENG Wei, DU Liang, HUANG Dan, ZENG Chuxuan, XIAO Nong

  (1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006;
   2. Computing Network Research and Operation Department, Guangdong Unicom, Guangzhou 510630, China)
  • Received: 2024-11-26  Revised: 2024-12-28  Online: 2026-01-25  Published: 2026-01-25

摘要: 高性能计算集群的作业日志可以用来分析系统工作负载,发现系统使用的周期性规律、作业特征之间的相关性和用户行为模式,并进一步帮助开发运行时间预测模型,降低作业运行时间估计值误差,提高作业回填调度的性能。现有的预测算法侧重于提高作业运行时间的平均预测准确率,而忽略了预测值低于实际运行时间的情况(低估预测),可能导致调度器提前终止执行中的作业,降低系统资源的有效利用率。为解决上述问题,在对HPC作业特征的长期变化趋势和相关性开展分析的基础上,提出了一个集成学习模型预测作业运行时间,并提出有序扩展最大值策略调整集成模型的预测结果。实验结果表明,作业运行时间预测模型在保持较高预测准确率的同时显著降低了低估率,并且具有较好的稳定性和泛化能力。


关键词: 高性能计算, 大规模系统, 特征分析, 运行时间预测, 集成学习

Abstract: Job logs from high-performance computing (HPC) clusters can be used to analyze system workloads and to identify periodic patterns in system usage, correlations among job characteristics, and user behavior patterns. Such analysis in turn supports the development of runtime prediction models, which reduce the error in estimated job runtimes and improve the performance of backfilling job scheduling. Existing prediction algorithms focus on improving the average prediction accuracy of job runtimes but overlook cases where the predicted value falls below the actual runtime (underprediction). Underprediction may cause the scheduler to terminate running jobs prematurely, reducing the effective utilization of system resources. To address this issue, and building on an analysis of the long-term trends and correlations of HPC job characteristics, this paper proposes an ensemble learning model to predict job runtimes and introduces an ordered extended maximum strategy to adjust the ensemble's predictions. Experimental results show that the runtime prediction model significantly reduces the underprediction rate while maintaining high prediction accuracy, and that it exhibits good stability and generalization ability.

Key words: high-performance computing, large-scale system, characteristics analysis, runtime prediction, ensemble learning
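
The two quantities the abstract weighs against each other, prediction accuracy and underprediction rate, can be illustrated with a minimal sketch. The function names, the sample runtimes, and the min/max ratio used as the accuracy measure are assumptions made here for illustration; the abstract does not state which accuracy metric the paper uses, and the sketch does not implement the paper's ensemble model or its ordered extended maximum strategy.

```python
from typing import Sequence


def underprediction_rate(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Fraction of jobs whose predicted runtime is below the actual runtime."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    under = sum(1 for p, a in zip(predicted, actual) if p < a)
    return under / len(actual)


def mean_accuracy(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean of min(p, a) / max(p, a) per job (assumed metric; 1.0 means exact)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, a in zip(predicted, actual):
        hi = max(p, a)
        # Treat a job with zero predicted and zero actual runtime as exact.
        total += 1.0 if hi == 0 else min(p, a) / hi
    return total / len(actual)


# Hypothetical runtimes in seconds, not taken from the paper.
actual_runtimes = [100.0, 200.0, 300.0, 400.0]
predicted_runtimes = [120.0, 150.0, 300.0, 500.0]

rate = underprediction_rate(predicted_runtimes, actual_runtimes)
accuracy = mean_accuracy(predicted_runtimes, actual_runtimes)
print(f"underprediction rate: {rate:.2f}, mean accuracy: {accuracy:.3f}")
```

In this sample only the second job is underpredicted (150 s predicted against 200 s actual). A scheduler that enforces the predicted value as a limit would terminate that job early, which is the failure mode the abstract describes.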