• A journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (07): 1181-1190.

• High Performance Computing •

Cluster job runtime prediction based on NR-Transformer

CHEN Feng-xian   

  1. (Office of Network Security and Information, Lanzhou University, Lanzhou 730000, China)
  • Received: 2021-04-02  Revised: 2021-09-14  Accepted: 2022-07-25  Online: 2022-07-25  Published: 2022-07-25

Abstract: Job scheduling on high-performance clusters is usually handled by a job scheduling system, and accurately specifying job runtimes can greatly improve scheduling efficiency. Existing research typically relies on machine-learning-based prediction, which still leaves room for improvement in both accuracy and practicality. To further improve the accuracy of cluster job runtime prediction, the cluster job logs are first clustered and the resulting job category information is added to the job features; the job log data are then modeled and predicted with the attention-based NR-Transformer network. In data processing, 7-dimensional features are selected from the historical log dataset according to their correlation with the prediction target, feature completeness, and data validity; the dataset is divided into multiple job sets by job runtime length, and each job set is trained and predicted on separately. Experimental results show that, compared with traditional machine learning and BP neural networks, temporal neural network structures achieve better prediction performance, and NR-Transformer performs well on every job set.
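
The preprocessing pipeline described in the abstract (cluster the historical logs, append the cluster label as an extra feature, then partition jobs into sets by runtime length) can be sketched as follows. The feature layout, runtime boundaries, and the tiny 1-D k-means used here are illustrative assumptions, not the paper's exact 7-dimensional features or clustering method:

```python
def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means (illustrative): returns a cluster label per value."""
    # Spread initial centers across the sorted values.
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest center.
        labels = [min(range(len(centers)), key=lambda c: abs(v - centers[c]))
                  for v in values]
        # Move each center to the mean of its members.
        for c in range(len(centers)):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def preprocess(jobs, k=2, boundaries=(600, 3600)):
    """jobs: dicts with 'user_mean_runtime', 'features', 'runtime' (seconds).
    Clusters jobs by user behavior, appends the cluster label as a feature,
    and splits jobs into short/medium/long sets by runtime length.
    Keys and boundaries are hypothetical, for illustration only."""
    labels = kmeans_1d([j["user_mean_runtime"] for j in jobs], k)
    sets = {"short": [], "medium": [], "long": []}
    for job, label in zip(jobs, labels):
        row = job["features"] + [label]  # cluster label becomes an extra feature
        if job["runtime"] < boundaries[0]:
            sets["short"].append(row)
        elif job["runtime"] < boundaries[1]:
            sets["medium"].append(row)
        else:
            sets["long"].append(row)
    return sets
```

Each resulting job set would then be used to train and evaluate its own model (NR-Transformer in the paper), so that short- and long-running jobs do not share one regression target scale.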

Key words: high performance computing, parallel job scheduling, user clustering, temporal neural network, attention mechanism