• Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学) ›› 2021, Vol. 43 ›› Issue (08): 1331-1340.

• High Performance Computing •

User QoS-aware deep learning task dynamic scheduling on GPU clusters
(用户QoS感知的GPU集群深度学习任务动态调度)

LUO Lei (罗磊), CHEN Zhao-yun (陈照云), WANG Li-xuan (王俪璇)

  1. (College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan 410073, China)
  • Received: 2020-06-09 Revised: 2020-09-23 Accepted: 2021-08-25 Online: 2021-08-25 Published: 2021-08-24
  • Supported by:
    National Natural Science Foundation of China (61872377); National Key Research and Development Program of China (2018YFB0204301)

Abstract:

This paper proposes a user QoS (Quality of Service)-aware dynamic task scheduling method for deep learning development platforms running on GPU clusters. An offline evaluation module benchmarks deep learning tasks offline and builds a computational performance prediction model. Using this model together with each task's expected QoS, an online scheduling module jointly decides task placement and task execution order. Experiments on a distributed GPU cluster show that the proposed method achieves a higher QoS guarantee rate and higher cluster resource utilization than the baseline scheduling policies.

Key words: deep learning, GPU cluster, task scheduling, QoS
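
The abstract does not spell out the scheduling algorithm, so the following is only a minimal sketch of how an offline performance profile and per-task QoS targets might be combined in one scheduling round. Everything in it is an assumption made for illustration: the `PROFILE` table and its numbers, the `Task` fields, the use of a completion deadline as the QoS target, the least-slack-first ordering, and the "fewest GPUs that still meet the deadline" placement rule. It is not the authors' method.

```python
from dataclasses import dataclass

# Hypothetical offline profile: throughput in iterations/second for each
# (model, GPU count) pair. The numbers are invented for illustration.
PROFILE = {
    ("resnet50", 1): 10.0,
    ("resnet50", 2): 18.0,
    ("resnet50", 4): 30.0,
    ("lstm", 1): 40.0,
    ("lstm", 2): 60.0,
}


@dataclass
class Task:
    name: str
    model: str
    iterations: int
    deadline: float  # assumed QoS target: finish within this many seconds


def predict_runtime(task, gpus):
    """Predicted run time in seconds, taken from the offline profile."""
    return task.iterations / PROFILE[(task.model, gpus)]


def best_case_runtime(task):
    """Shortest predicted run time over all profiled GPU counts."""
    counts = [g for (m, g) in PROFILE if m == task.model]
    return min(predict_runtime(task, g) for g in counts)


def min_gpus_for_qos(task, free_gpus, now=0.0):
    """Smallest profiled GPU count that fits and meets the deadline, else None."""
    counts = sorted(g for (m, g) in PROFILE if m == task.model and g <= free_gpus)
    for gpus in counts:
        if now + predict_runtime(task, gpus) <= task.deadline:
            return gpus
    return None


def schedule_round(tasks, free_gpus, now=0.0):
    """One greedy round: order by least slack, place with the fewest GPUs."""
    placements, queued = {}, []
    by_slack = sorted(tasks, key=lambda t: t.deadline - now - best_case_runtime(t))
    for task in by_slack:
        gpus = min_gpus_for_qos(task, free_gpus, now)
        if gpus is None:
            queued.append(task.name)  # cannot meet its QoS right now
            continue
        placements[task.name] = gpus
        free_gpus -= gpus
    return placements, queued


if __name__ == "__main__":
    demo = [
        Task("A", "resnet50", 3000, 200.0),
        Task("B", "lstm", 2400, 70.0),
        Task("C", "resnet50", 6000, 250.0),
    ]
    print(schedule_round(demo, free_gpus=4))
```

Giving each task the fewest GPUs that still satisfy its deadline leaves more of the cluster free for other tasks, which is one simple way to trade off QoS guarantees against utilization. A real scheduler would also need to handle task arrivals and completions over time.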