• A publication of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (3): 389-397.

• High Performance Computing •

• Funding:
    Natural Science Foundation of Guangdong Province (2025A1515011663)

A GPU-sharing-based scheduling framework for accelerating deep learning training tasks

LIN Chenxi, LI Jialun, MO Xuan, ZHOU Jieying, WU Weigang

  (1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China;
    2. School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou 510665, China)
  • Received: 2025-07-10  Revised: 2025-08-20  Online: 2026-03-25  Published: 2026-03-25


Abstract: Deep learning (DL) is increasingly applied across a wide range of business scenarios. How to efficiently utilize GPU cluster resources for training DL tasks and reduce task completion time has garnered sustained attention from both industry and academia. A single DL training task often fails to fully utilize all the computational resources of a GPU, and the exclusive GPU allocation of traditional schedulers leads to low resource utilization. This paper proposes a GPU-sharing-based task scheduling framework, G-Share, which allows multiple DL tasks to be trained on the same GPU simultaneously, i.e., co-location scheduling. Task scheduling and resource allocation are performed with awareness of the interference between co-located tasks, so as to improve GPU utilization and thereby accelerate task execution. Specifically, G-Share first characterizes the mutual interference between tasks through offline modeling and online updates, and formulates the GPU-sharing scheduling problem as a minimum-weight bipartite matching problem. Solving this problem yields the resource allocation, and a time-slicing mechanism enables dynamic task scheduling that tracks changes in the optimal co-location combinations in online scenarios. Experiments on a DL task workload dataset from SenseTime demonstrate that G-Share achieves a 20.6% reduction in average task completion time compared with the baseline methods.
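The allocation step described in the abstract — casting GPU-sharing assignment as a minimum-weight bipartite matching between tasks and shareable GPU slots — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the interference cost values are made up, and the use of SciPy's `linear_sum_assignment` solver is an assumption about how such a matching could be computed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_tasks_to_gpu_slots(cost):
    """Solve a minimum-weight bipartite matching between tasks (rows)
    and shareable GPU slots (columns).

    cost[i][j] is an estimated interference penalty (e.g. predicted
    slowdown) if task i is co-located on slot j; lower is better.
    Returns the matching as (task, slot) pairs and the total cost.
    """
    rows, cols = linear_sum_assignment(cost)
    total = cost[rows, cols].sum()
    return list(zip(rows.tolist(), cols.tolist())), float(total)

# Toy example: 3 tasks, 3 GPU slots, hypothetical interference costs.
cost = np.array([
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
])
matching, total_cost = assign_tasks_to_gpu_slots(cost)
# matching → [(0, 1), (1, 0), (2, 2)], total_cost → 5.0
```

In an online setting such as the one the abstract describes, the cost matrix would be rebuilt at each time slice from the updated interference model and the matching re-solved, so that the co-location combinations can adapt as tasks arrive and finish.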


Key words: cloud computing, deep learning, resource scheduling, GPU sharing, task interference