• A journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (3): 389-397.

• High Performance Computing •

A GPU-sharing-based scheduling framework for accelerating deep learning training tasks

LIN Chenxi,LI Jialun,MO Xuan,ZHOU Jieying,WU Weigang   

  (1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China;
   2. School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou 510665, China)
  • Received:2025-07-10 Revised:2025-08-20 Online:2026-03-25 Published:2026-03-25

Abstract: Deep learning (DL) is increasingly being applied across a wide range of business scenarios. How to efficiently utilize GPU cluster resources for training DL tasks and reduce task completion time has garnered sustained attention from both industry and academia. A single DL training task often fails to fully leverage all the computational resources of a GPU, so the exclusive GPU allocation used by traditional schedulers leads to low resource utilization. This paper proposes a GPU-sharing-based task scheduling framework, G-Share, which allows multiple DL tasks to be trained on the same GPU simultaneously, enabling co-location scheduling. Task scheduling and resource allocation are performed while remaining aware of the interference between co-located tasks, with the aim of enhancing GPU utilization and thereby accelerating task execution. Specifically, G-Share first characterizes the mutual interference between tasks through offline modeling and online updates, and models the GPU-sharing-based scheduling problem as a minimum-weight matching problem on a weighted bipartite graph. Solving this problem yields the resource allocation, and a dynamic task scheduling mechanism combined with time-slicing tracks changes in the optimal co-location combinations of tasks in online scenarios. Experiments conducted on DL task workload data from SenseTime demonstrate that G-Share achieves a 20.6% reduction in average task completion time compared to benchmark methods.
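To illustrate the matching step the abstract describes, the sketch below solves a minimum-weight perfect matching on a small bipartite graph by brute force. The cost matrix, function name, and all values are hypothetical stand-ins for the paper's interference-aware costs; a real scheduler would use a polynomial-time method such as the Hungarian algorithm rather than enumerating permutations.

```python
from itertools import permutations

def min_weight_matching(cost):
    """Brute-force minimum-weight perfect matching on a bipartite graph.

    cost[i][j] is a hypothetical interference-aware cost of placing
    task i on GPU slot j. Returns (assignment, total_cost), where
    assignment[i] is the slot chosen for task i. Exponential in the
    number of tasks; suitable only for tiny illustrative instances.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost

# Hypothetical 3x3 cost matrix: rows = tasks, columns = GPU slots.
cost = [
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
]
assignment, total = min_weight_matching(cost)
print(assignment, total)  # -> [1, 0, 2] 5.0
```

The same problem is what an O(n^3) assignment solver (e.g. the Hungarian algorithm) computes; the brute-force version is shown only because it makes the objective, minimizing the summed co-location cost over a one-to-one assignment, explicit.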


Key words: cloud computing, deep learning, resource scheduling, GPU sharing, task interference