• A publication of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (3): 389-397.

• High Performance Computing •

• Funding:
    Natural Science Foundation of Guangdong Province (2025A1515011663)

A GPU-sharing-based scheduling framework for accelerating deep learning training tasks

LIN Chenxi, LI Jialun, MO Xuan, ZHOU Jieying, WU Weigang

  (1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China;
    2. School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou 510665, China)
  • Received: 2025-07-10  Revised: 2025-08-20  Online: 2026-03-25  Published: 2026-03-25


Abstract: Deep learning (DL) is increasingly applied across a wide range of business scenarios. How to efficiently utilize GPU cluster resources for training DL tasks and reduce task completion time has garnered sustained attention from both industry and academia. A single DL training task often fails to fully utilize all the computational resources of a GPU, and the exclusive GPU allocation of traditional schedulers leads to low resource utilization. This paper proposes a GPU-sharing-based task scheduling framework, G-Share, which allows multiple DL tasks to be trained on the same GPU simultaneously, i.e., co-location scheduling. Task scheduling and resource allocation are performed with awareness of the interference between co-located tasks, so as to improve GPU utilization and thereby accelerate task execution. Specifically, G-Share first characterizes the mutual interference between tasks through offline modeling and online updates, and formulates the GPU-sharing scheduling problem as a minimum-weight bipartite matching problem. Solving this problem yields the resource allocation, and a time-slicing mechanism enables dynamic task scheduling that tracks changes in the optimal co-location combinations in online scenarios. Experiments on a DL task workload dataset from SenseTime demonstrate that G-Share achieves a 20.6% reduction in average task completion time compared with the baseline methods.
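The allocation step described in the abstract — casting GPU-sharing assignment as a minimum-weight bipartite matching between tasks and shareable GPU slots — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the interference cost values are made up, and the use of SciPy's `linear_sum_assignment` solver is an assumption about how such a matching could be computed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_tasks_to_gpu_slots(cost):
    """Solve a minimum-weight bipartite matching between tasks (rows)
    and shareable GPU slots (columns).

    cost[i][j] is an estimated interference penalty (e.g. predicted
    slowdown) if task i is co-located on slot j; lower is better.
    Returns the matching as (task, slot) pairs and the total cost.
    """
    rows, cols = linear_sum_assignment(cost)
    total = cost[rows, cols].sum()
    return list(zip(rows.tolist(), cols.tolist())), float(total)

# Toy example: 3 tasks, 3 GPU slots, hypothetical interference costs.
cost = np.array([
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
])
matching, total_cost = assign_tasks_to_gpu_slots(cost)
# matching → [(0, 1), (1, 0), (2, 2)], total_cost → 5.0
```

In an online setting such as the one the abstract describes, the cost matrix would be rebuilt at each time slice from the updated interference model and the matching re-solved, so that the co-location combinations can adapt as tasks arrive and finish.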


Key words: cloud computing, deep learning, resource scheduling, GPU sharing, task interference