• A journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (3): 389-397.

• High Performance Computing •

A GPU-sharing-based scheduling framework for accelerating deep learning training tasks

LIN Chenxi,LI Jialun,MO Xuan,ZHOU Jieying,WU Weigang   

  (1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China;
   2. School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou 510665, China)
  • Received:2025-07-10 Revised:2025-08-20 Online:2026-03-25 Published:2026-03-25

Abstract: Deep learning (DL) is increasingly being applied across a wide range of business scenarios. How to efficiently utilize GPU cluster resources for training DL tasks and reduce task completion time has garnered sustained attention from both industry and academia. A single DL training task often fails to fully leverage all the computational resources of a GPU, so the exclusive GPU allocation used by traditional schedulers leads to low resource utilization. This paper proposes a GPU-sharing-based task scheduling framework, G-Share, which allows multiple DL tasks to be trained on the same GPU simultaneously, enabling co-location scheduling. Task scheduling and resource allocation are performed while remaining aware of the interference between co-located tasks, with the aim of enhancing GPU utilization and thereby accelerating task execution. Specifically, G-Share first characterizes the mutual interference between tasks through offline modeling and online updates, and models the GPU-sharing-based scheduling problem as a minimum-weight matching problem on a weighted bipartite graph. Solving this problem yields the resource allocation, and a dynamic task scheduling mechanism combined with time-slicing tracks changes in the optimal co-location combinations of tasks in online scenarios. Experiments conducted on DL task workload data from SenseTime demonstrate that G-Share achieves a 20.6% reduction in average task completion time compared to benchmark methods.
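To illustrate the matching step the abstract describes, the sketch below solves a minimum-weight perfect matching on a small bipartite graph by brute force. The cost matrix, function name, and all values are hypothetical stand-ins for the paper's interference-aware costs; a real scheduler would use a polynomial-time method such as the Hungarian algorithm rather than enumerating permutations.

```python
from itertools import permutations

def min_weight_matching(cost):
    """Brute-force minimum-weight perfect matching on a bipartite graph.

    cost[i][j] is a hypothetical interference-aware cost of placing
    task i on GPU slot j. Returns (assignment, total_cost), where
    assignment[i] is the slot chosen for task i. Exponential in the
    number of tasks; suitable only for tiny illustrative instances.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost

# Hypothetical 3x3 cost matrix: rows = tasks, columns = GPU slots.
cost = [
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
]
assignment, total = min_weight_matching(cost)
print(assignment, total)  # -> [1, 0, 2] 5.0
```

The same problem is what an O(n^3) assignment solver (e.g. the Hungarian algorithm) computes; the brute-force version is shown only because it makes the objective, minimizing the summed co-location cost over a one-to-one assignment, explicit.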


Key words: cloud computing, deep learning, resource scheduling, GPU sharing, task interference