Research on Heterogeneous Task Scheduling Algorithms for Distributed Training

Computer Engineering & Science, 2021, Vol. 43, Issue (7): 1160-1167.

Scheduling of heterogeneous tasks for distributed training

YANG Jian-wei (杨坚伟), MENG Min (孟敏), HUANG Jia-le (黄家乐), WU Ji-gang (武继刚)

(School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China)

Received: 2020-09-10  Revised: 2020-11-13  Online: 2021-07-25  Published: 2021-08-16

Funding: National Natural Science Foundation of China (61702114); Natural Science Foundation of Guangdong Province (2020A1515011361)

Abstract

Abstract: Workers in distributed machine learning often have to process heterogeneous tasks during training. However, a task publisher may lack the prior knowledge needed to determine which workers in an edge server (ES) cluster are currently training. To address the problem that an ES cluster cannot maximize training performance and quality of service at the same time, a scheduling algorithm for heterogeneous tasks is proposed. First, the factors that influence the convergence performance of distributed training are analyzed under the cluster's resource constraints. Second, an optimization objective that maximizes training performance is established. Finally, the optimization problem is transformed into a multidimensional multiple-choice knapsack problem and solved. Simulation results show that the proposed scheduling algorithm maximizes distributed training performance while ensuring quality of service.

Key words: distributed training, training performance, scheduling of heterogeneous tasks, multidimensional multiple-choice knapsack problem, convergence analysis
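
The abstract's last step casts scheduling as a multidimensional multiple-choice knapsack problem (MMKP): exactly one item is chosen from each group, every resource dimension has a capacity, and total value is maximized. The sketch below illustrates that problem class only. The abstract does not give the paper's formulation, so reading a group as a worker, an item as one of its operating modes, the two resource dimensions, and all numbers are assumptions made for illustration. The exhaustive search is a stand-in for tiny instances, not the authors' solution method.

```python
from itertools import product


def solve_mmkp(groups, capacity):
    """Exhaustively solve a tiny MMKP instance.

    groups:   list of groups; each group is a list of (value, resources)
              items, where resources has one entry per resource dimension.
    capacity: tuple of per-dimension resource limits.
    Returns (best_value, choice), where choice[i] is the index of the item
    picked from group i, or (None, None) if no selection is feasible.
    """
    best_value, best_choice = None, None
    # Exactly one item per group: the "multiple-choice" constraint.
    for choice in product(*(range(len(g)) for g in groups)):
        picked = [groups[i][j] for i, j in enumerate(choice)]
        # Per-dimension usage must fit: the "multidimensional" constraint.
        used = [sum(item[1][d] for item in picked)
                for d in range(len(capacity))]
        if any(u > c for u, c in zip(used, capacity)):
            continue
        value = sum(item[0] for item in picked)
        if best_value is None or value > best_value:
            best_value, best_choice = value, choice
    return best_value, best_choice


# Hypothetical toy instance: 3 workers (groups), 2 modes each.
# Each item is (value, (cpu_units, bandwidth_units)); all numbers are made up.
groups = [
    [(5, (2, 1)), (8, (4, 3))],
    [(4, (1, 2)), (7, (3, 3))],
    [(3, (1, 1)), (6, (3, 2))],
]
capacity = (8, 6)
best_value, best_choice = solve_mmkp(groups, capacity)
```

On this toy instance the search returns a value of 18 with choice (0, 1, 1), which uses the full capacity in both dimensions. Enumeration grows exponentially with the number of groups, so realistic instances would need a heuristic or an integer programming solver.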

YANG Jian-wei, MENG Min, HUANG Jia-le, WU Ji-gang. Scheduling of heterogeneous tasks for distributed training[J]. Computer Engineering & Science, 2021, 43(7): 1160-1167.
