An automatic model splitting strategy generation method for model parallel training

Computer Engineering & Science, 2020, Vol. 42, Issue (09): 1529-1537.

WANG Li1, GUO Zhen-hua1, CAO Fang1, GAO Kai1, ZHAO Ya-qian1, ZHAO Kun2

(1. State Key Laboratory of High-End Server & Storage Technology, Inspur Electronic Information Industry Co., Ltd., Jinan 250000, Shandong, China;

2. Guangdong Inspur Big Data Research Co., Ltd., Guangzhou 510000, Guangdong, China)

Received: 2020-04-08 Revised: 2020-06-11 Accepted: 2020-09-25 Online: 2020-09-25 Published: 2020-09-24

Abstract

Abstract: As training datasets grow and models become more complex, the cost of training deep neural networks keeps rising and places higher demands on the computing power of the platform, so parallelizing model training has become an urgent need for improving the timeliness of its applications. In recent years, AI accelerators for distributed training (such as FPGAs, TPUs, and AI chips) have emerged in large numbers, providing the hardware foundation for parallel training of deep neural networks. To make full use of these hardware resources, researchers need to perform model-parallel training of neural networks on computing platforms that combine AI accelerators with different computing power and different hardware architectures. How to use the computing resources of the various AI accelerators efficiently, and how to balance the training load across them, has therefore long been a topic of interest to researchers. This paper proposes an automatic model splitting strategy generation method for model parallel training. The method automatically generates a model splitting strategy from a static network model and maps that strategy onto model training, thereby assigning network layers to different AI accelerators. The model allocation strategy generated by this method makes efficient use of all computing resources on a single computing platform and keeps the training load balanced across devices. Compared with the manual splitting strategies currently in use, it is more timely, reducing the time needed to generate a splitting strategy by a factor of more than 100, and it reduces the uncertainty introduced by human factors.

Key words: model parallelism, model training, model splitting, load balancing
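
The abstract does not spell out how the splitting strategy is computed, so the following is only a minimal sketch of the general idea of load-balanced layer assignment, not the authors' method. It assumes each layer of the static model has a known cost estimate and each accelerator has a known relative speed; the function names, the device names "gpu" and "fpga", and the midpoint placement rule are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence


def split_layers(layer_costs: Sequence[float],
                 device_speeds: Dict[str, float]) -> Dict[str, List[int]]:
    """Assign contiguous layer indices to devices in proportion to device speed.

    Illustrative assumption: layer_costs are per-layer cost estimates taken
    from a static model, and device_speeds are relative throughputs.
    """
    devices = list(device_speeds)
    total_cost = sum(layer_costs)
    total_speed = sum(device_speeds.values())

    # Cumulative cost at which each device's proportional share should end.
    boundaries = []
    running_share = 0.0
    for name in devices:
        running_share += device_speeds[name] / total_speed
        boundaries.append(running_share * total_cost)

    plan: Dict[str, List[int]] = {name: [] for name in devices}
    current = 0
    cumulative = 0.0
    for index, cost in enumerate(layer_costs):
        # Place each layer according to the midpoint of its cost interval.
        midpoint = cumulative + cost / 2.0
        while current < len(devices) - 1 and midpoint > boundaries[current]:
            current += 1
        plan[devices[current]].append(index)
        cumulative += cost
    return plan


def estimated_times(layer_costs: Sequence[float],
                    device_speeds: Dict[str, float],
                    plan: Dict[str, List[int]]) -> Dict[str, float]:
    """Estimated time per device: assigned cost divided by device speed."""
    return {name: sum(layer_costs[i] for i in layers) / device_speeds[name]
            for name, layers in plan.items()}


if __name__ == "__main__":
    costs = [4, 4, 2, 2, 2, 2]            # hypothetical per-layer costs
    speeds = {"gpu": 2.0, "fpga": 1.0}    # hypothetical relative speeds
    plan = split_layers(costs, speeds)
    print(plan)                            # {'gpu': [0, 1, 2], 'fpga': [3, 4, 5]}
    print(estimated_times(costs, speeds, plan))  # {'gpu': 5.0, 'fpga': 6.0}
```

In this toy case the faster device receives the three most expensive layers and the estimated per-device times come out close (5.0 versus 6.0). A real strategy generator would also need to account for inter-device communication and memory limits, which this sketch ignores.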

WANG Li, GUO Zhen-hua, CAO Fang, GAO Kai, ZHAO Ya-qian, ZHAO Kun. An automatic model splitting strategy generation method for model parallel training[J]. Computer Engineering & Science, 2020, 42(09): 1529-1537.
