• Official journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2020, Vol. 42 ›› Issue (09): 1529-1537.


An automatic model splitting strategy generation method for model parallel training

WANG Li1, GUO Zhen-hua1, CAO Fang1, GAO Kai1, ZHAO Ya-qian1, ZHAO Kun2

  1. State Key Laboratory of High-End Server & Storage Technology, Inspur Electronic Information Industry Co., Ltd., Jinan 250000, China;

    2. Guangdong Inspur Big Data Research Co., Ltd., Guangzhou 510000, China

  • Received: 2020-04-08  Revised: 2020-06-11  Accepted: 2020-09-25  Online: 2020-09-25  Published: 2020-09-24

Abstract: As training data grow in scale and models become more complex, the cost of training deep neural networks keeps rising, placing ever higher computational demands on the computing platform. In recent years, AI accelerators for heterogeneous distributed training (such as FPGAs, TPUs, and AI chips) have emerged one after another, providing the hardware foundation for parallelizing deep neural networks. To make full use of all kinds of hardware resources, researchers need to train neural network models on computing platforms equipped with AI accelerators of different computing power and hardware architectures. Therefore, in model parallel training, how to efficiently use the computing resources of the various AI accelerators and balance the training workload across them is a hot issue of concern to researchers. This paper proposes a method that automatically generates a model splitting strategy from a static network model and maps that strategy to model training, thereby assigning network layers to different AI accelerators. The model splitting strategy generated by this method can efficiently utilize all computing resources on a single computing platform and ensure load balancing of model training tasks among the devices. Compared with the current manual splitting strategy, it is far more timely, cutting the strategy generation time by a factor of more than 100, and it reduces the uncertainty caused by human factors.
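The layer-to-accelerator assignment described above can be viewed, in its simplest form, as a linear partition problem: split a chain of layers into contiguous groups, one per device, so that the most heavily loaded device carries as little work as possible. The sketch below is an illustration of that idea only, not the paper's actual algorithm; the layer costs, device count, and function name `split_layers` are all hypothetical.

```python
def split_layers(costs, k):
    """Partition the layer-cost sequence `costs` into k contiguous groups,
    minimizing the largest group's total cost (linear partition DP).
    Returns a list of (start, end) index pairs, one per device."""
    n = len(costs)
    prefix = [0] * (n + 1)
    for i, c in enumerate(costs):
        prefix[i + 1] = prefix[i] + c

    INF = float("inf")
    # dp[j][i]: minimal achievable max-group cost covering the first i
    # layers with j groups; cut[j][i] remembers where the last group starts.
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            for m in range(j - 1, i):
                cand = max(dp[j - 1][m], prefix[i] - prefix[m])
                if cand < dp[j][i]:
                    dp[j][i] = cand
                    cut[j][i] = m

    # Walk the cut table backwards to recover the group boundaries.
    bounds, i = [], n
    for j in range(k, 0, -1):
        m = cut[j][i]
        bounds.append((m, i))
        i = m
    return list(reversed(bounds))

# Example: 8 layers with uneven costs mapped onto 3 accelerators.
groups = split_layers([4, 2, 7, 1, 3, 5, 2, 6], 3)
```

Because the split depends only on per-layer cost estimates taken from the static network model, a strategy like this can be computed once before training starts, which is what makes an automatic approach so much faster than hand-tuning the split.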



Key words: model parallelism, model training, model splitting, load balancing