
Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (01): 42-48.

• High Performance Computing •

A model parallel training optimization algorithm for hybrid heterogeneous platforms

GAO Kai1, GUO Zhen-hua1, CHEN Yong-fang1, WANG Li1, ZHAO Ya-qian1, ZHAO Kun2

  1. (1. State Key Laboratory of High-End Server & Storage Technology, Inspur Electronic Information Industry Co., Ltd., Jinan, Shandong 250000, China;
     2. Guangdong Inspur Big Data Research Co., Ltd., Guangzhou, Guangdong 510000, China)
  • Received: 2020-04-15  Revised: 2020-06-22  Accepted: 2021-01-25  Online: 2021-01-25  Published: 2021-01-22
  • Funding:
    Key Research and Development Program of Shandong Province (Major Science and Technology Innovation Project) (2019JZZY011101); Natural Science Foundation of Shandong Province (ZR2018BF011)


Abstract: With the development of hybrid heterogeneous platforms, various types of acceleration devices have emerged. How to make full use of these different types of devices in a hybrid heterogeneous platform, and how to deploy deep learning models across multiple computing devices in order to train large and complex models, is becoming more and more important. Data parallelism (DP) is the most widely used parallelization strategy, but as the number of devices in data parallel training keeps growing, the communication overhead between devices becomes a bottleneck. In addition, differences in the number of samples processed per step, caused by differences in device performance, lead to a loss of accuracy; that is, a longer training period is required to converge to the desired accuracy. These factors affect the overall training time as well as the operating efficiency of some devices. Besides data parallelism (DP), each training step can be accelerated by model parallelism (MP). This paper proposes a model parallel training optimization algorithm suitable for hybrid heterogeneous platforms. First, to address the uneven distribution of device performance in hybrid heterogeneous platforms, a model partitioning strategy that mixes layer-wise parallelism and channel parallelism is proposed; at the same time, some low-performance devices are merged to shorten the pipeline and ease the communication pressure. Then, to improve pipelining across devices, the paper analyzes how the share of pipeline setup time and the device utilization affect the overall training time, and proposes a micro-batch division method that balances the two. Experiments show that the optimized model parallel pipeline training algorithm achieves a better speedup than the traditional model parallel algorithm: the training speedup on a heterogeneous platform with a single type of device is improved by about 4%, and the training speedup on the hybrid heterogeneous platform is improved by about 7% compared with the version without the optimization.
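
The abstract only outlines the two ideas; the paper's actual algorithm is not reproduced here. As a rough, hypothetical Python sketch of what it describes, assuming performance-weighted layer partitioning that folds low-performance devices into a single pipeline stage, and a micro-batch count chosen to balance pipeline fill time against per-device utilization, one might write something like the following. All identifiers (partition_layers, choose_micro_batches, merge_threshold, and so on) are illustrative assumptions, not names from the paper.

# Hypothetical illustration of the two ideas summarized in the abstract.
# None of the names, thresholds, or formulas below come from the paper.

from itertools import accumulate

def partition_layers(layer_costs, device_speeds, merge_threshold=0.5):
    """Assign contiguous layer ranges to pipeline stages in proportion to device speed.

    Devices whose speed falls below merge_threshold times the fastest device are
    merged into one combined stage, shortening the pipeline as the abstract suggests.
    """
    fastest = max(device_speeds)
    stages = []          # (stage_name, effective_speed)
    slow_pool = 0.0
    for i, speed in enumerate(device_speeds):
        if speed / fastest < merge_threshold:
            slow_pool += speed
        else:
            stages.append((f"dev{i}", speed))
    if slow_pool > 0:
        stages.append(("merged_slow_devices", slow_pool))

    # Give each stage a share of the total layer cost proportional to its speed.
    total_cost = sum(layer_costs)
    total_speed = sum(speed for _, speed in stages)
    cum_cost = list(accumulate(layer_costs))
    assignment, start, target = [], 0, 0.0
    for name, speed in stages:
        target += total_cost * speed / total_speed
        end = next((j + 1 for j in range(start, len(cum_cost)) if cum_cost[j] >= target),
                   len(layer_costs))
        assignment.append((name, list(range(start, end))))
        start = end
    return assignment

def pipeline_bubble_fraction(num_stages, num_micro_batches):
    """Idealized share of time spent filling and draining the pipeline."""
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)

def choose_micro_batches(batch_size, num_stages, min_micro_batch=8):
    """Pick a micro-batch count that balances bubble time against per-device utilization."""
    best_m, best_score = 1, float("inf")
    for m in range(1, batch_size + 1):
        if batch_size % m != 0 or batch_size // m < min_micro_batch:
            continue
        bubble = pipeline_bubble_fraction(num_stages, m)
        util_penalty = min_micro_batch / (batch_size // m)   # grows as micro-batches shrink
        score = bubble + util_penalty
        if score < best_score:
            best_m, best_score = m, score
    return best_m

if __name__ == "__main__":
    # Four devices, two of them much slower; ten layers of varying cost.
    plan = partition_layers(layer_costs=[1, 2, 2, 3, 3, 4, 4, 3, 2, 1],
                            device_speeds=[10.0, 9.0, 3.0, 2.0])
    print(plan)
    print(choose_micro_batches(batch_size=256, num_stages=len(plan)))

In this sketch the term (num_stages - 1) / (num_micro_batches + num_stages - 1) is the idealized share of time a pipeline spends filling and draining; increasing the micro-batch count shrinks it, while the utilization penalty discourages micro-batches so small that each device is underfed, which is one simple way to trade the two factors the abstract mentions against each other.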




Key words: hybrid heterogeneous, model parallel, micro-batch, device difference