
Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (11): 1949-1959.

• High Performance Computing •

A heterogeneous differential synchronous parallel training algorithm

HUANG Shan1,2,3, WU Yu-fan1,2,3, LÜ He-xuan1,2,3, DUAN Xiao-dong1,2,3

  (1. College of Computer Science and Engineering, Dalian Minzu University, Dalian 116650;
    2. State Ethnic Affairs Commission Key Laboratory of Big Data Applied Technology (Dalian Minzu University), Dalian 116650;
    3. Dalian Key Laboratory of Digital Technology for National Culture (Dalian Minzu University), Dalian 116650, China)

  • Received: 2023-12-19  Revised: 2024-01-18  Accepted: 2024-11-25  Online: 2024-11-25  Published: 2024-11-27
  • Funding:
    National Key Research and Development Program of China, Cloud Computing and Big Data Key Special Project (2018YFB1004402)

Abstract: Back propagation neural network (BPNN) is widely used in fields such as behavior recognition and prediction, owing to its strong nonlinear modeling, self-learning, adaptive, and fault-tolerance capabilities. As models are upgraded and optimized and data volumes grow rapidly, parallel training architectures built on distributed big data computing frameworks have become mainstream. Apache Flink, a new-generation big data computing framework, is widely adopted for its high throughput and low latency. However, because hardware is replaced at an accelerating pace and purchased in different batches, real-world Flink clusters are mostly heterogeneous, meaning that computing resources within a cluster are unbalanced. Existing BPNN parallel training models cannot prevent high-performance nodes from idling during training under such unbalanced resources. In addition, in a heterogeneous environment the communication overhead between nodes grows as the number of nodes increases. Traditional mini-batch gradient descent offers good optimization quality, but random model initialization combined with the mini-batch nature of the updates leads to slow convergence in parallel BPNN training. To address these issues, and to accelerate BPNN parallel training and improve its efficiency in heterogeneous environments, this paper proposes the heterogeneous micro-difference synchronous parallel training (HMDSPT) algorithm. HMDSPT scores each node's performance in the heterogeneous environment and, through a data partitioning module, dynamically allocates data in real time in proportion to those scores, so that the amount of data assigned to a node is proportional to its performance. This reduces the idle time of high-performance nodes.
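The score-proportional data allocation described in the abstract can be sketched in a few lines of Java (the language of Flink's native API). This is a minimal illustration under stated assumptions, not the paper's implementation: the class name ProportionalAllocator, the largest-remainder rounding, and the example scores are all hypothetical, since the abstract does not detail the scoring formula or the Flink data partitioning module.

```java
import java.util.Arrays;

/**
 * Illustrative sketch only: splits a mini-batch across heterogeneous nodes
 * in proportion to per-node performance scores, so faster nodes receive
 * more records and spend less time idling. All names here are assumptions.
 */
public class ProportionalAllocator {

    /**
     * Split totalRecords across nodes in proportion to their scores, using
     * largest-remainder rounding so the shares sum exactly to totalRecords.
     */
    static int[] allocate(double[] scores, int totalRecords) {
        double sum = Arrays.stream(scores).sum();
        int[] shares = new int[scores.length];
        double[] remainders = new double[scores.length];
        int assigned = 0;
        for (int i = 0; i < scores.length; i++) {
            double exact = totalRecords * scores[i] / sum;
            shares[i] = (int) Math.floor(exact);
            remainders[i] = exact - shares[i];
            assigned += shares[i];
        }
        // Hand leftover records to the nodes with the largest remainders.
        for (int left = totalRecords - assigned; left > 0; left--) {
            int best = 0;
            for (int i = 1; i < scores.length; i++) {
                if (remainders[i] > remainders[best]) best = i;
            }
            shares[best]++;
            remainders[best] = 0;
        }
        return shares;
    }

    public static void main(String[] args) {
        // Hypothetical performance scores for a 4-node heterogeneous cluster.
        double[] scores = {1.0, 1.0, 2.5, 4.0};
        // Prints [1177, 1176, 2941, 4706]: shares sum to 10000, and each
        // node's share is proportional to its score.
        System.out.println(Arrays.toString(allocate(scores, 10000)));
    }
}
```

In a Flink job, the same proportional split could be applied when distributing training records to workers; the abstract only states that partitioning is dynamic, real-time, and proportional to the node scores.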

Key words: Flink, back propagation neural network (BPNN), parallel training, heterogeneous environment