A data skew correction scheduling strategy of heterogeneous Spark cluster

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (04): 620-630.

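BIAN Chen1, XIU Wei-rong2, YU Jiong3

(1. College of Internet Finance and Information Engineering, Guangdong University of Finance, Guangzhou, Guangdong 510521;
2. School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou, Guangdong 511363;
3. College of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046, China)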

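Received: 2021-11-09 Revised: 2021-12-15 Accepted: 2022-04-25 Online: 2022-04-25 Published: 2022-04-20
Funding:
National Natural Science Foundation of China (61862060, 61902081); Guangzhou Philosophy and Social Science Planning Project (2021GZGJ145)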

Abstract: Heterogeneous Spark clusters suffer from a barrel effect (overall progress is bounded by the slowest worker): an unsuitable degree of parallelism leaves task allocation poorly matched to the computing power of the workers, which degrades cluster computational efficiency and resource utilization. To address this issue, node resource models are first established, the coupling among data distribution, parallelism parameters and task allocation is analyzed, and the optimization objective of the algorithm is proposed. A data skew correction scheduling strategy for heterogeneous Spark clusters, DSCS, is then designed, comprising a parallelism prediction algorithm, a data skew correction algorithm and a heterogeneous task allocation algorithm. The prediction algorithm sets the degree of parallelism in advance; the data skew correction algorithm re-partitions the data and corrects the parallelism according to the statistics of the first stage; and the heterogeneous task allocation algorithm assigns tasks to workers according to their computing capabilities. Together these steps improve the match between data volume and worker computing power and optimize the overall performance of the Spark cluster. Experimental results show that the algorithm improves performance across different job types and datasets and effectively reduces the probability that workers spill to external storage.

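The abstract does not give the details of the heterogeneous task allocation algorithm, so the following is only a minimal sketch of the general idea of matching task sizes to worker computing capability. It is not the DSCS algorithm itself. The function name `allocate_tasks`, the relative `worker_speeds` values and the greedy "largest task first, earliest finish" rule are all assumptions chosen for illustration.

```python
def allocate_tasks(task_sizes, worker_speeds):
    """Greedy capability-aware allocation (illustrative, not the paper's DSCS).

    task_sizes:    data volume of each task/partition (arbitrary units)
    worker_speeds: hypothetical relative computing power of each worker
    Returns (assignment, finish), where assignment maps a worker index to the
    list of task ids it receives and finish holds each worker's estimated
    completion time.
    """
    assignment = {w: [] for w in range(len(worker_speeds))}
    finish = [0.0] * len(worker_speeds)

    # Place the largest tasks first so skewed partitions are handled early.
    ordered = sorted(enumerate(task_sizes), key=lambda pair: -pair[1])
    for task_id, size in ordered:
        # Choose the worker that would complete this task soonest,
        # taking both its current load and its speed into account.
        best = min(
            range(len(worker_speeds)),
            key=lambda w: finish[w] + size / worker_speeds[w],
        )
        assignment[best].append(task_id)
        finish[best] += size / worker_speeds[best]
    return assignment, finish


if __name__ == "__main__":
    # One skewed partition (size 8) and two workers, the first twice as fast.
    assignment, finish = allocate_tasks([8, 4, 4, 2, 2], [2.0, 1.0])
    print(assignment, finish)
```

In this toy run the faster worker absorbs the skewed partition, and the two workers finish at similar estimated times instead of the slower one becoming the bottleneck.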

Keywords: Spark, parallel scheduling, data partitioning, heterogeneous cluster, data skew

BIAN Chen, XIU Wei-rong, YU Jiong. A data skew correction scheduling strategy of heterogeneous Spark cluster[J]. Computer Engineering & Science, 2022, 44(04): 620-630.