• Official journal of the China Computer Federation (CCF)
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (04): 620-630.

• High Performance Computing •

A data skew correction scheduling strategy of heterogeneous Spark cluster

BIAN Chen1, XIU Wei-rong2, YU Jiong3

  1. College of Internet Finance and Information Engineering, Guangdong University of Finance, Guangzhou 510521, China;
  2. School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China;
  3. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China

Abstract: Because of the barrel effect in heterogeneous Spark clusters (the weakest worker limits overall performance), a poorly chosen degree of parallelism leaves task allocation mismatched with the computing power of the workers, which reduces computational efficiency and resource utilization. To address this problem, node resource models are first established, the coupling among data distribution, parallelism parameters, and task allocation is analyzed, and the optimization objective is defined. A data skew correction scheduling strategy (DSCS) for heterogeneous Spark clusters is then designed, consisting of a parallelism prediction algorithm, a data skew correction algorithm, and a heterogeneous task allocation algorithm. The prediction algorithm sets the degree of parallelism in advance; the data skew correction algorithm repartitions the data and corrects the parallelism using statistics collected from the first stage; and the heterogeneous task allocation algorithm assigns tasks to workers according to their computing capabilities, so that the data volume assigned to each worker better matches its capacity and the overall performance of the cluster improves. Experimental results show that the strategy improves performance across different job types and datasets, and effectively reduces the probability that workers spill data to external storage.
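The idea of matching data volume to worker capability can be illustrated with a small sketch. This is not the DSCS algorithm from the paper, whose details the abstract does not give. It is a generic greedy heuristic: the largest partition goes first, to the worker that would finish it earliest. The names `allocate_tasks`, `skew_ratio`, and the capacity units are assumptions made for illustration only.

```python
def skew_ratio(partition_sizes):
    """Largest partition divided by the mean partition size (1.0 means no skew)."""
    sizes = list(partition_sizes.values())
    return max(sizes) / (sum(sizes) / len(sizes))


def allocate_tasks(partition_sizes, worker_capacity):
    """Greedy capability-aware allocation (illustrative, not the paper's method).

    partition_sizes: task id -> data volume (hypothetical unit, e.g. MB)
    worker_capacity: worker id -> processing rate (hypothetical unit, MB per second)
    Returns (assignment, finish): tasks per worker and estimated finish time per worker.
    """
    load = {w: 0.0 for w in worker_capacity}
    assignment = {w: [] for w in worker_capacity}
    # Place the largest partitions first so that small ones can even out the load.
    ordered = sorted(partition_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for task, size in ordered:
        # Pick the worker whose estimated finish time after taking this task is lowest.
        best = min(
            worker_capacity,
            key=lambda w: ((load[w] + size) / worker_capacity[w], w),
        )
        assignment[best].append(task)
        load[best] += size
    finish = {w: load[w] / worker_capacity[w] for w in worker_capacity}
    return assignment, finish


# Hypothetical skewed partitions and a two-worker heterogeneous cluster.
partitions = {"t0": 80, "t1": 40, "t2": 40, "t3": 20, "t4": 20}
workers = {"fast": 2.0, "slow": 1.0}
assignment, finish = allocate_tasks(partitions, workers)
print(skew_ratio(partitions), assignment, finish)
```

In this toy setup the faster worker receives more data than the slower one, and the estimated finish times of the two workers end up close to each other, which is the kind of balance a capability-aware allocator aims for.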



  • Received: 2021-11-09  Revised: 2021-12-15  Accepted: 2022-04-25  Online: 2022-04-25  Published: 2022-04-20

Key words: Spark; parallel scheduling; data allocation; heterogeneous cluster; data skew