• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (06): 951-961.

• High Performance Computing •

Scheduling and optimization of multi-job execution in distributed environment

JI Hang-xu¹, JIANG Su¹, ZHAO Yu-hai¹, WU Gang¹, WANG Guo-ren²

  (1. School of Computer Science and Engineering, Northeastern University, Shenyang 110819; 2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China)

  • Received: 2020-10-03 Revised: 2020-12-30 Accepted: 2021-06-25 Online: 2021-06-25 Published: 2021-06-22
  • Supported by:
    Key R&D Program of the Ministry of Science and Technology (2018YFB1004402)

Abstract: Distributed big data computing engines are indispensable tools for research institutions, Internet companies, and government departments that process large-scale data. Their adoption has accelerated development in many fields and contributed greatly to social progress. When multiple jobs are processed together, however, current mainstream big data computing engines still fall short in resource allocation and job scheduling: they usually divide memory resources equally among the jobs and schedule them in first-in-first-out (FIFO) order, and such simple resource partitioning and job scheduling cannot make full use of system performance. To address this problem, improvements are made at the job level of the computing engine. (1) For resource partitioning, job features are extracted to estimate each job's workload, the gap between the estimated workload and the job's pre-allocated resources is assessed, and jobs that would waste a large share of cluster resources are merged so that computing resources are fully used. (2) For job scheduling, features are extracted from the jobs in the job pool, the jobs are clustered with a multi-way K-means algorithm, and, based on the clustering results, a self-balancing round-robin scheduling algorithm schedules the jobs to achieve load balancing. To verify the effectiveness of the proposed algorithms, comparative experiments were conducted in a distributed cluster environment using large-scale text data sets. The experimental results show that the proposed job merging and multi-job scheduling algorithms reduce job running time by 5% to 23%, improve system throughput by 7.5% to 29%, and, in the best case, reduce the number of threads started by 40%.


Key words: distributed, job merging, clustering, round-robin scheduling, Flink
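
The scheduling idea summarized in the abstract (cluster the jobs in the pool by their features, then take jobs from the clusters in turn) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the multi-way K-means variant or the self-balancing rule, so plain Lloyd's K-means and a plain round robin over clusters stand in for them. The feature tuples (input size in GB, number of operators) and the function names `kmeans` and `round_robin` are hypothetical.

```python
import random
from collections import deque


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in); returns k lists of point indices."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k jobs as the initial centers
    groups = []
    for _ in range(iters):
        # Assignment step: each job joins the cluster with the nearest center.
        groups = [[] for _ in range(k)]
        for i, p in enumerate(points):
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            groups[nearest].append(i)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c, members in enumerate(groups):
            if not members:
                new_centers.append(centers[c])  # an empty cluster keeps its center
                continue
            dims = zip(*(points[i] for i in members))
            new_centers.append(tuple(sum(d) / len(members) for d in dims))
        if new_centers == centers:  # converged: assignments can no longer change
            break
        centers = new_centers
    return groups


def round_robin(groups):
    """Interleave jobs by taking one job from each non-empty cluster in turn."""
    queues = deque(deque(g) for g in groups if g)
    order = []
    while queues:
        q = queues.popleft()
        order.append(q.popleft())
        if q:  # clusters that still hold jobs go to the back of the line
            queues.append(q)
    return order


# Hypothetical job features: (input size in GB, number of operators).
features = [(1, 2), (2, 3), (1.5, 2), (40, 12), (45, 14), (42, 13)]
groups = kmeans(features, k=2)
print(round_robin(groups))  # job indices in submission order
```

With two clearly separated groups of jobs, as in this toy input, the resulting order alternates between light and heavy jobs, so similar jobs are not submitted back to back. In practice the features would likely need normalizing before clustering, since the raw scales differ.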