• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2014, Vol. 36 ›› Issue (04): 571-578.

• Paper •

An Analytical Model for Minimizing the Total Makespan of Multiple MapReduce Jobs and Its Applications

田文洪1,2,陈瑜2,王心阳2,薛瑞尼2,赵勇2   

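  (1. School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054;
    2. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731)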
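  • Received: 2013-07-10 Revised: 2013-09-08 Online: 2014-04-25 Published: 2014-04-25
  • Funding:

    Supported by the National Natural Science Foundation of China (61150110486, 61272528); the Central Universities Fund (IDZYGX2013J073); and the 2013 CCF-Tencent Research Fund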

An analytical model and its applications for minimizing total makespan of multiple MapReduce jobs

TIAN Wenhong1,2,CHEN  Yu2,WANG Xinyang2,XUE Ruini2,ZHAO Yong2   

  (1. School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054;
    2. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China)
  • Received: 2013-07-10 Revised: 2013-09-08 Online: 2014-04-25 Published: 2014-04-25

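Abstract (translated from the Chinese):

As large-scale MapReduce clusters are widely used for big-data processing, and especially when multiple jobs must share the same Hadoop cluster, a key problem is how to minimize the cluster's working time and improve the service efficiency of MapReduce jobs. Multiple MapReduce jobs can be modeled as a single scheduling task, and we observe that their total makespan depends closely on the order in which the jobs are executed. The goal of this work is to design an analytical model for a job scheduling system that minimizes the total makespan of a batch of MapReduce jobs. We propose a better scheduling strategy and implementation method under which the whole scheduling system satisfies the conditions of the classical Johnson algorithm, so that the classical Johnson algorithm can be used to obtain the optimal total makespan in linear time. In addition, for the problem of balancing across two or more resource pools, we propose a linear-time solution that outperforms known approximate simulation schemes. The theoretical model can be applied to improving system response time, saving energy, and load balancing, and corresponding application examples confirm this.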

Keywords (translated from the Chinese): Hadoop, MapReduce, batch jobs, scheduling optimization, minimizing total makespan

Abstract:

As large-scale MapReduce clusters become widely adopted for processing huge amounts of data, one of the critical challenges is to improve the service quality of these clusters by minimizing their makespan. We consider a scheduling model for multiple MapReduce jobs and observe that the order in which the jobs are executed can have a significant impact on their overall makespan. The goal of this paper is to design a framework for an automatic job scheduler and to propose an analytical model for minimizing the makespan of such a set of MapReduce jobs. By adopting a better strategy and implementation, we meet the conditions of the classical Johnson algorithm and can use it to find the optimal solution. Under the proposed strategy, the balanced pools problem can be solved exactly in linear time, which improves on existing simulation-based approaches. The analytical results can be applied to improve system response time, energy efficiency, and load balancing in Hadoop cluster pools, and corresponding numerical examples validate our observations.

Key words: Hadoop; MapReduce; batch workloads; optimized schedule; minimized makespan
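
The abstract's observation that job order affects the overall makespan, and its appeal to the classical Johnson algorithm, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the textbook two-stage flow-shop abstraction: each job is reduced to one map duration and one reduce duration, all map stages run back to back on one map resource, and each reduce stage starts once its own map stage and the previous reduce stage have finished. The `Job` tuple layout, the function names, and the sample durations are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical job record: (name, map_time, reduce_time), in arbitrary time units.
Job = Tuple[str, float, float]


def johnson_order(jobs: List[Job]) -> List[Job]:
    """Order jobs by Johnson's rule for a two-stage flow shop.

    Jobs whose map stage is no longer than their reduce stage go first,
    sorted by increasing map time; the remaining jobs go last, sorted by
    decreasing reduce time.
    """
    first = sorted((j for j in jobs if j[1] <= j[2]), key=lambda j: j[1])
    last = sorted((j for j in jobs if j[1] > j[2]), key=lambda j: j[2], reverse=True)
    return first + last


def makespan(order: List[Job]) -> float:
    """Completion time of the last reduce stage for a given job order."""
    map_end = 0.0     # time at which the map resource becomes free
    reduce_end = 0.0  # time at which the reduce resource becomes free
    for _, map_time, reduce_time in order:
        map_end += map_time
        # A reduce stage waits for its own map stage and for the previous reduce.
        reduce_end = max(reduce_end, map_end) + reduce_time
    return reduce_end


# Illustrative durations only (assumed, not taken from the paper).
jobs: List[Job] = [
    ("J1", 4, 5),
    ("J2", 4, 1),
    ("J3", 30, 4),
    ("J4", 6, 30),
    ("J5", 2, 3),
]

submitted = makespan(jobs)                 # 77.0 for the submission order
optimized = makespan(johnson_order(jobs))  # 47.0 for the Johnson order
print([name for name, _, _ in johnson_order(jobs)], submitted, optimized)
```

With these assumed durations, reordering alone reduces the makespan from 77 to 47 time units, which equals the total map time plus the final reduce stage. The sorting step makes this particular sketch O(n log n); it does not attempt to reproduce the paper's own complexity analysis or its treatment of multiple resource pools.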