• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2016, Vol. 38 ›› Issue (01): 11-19.

• Article

Parameter optimization for Spark jobs based on runtime data analysis

CHEN Qiaoan1, LI Feng1, CAO Yue1, LONG Mingsheng1,2

  1. School of Software, Tsinghua University, Beijing 100084, China
  2. National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing 100084, China
  • Received: 2015-10-02 Revised: 2015-12-10 Online: 2016-01-25 Published: 2016-01-25
  • Funding:

    Big Data Science and Technology Special Project of the Tsinghua National Laboratory for Information Science and Technology (domain-oriented big data application system development and operation platform)

Abstract:

Runtime data is among the fastest-growing, most complex, and most valuable data resources in big data systems. By analyzing runtime data, software developers can extract important information about software quality and development models. As a distributed system, Spark generates a large amount of runtime data while running user applications, including log data, monitoring data, and graph representations of jobs, and developers can use this data to tune system parameters. However, Spark exposes many kinds of parameters whose effects are varied and hard to evaluate, so tuning is difficult for users without a thorough understanding of the system. This paper proposes the concept of a runtime data historical database, which stores feature information and run configurations of previously executed jobs, together with a parameter optimization model based on searching this database. Experiments show that the proposed model is effective at improving the performance of user jobs.

Key words: big data; runtime data; data analysis; parameter optimization; Spark
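
The abstract does not say how the historical database is searched, so the following is only an illustrative sketch of one possible reading, not the authors' model. It assumes each past job is summarized by a numeric feature vector, stored with the configuration it ran under and its observed runtime. It then recommends the configuration of the fastest run among the k most similar past jobs. The names `HistoryRecord`, `distance`, and `recommend_config`, the choice of features, and the Euclidean nearest-neighbor search are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import sqrt


@dataclass
class HistoryRecord:
    """One past run in a hypothetical historical database."""
    features: tuple[float, ...]  # assumed features: (input GB, shuffle GB, stage count)
    config: dict[str, str]       # Spark configuration used for that run
    runtime_s: float             # observed wall-clock runtime in seconds


def distance(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recommend_config(history: list[HistoryRecord],
                     job_features: tuple[float, ...],
                     k: int = 3) -> dict[str, str]:
    """Return the config of the fastest run among the k most similar past jobs."""
    if not history:
        raise ValueError("history is empty")
    # Rank past runs by similarity to the new job and keep the k closest.
    nearest = sorted(history, key=lambda r: distance(r.features, job_features))[:k]
    # Among similar jobs, prefer the configuration that ran fastest.
    return min(nearest, key=lambda r: r.runtime_s).config


# Made-up records for illustration only; these are not measurements from the paper.
HISTORY = [
    HistoryRecord((10.0, 2.0, 4.0),
                  {"spark.executor.memory": "2g", "spark.default.parallelism": "16"},
                  120.0),
    HistoryRecord((10.5, 2.2, 4.0),
                  {"spark.executor.memory": "4g", "spark.default.parallelism": "32"},
                  95.0),
    HistoryRecord((200.0, 80.0, 12.0),
                  {"spark.executor.memory": "8g", "spark.default.parallelism": "200"},
                  60.0),
]

if __name__ == "__main__":
    print(recommend_config(HISTORY, (10.2, 2.1, 4.0), k=2))
```

In practice the features would likely need normalizing before distances are compared, since raw magnitudes such as input size would otherwise dominate the search.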