• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Articles

A study of the impact of Linux kernel parameters on Spark workload performance

王利1,2,王晶1,2,张伟功2,3,邱柯妮2,3,陆克中4   

  1. (1.首都师范大学北京成像技术高精尖创新中心,北京 100048;2.首都师范大学信息工程学院,北京 100048;
    3.首都师范大学高可靠嵌入式系统技术北京市工程研究中心,北京 100048;
    4.深圳大学计算机与软件学院,广东 深圳 518060)
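  • Received: 2017-01-03 Revised: 2017-03-02 Published: 2017-07-25 Online: 2017-07-25
  • Funding:

    National Natural Science Foundation of China (61472260, 61402302, 61502321); Beijing Innovation Team Program (IDHT20150507); Beijing Science and Technology Program (KM201610028016); Guangdong Province Ministry-Province Industry-University-Research Project (2013B090500055); Shenzhen Basic Research Discipline Layout Project (JCYJ20150529164656096); National 863 Program of China (2015AA015305)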

Impact of Linux kernel parameters on Spark workloads

WANG Li1,2, WANG Jing1,2, ZHANG Wei-gong2,3, QIU Ke-ni2,3, LU Ke-zhong4

  1. (1.Beijing Advanced Innovation Center for Imaging Technology,Capital Normal University,Beijing 100048;
    2.College of Information Engineering,Capital Normal University,Beijing 100048;
    3.Beijing Engineering Research Center of High Reliable Embedded System,Capital Normal University,Beijing 100048;
    4.College of Computer Science & Software Engineering,Shenzhen University,Shenzhen 518060,China)
  • Received:2017-01-03 Revised:2017-03-02 Online:2017-07-25 Published:2017-07-25

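Abstract:

Research on Spark performance is becoming a hot topic, but tuning strategies mostly sit at the application layer rather than the system layer. As the first layer of software above the hardware, the operating system plays a fundamental role in how well hardware performance is realized. The Linux kernel exposes a rich set of parameters as interfaces for performance optimization, yet in practice these parameters are underused: people mostly keep the system defaults instead of adjusting them to the specific environment. Our experiments show, however, that the system defaults are not necessarily the best choice and are sometimes even the worst. We define the concept of an "influence ratio" and, based on it, propose a method for understanding how parameters affect Spark applications by analyzing the execution of kernel functions. Given Spark's in-memory computing model, we analyze the performance impact of the relevant kernel parameters on several typical Spark workloads from two aspects closely tied to memory use, huge pages and NUMA, and draw some conclusions from the results. We hope the analysis and conclusions can serve as a reference for setting kernel parameters sensibly on Spark platforms.

Keywords: big data, Spark, Linux, huge pages, NUMA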

Abstract:

Research on the performance of Spark is becoming a hot topic; however, optimization strategies mostly target the application level rather than the system level. As the first layer of software above the hardware, the operating system plays a fundamental role in how well the hardware performs. The Linux kernel provides abundant parameters as an interface for optimizing system performance, but in practice these parameters are underused: most people keep the default values rather than changing them to fit the specific environment. Our experiments show that the default values are not always the best choice and are sometimes even the worst. We define the concept of an "influence ratio" and, based on it, put forward a method for understanding the influence of parameters on Spark applications by analyzing the execution of kernel functions. Given Spark's in-memory computing model, we analyze the influence of Linux kernel parameters on several typical Spark workloads from two aspects closely related to memory use, Transparent Huge Pages and NUMA, and then draw some conclusions. We hope that the analysis and conclusions can offer a useful reference for tuning kernel parameters on the Spark platform.

Key words: big data, Spark, Linux, huge page, NUMA
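
The abstract notes that most deployments run with default kernel settings. As a rough illustration of how one might check which memory-related settings a Spark node is currently using, the sketch below reads two standard Linux interfaces. The abstract does not name the specific parameters studied, so the choice of the Transparent Huge Page `enabled` file and the `numa_balancing` sysctl is an assumption made here, and the helper names are hypothetical.

```python
from pathlib import Path

# Assumed locations: standard Linux sysfs/procfs paths, not taken from the paper.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")
NUMA_BALANCING = Path("/proc/sys/kernel/numa_balancing")


def parse_bracketed_choice(text):
    """Return the active option from a line such as 'always [madvise] never'.

    The kernel marks the active choice with square brackets; None is returned
    when no bracketed token is present.
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_setting(path):
    """Read a kernel setting as text, or None if the file is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def current_memory_settings():
    """Collect the current THP mode and NUMA balancing flag (hypothetical helper)."""
    thp_raw = read_setting(THP_ENABLED)
    return {
        "transparent_hugepage": parse_bracketed_choice(thp_raw) if thp_raw else None,
        "numa_balancing": read_setting(NUMA_BALANCING),
    }


if __name__ == "__main__":
    for name, value in current_memory_settings().items():
        print(f"{name}: {value if value is not None else 'unavailable'}")
```

Recording these values alongside each benchmark run is one simple way to make clear which kernel configuration a measurement was taken under; on systems without these files the sketch reports the setting as unavailable rather than failing.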