• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science


Memory optimization of Spark parallel computing framework

LIAO Wangjian1,2, HUANG Yongfeng1,2, BAO Congkai1,2

  1. Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
  2. National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing 100084, China
  • Received: 2016-11-16 Revised: 2017-03-27 Online: 2018-04-25 Published: 2018-04-25
  • Funding:

    National Key Technology R&D Program of China (2014BAH41B00); National Natural Science Foundation of China (U1405254, U1536207)

Abstract:

Cluster parallel computing frameworks such as Spark are widely used in big data and cloud computing, and optimizing their runtime performance is key to applying them. To improve performance, this paper analyzes Spark's execution flow and memory management mechanism and, drawing on the characteristics of memory management at both the Spark and JVM levels, proposes three optimization strategies: (1) serializing and compressing cached data to reduce its size, which lowers garbage collection (GC) overhead and improves performance; (2) reducing the running memory size within a certain range, so that recomputation replaces caching, which can improve performance; (3) setting appropriate memory allocation parameters, such as the ratio of the JVM young generation to the old generation and the ratio of Spark computing space to cache space, which can improve performance considerably. Experimental results show that serialization and compression reduce the space occupied by the cache by 42%; reducing the submitted running memory from 1 000 MB to 800 MB improves performance by 21%; and optimizing the memory ratios improves performance by 10% to 30% over the default parameters.

Key words: Spark, performance optimization, heap memory
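
As a rough illustration of where the three strategies surface in practice, the sketch below assembles spark-submit arguments and estimates how an executor heap is divided. It is not the authors' configuration. The property names (spark.serializer, spark.rdd.compress, spark.memory.fraction, spark.memory.storageFraction, spark.executor.extraJavaOptions) and the JVM flag -XX:NewRatio are standard Spark and HotSpot settings, but the abstract does not state which Spark version or parameter values were used. The sketch therefore assumes the unified memory manager of Spark 1.6 and later, assumes that the "submitted running memory" refers to executor memory, and uses placeholder values throughout. The helper names build_submit_args and heap_layout are hypothetical.

```python
# Illustrative sketch only. Property names are real Spark/JVM settings;
# every numeric value passed in is an assumed placeholder, not a figure
# taken from the paper.

RESERVED_MB = 300  # memory reserved by Spark's unified memory manager (Spark 1.6+)


def build_submit_args(executor_memory_mb, new_ratio, memory_fraction, storage_fraction):
    """Return spark-submit arguments touching the three strategies (hypothetical helper)."""
    conf = {
        # Strategy 1: serialize cached data with Kryo and compress it.
        # spark.rdd.compress only affects partitions stored in serialized form.
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.rdd.compress": "true",
        # Strategy 3 (Spark level): share of the heap for execution + storage,
        # and the part of that share protected for cached data.
        "spark.memory.fraction": str(memory_fraction),
        "spark.memory.storageFraction": str(storage_fraction),
        # Strategy 3 (JVM level): NewRatio = old generation size / young generation size.
        "spark.executor.extraJavaOptions": f"-XX:NewRatio={new_ratio}",
    }
    # Strategy 2: the amount of memory submitted for each executor (assumed meaning).
    args = ["--executor-memory", f"{executor_memory_mb}m"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def heap_layout(heap_mb, new_ratio, memory_fraction, storage_fraction):
    """Approximate how a heap of heap_mb is divided at the JVM and Spark levels.

    The figures are estimates: the heap Spark actually observes is slightly
    smaller than the requested size, and the storage boundary is soft
    (execution and storage can borrow from each other).
    """
    young = heap_mb / (new_ratio + 1)            # JVM young generation
    old = heap_mb - young                        # JVM old generation
    unified = max(heap_mb - RESERVED_MB, 0) * memory_fraction
    storage = unified * storage_fraction         # region protected for cached blocks
    return {
        "young_mb": young,
        "old_mb": old,
        "unified_mb": unified,
        "storage_mb": storage,
        "execution_mb": unified - storage,
    }


if __name__ == "__main__":
    print(" ".join(build_submit_args(800, 2, 0.6, 0.5)))
    for heap in (1000, 800):
        print(heap, heap_layout(heap, 2, 0.6, 0.5))
```

Under these assumed values, a 1 000 MB heap leaves roughly 420 MB for the unified execution and storage region, and an 800 MB heap leaves roughly 300 MB. A smaller cache region holds fewer cached blocks, which is one way a smaller heap shifts work from caching toward recomputation. Whether that trade improves performance depends on the workload, as the abstract's "within a certain range" qualifier suggests.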