
Computer Engineering & Science


Memory optimization of Spark parallel computing framework

LIAO Wangjian1,2,HUANG Yongfeng1,2,BAO Congkai1,2   

  (1. Department of Electronic Engineering, Tsinghua University, Beijing 100084, China;
   2. National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing 100084, China)
  • Received: 2016-11-16  Revised: 2017-03-27  Online: 2018-04-25  Published: 2018-04-25

Abstract:

The cluster parallel computing framework represented by Spark is widely used in big data and cloud computing, and its performance optimization is a key issue in practical applications. This paper analyzes the execution process and the memory management mechanism of the Spark framework. Combining the characteristics of Spark and JVM memory management, three optimization strategies are proposed: (1) serialization and compression are used to shrink cached data, which reduces the occupied memory space and the garbage collection (GC) overhead, thereby improving performance; (2) within a certain range, the submitted running memory is reduced and recomputation replaces caching, which also improves performance; (3) performance is further improved by tuning memory allocation parameters such as the ratio of the JVM's old generation to its new generation and the ratio of Spark's execution space to its cache space. Experiments show that serialization and compression reduce the cache space by 42%, that reducing the submitted memory from 1 000 MB to 800 MB improves performance by 21%, and that optimizing the memory ratios improves performance by 10% to 30%.
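
The three strategies map onto standard Spark and JVM configuration knobs. The Scala sketch below is illustrative only and is not code from the paper; the application name, input path, and the specific fraction and NewRatio values are assumptions, while the 800 MB executor memory mirrors the value tested in the experiments.

// Minimal sketch, assuming a Spark 2.x RDD application with unified memory management.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object MemoryTuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("memory-tuning-sketch")
      // Strategy 1: serialize and compress cached RDDs to shrink the cache footprint.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.rdd.compress", "true")
      // Strategy 2: submit less executor memory and rely on recomputation
      // instead of caching every intermediate result (800 MB is the value
      // used in the paper's experiment, not a general recommendation).
      .set("spark.executor.memory", "800m")
      // Strategy 3: rebalance execution vs. storage memory and the JVM
      // generations (these exact ratios are illustrative assumptions).
      .set("spark.memory.fraction", "0.6")
      .set("spark.memory.storageFraction", "0.5")
      .set("spark.executor.extraJavaOptions", "-XX:NewRatio=2")

    val sc = new SparkContext(conf)
    val data = sc.textFile("hdfs:///path/to/input")  // hypothetical input path

    // Serialized in-memory caching keeps GC pressure lower than plain MEMORY_ONLY.
    val words = data.flatMap(_.split("\\s+")).persist(StorageLevel.MEMORY_ONLY_SER)
    println(words.count())

    sc.stop()
  }
}

Persisting with StorageLevel.MEMORY_ONLY_SER stores each partition in serialized form, which is what makes the compression setting effective and reduces GC work, at the cost of extra CPU for deserialization when the data is reused.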
 

Key words: Spark, performance optimization, heap memory