• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

Design and implementation of BIRCH algorithm
parallelization based on Spark

LI Shuai (1), WU Bin (2), DU Xiuming (3), CHEN Yufeng (3)

  1. Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia,
     Beijing University of Posts and Telecommunications, Beijing 100876, China
  2. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  3. State Grid Shandong Electric Power Research Institute, Jinan 250000, China
  • Received: 2016-09-05  Revised: 2016-11-12  Online: 2017-01-25  Published: 2017-01-25

Abstract:

In an era in which distributed computing and in-memory processing matter greatly, memory-based distributed computing frameworks such as Spark have gained unprecedented attention and are widely applied. We design and implement a parallelization of the BIRCH algorithm on Spark that aims to maximize performance by reducing the frequency of shuffling and disk access. We present a theoretical analysis and describe the DAG (directed acyclic graph) of the Spark-based BIRCH job. Finally, we compare the performance of the parallelized BIRCH algorithm with that of single-machine BIRCH and of the MLlib KMeans clustering algorithm. Experimental results show that the Spark-based parallel BIRCH algorithm achieves favorable running time and speedup without obvious loss of clustering quality.
 

Key words: Spark, BIRCH parallelization, performance optimization
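
Illustrative note: the abstract does not spell out the parallel design, so the following is only a minimal background sketch and not the authors' implementation. It shows the standard additivity property of BIRCH clustering features (point count, linear sum, sum of squared norms), which is what generally allows each data partition to be summarized locally and the small summaries to be merged afterwards. The names `CF`, `summarize`, and `partitions` are hypothetical, and plain Python lists stand in for Spark partitions.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CF:
    """Clustering feature: point count, linear sum, sum of squared norms."""
    n: int
    ls: Tuple[float, ...]
    ss: float

    @staticmethod
    def from_point(p: Sequence[float]) -> "CF":
        return CF(1, tuple(p), sum(x * x for x in p))

    def merge(self, other: "CF") -> "CF":
        # Additivity: CFs of disjoint point sets add component-wise.
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self) -> Tuple[float, ...]:
        return tuple(x / self.n for x in self.ls)


def summarize(points) -> CF:
    """Fold a non-empty collection of points into a single CF."""
    return reduce(CF.merge, (CF.from_point(p) for p in points))


# Hypothetical data: two lists standing in for two partitions.
partitions = [[(0.0, 0.0), (2.0, 0.0)], [(0.0, 2.0), (2.0, 2.0)]]

# Summarize each partition independently, then merge the summaries.
local = [summarize(part) for part in partitions]
merged = reduce(CF.merge, local)
```

Because `merge` is associative and commutative, the merged summary equals the one computed over all points at once, so only one small CF per partition has to be combined rather than the raw points.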