• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

计算机工程与科学

• 论文 • 上一篇    下一篇

基于Spark的BIRCH算法并行化的设计与实现

李帅1,吴斌2,杜修明3,陈玉峰3   

  1. (1.北京邮电大学智能通信软件与多媒体北京重点实验室,北京 100876);
    2.北京邮电大学计算机学院,北京 100876;3.国网山东省电力公司电力科学研究院,山东 济南 250000)
  • 收稿日期:2016-09-05 修回日期:2016-11-12 出版日期:2017-01-25 发布日期:2017-01-25
  • 基金资助:

    国家863计划(2015AA050204);国网科技项目(60873120)

Design and implementation of BIRCH algorithm
parallelization based on Spark

LI Shuai1,WU Bin2,DU Xiuming3,CHEN Yufeng3   

  1. (1.Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia,
    Beijing University of Posts and Telecommunicaions,Beijing 100876;
    2.School of Computer Science,Beijing University of Posts and Telecommunications,Beijing 100876;
    3.State Gride Shandong Electric Power Research Institute,Jinan 250000,China)
  • Received:2016-09-05 Revised:2016-11-12 Online:2017-01-25 Published:2017-01-25

摘要:

在分布式计算和内存为王的时代,Spark作为基于内存计算的分布式框架技术得到了前所未有的关注与应用。着重研究BIRCH算法在Spark上并行化的设计和实现,经过理论性能分析得到并行化过程中时间消耗较多的Spark转化操作,同时根据并行化BIRCH算法的有向无环图DAG,减少shuffle和磁盘读写频率,以期达到性能优化。最后,将并行化后的BIRCH算法分别与单机的BIRCH算法和MLlib中的KMeans聚类算法做了性能对比实验。实验结果表明,通过Spark对BIRCH算法并行化,其聚类质量没有明显的损失,并且获得了比较理想的运行时间和加速比。

关键词: Spark, BIRCH并行化, 性能优化

Abstract:

In the era when distributed computing and memory highly count, the technology of memorybased distributed computing framework, such as Spark, has gained unprecedented attention and is  widely applied. We design and implement the BIRCH algorithm parallelization based on Spark, which can maximize performance optimization and reduce the frequency of shuffling and disk accessing. We do some theory analysis and describe the DAG of the BIRCH based on Spark. Finally, we compare the performance of the parallelized BIRCH algorithm with the BIRCH algorithm of a single machine and the MLlib KMeans clustering algorithm. Experimental results show that the parallel BIRCH algorithm based on Spark obtains ideal running time and speedup without obvious clustering quality loss.
 

Key words: Spark, BIRCH parallelization, performance optimization