Design and implementation of BIRCH algorithm parallelization based on Spark

Computer Engineering & Science

LI Shuai1, WU Bin2, DU Xiuming3, CHEN Yufeng3

(1. Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876;

2. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876;

3. State Grid Shandong Electric Power Research Institute, Jinan 250000, China)

Received: 2016-09-05 Revised: 2016-11-12 Online: 2017-01-25 Published: 2017-01-25
Funding:
National 863 Program (2015AA050204); State Grid Science and Technology Project (60873120)

Abstract

In an era dominated by distributed and in-memory computing, Spark, a distributed framework built on in-memory computation, has attracted unprecedented attention and wide adoption. This paper studies the design and implementation of a parallel BIRCH algorithm on Spark. A theoretical performance analysis identifies the Spark transformation operations that consume the most time during parallelization, and the directed acyclic graph (DAG) of the parallel BIRCH algorithm is used to reduce the frequency of shuffles and of disk reads and writes in order to optimize performance. Finally, the parallel BIRCH algorithm is compared experimentally with a single-machine BIRCH algorithm and with the KMeans clustering algorithm in MLlib. The experimental results show that parallelizing BIRCH on Spark causes no obvious loss in clustering quality while achieving good running time and speedup.

Key words: Spark, BIRCH parallelization, performance optimization

LI Shuai, WU Bin, DU Xiuming, CHEN Yufeng. Design and implementation of BIRCH algorithm parallelization based on Spark[J]. Computer Engineering & Science.
