A k-means clustering algorithm parallelization design based on Hash

Computer Engineering & Science

ZHANG Bo, XU Weihong, CHEN Yuantao, ZHU Ling

(School of Computer & Communication Engineering, Changsha University of Science & Technology, Changsha 410114, Hunan, China)

Received: 2015-07-07  Revised: 2015-09-23  Online: 2016-10-25  Published: 2016-10-26

Funding: National Natural Science Foundation of China (61402053); Hunan Provincial Science and Technology Program (2014SK3080); Outstanding Youth Project of the Hunan Provincial Department of Education (14B005)

Abstract:

The traditional k-means algorithm clusters poorly when it processes massive, high-dimensional data on the Hadoop platform, and existing improved algorithms are not conducive to parallelization. To address these problems, we propose a parallel optimization scheme based on Hash. We first map the massive, high-dimensional data to a compressed identifier space, then mine the clustering relationships in that space and select the initial cluster centers. This avoids the sensitivity of traditional k-means to randomly chosen initial centers and reduces the number of iterations. We then parallelize the whole algorithm with the MapReduce framework, and use the Partition and Combine mechanisms to increase the degree of parallelism and the execution efficiency. Experimental results show that the proposed algorithm improves clustering accuracy and stability while also achieving good processing speed.

Key words: massive data, Hadoop, Hash, parallel k-means clustering, center selection
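
The two ideas in the abstract, hash-based selection of initial centers and a MapReduce iteration with a combine step, can be illustrated with a small sketch. The abstract does not specify the hash function or the exact seed-selection rule, so this sketch assumes a random-hyperplane sign hash and takes the means of the most populated buckets as seeds. It also simulates the map, combine, and reduce phases in plain Python rather than using the Hadoop API. All function names (`hash_seed_centers`, `map_combine`, `reduce_centers`, and so on) are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def mean(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sign_hash(point, planes):
    """Compress a point to a short bit tuple (assumed random-hyperplane hash)."""
    return tuple(
        1 if sum(x * w for x, w in zip(point, plane)) >= 0 else 0
        for plane in planes
    )


def hash_seed_centers(points, k, n_bits=4, seed=0):
    """Return up to k seeds: the means of the k most populated hash buckets.

    Assumption: bucket population is used as the "clustering relationship";
    fewer than k seeds are returned if fewer than k buckets are occupied.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    buckets = defaultdict(list)
    for p in points:
        buckets[sign_hash(p, planes)].append(p)
    largest = sorted(buckets.values(), key=len, reverse=True)[:k]
    return [mean(b) for b in largest]


def nearest(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])),
    )


def map_combine(split, centers):
    """Map + combine for one input split: one (vector sum, count) per center."""
    partial = {}
    for p in split:
        i = nearest(p, centers)
        s, n = partial.get(i, ([0.0] * len(p), 0))
        partial[i] = ([a + b for a, b in zip(s, p)], n + 1)
    return partial


def reduce_centers(partials, centers):
    """Reduce: merge the partial sums of all splits and recompute each center."""
    total = {}
    for partial in partials:
        for i, (s, n) in partial.items():
            ts, tn = total.get(i, ([0.0] * len(s), 0))
            total[i] = ([a + b for a, b in zip(ts, s)], tn + n)
    # A center that received no points keeps its previous position.
    return [
        [x / total[i][1] for x in total[i][0]] if i in total else list(c)
        for i, c in enumerate(centers)
    ]


# One iteration over two input splits, starting from fixed centers.
splits = [[(0, 0), (10, 10)], [(0, 2), (10, 12)]]
centers = [(1, 1), (9, 9)]
partials = [map_combine(s, centers) for s in splits]
new_centers = reduce_centers(partials, centers)
print(new_centers)  # [[0.0, 1.0], [10.0, 11.0]]
```

In this sketch the combine step means each split sends at most one (sum, count) pair per center to the reducer instead of one record per point, which is the usual motivation for a combiner in k-means. The center index plays the role of the key that a partitioner would route to a reducer on a real cluster.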

ZHANG Bo, XU Weihong, CHEN Yuantao, ZHU Ling. A k-means clustering algorithm parallelization design based on Hash[J]. Computer Engineering & Science.

