• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)

A k-means clustering algorithm parallelization design based on Hash

ZHANG Bo, XU Weihong, CHEN Yuantao, ZHU Ling

  1. (School of Computer & Communication Engineering, Changsha University of Science & Technology, Changsha 410114, China)
  • Received: 2015-07-07 Revised: 2015-09-23 Online: 2016-10-25 Published: 2016-10-26
  • Supported by:

    National Natural Science Foundation of China (61402053); Science and Technology Plan of Hunan Province (2014SK3080); Outstanding Youth Project of the Education Department of Hunan Province (14B005)

Abstract:

The traditional k-means algorithm clusters poorly when processing massive, high-dimensional data on the Hadoop platform, and existing improved algorithms are not conducive to parallelization. To address these problems, we propose a parallelization scheme based on a Hash improvement. We first map the massive, high-dimensional data to a compressed identifier space, then mine the clustering relationships in that space and select the initial cluster centers. This avoids the sensitivity of traditional k-means to randomly selected initial centers and reduces the number of iterations. We then parallelize the whole algorithm with the MapReduce framework and use mechanisms such as Partition and Combine to increase the degree of parallelism and the execution efficiency. Experimental results show that the proposed algorithm improves clustering accuracy and stability while also achieving good processing speed.

Key words: massive data, Hadoop, Hash, parallel k-means clustering, center selection
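
The abstract does not specify the hash function or the exact center-selection rule, so the following is only a minimal sketch of the general idea under stated assumptions. It uses a random-hyperplane sign hash as a stand-in for the paper's Hash step, takes the means of the most populated hash buckets as initial centers, and simulates one k-means iteration in a map/combine/reduce style in plain Python. All function names (`hash_seed_centers`, `map_phase`, `combine_phase`, `reduce_phase`, `kmeans_iteration`) and the parameters `n_bits` and `seed` are hypothetical; this is not the Hadoop API, and the Partition step is reduced to simple concatenation of the combiner outputs.

```python
import random
from collections import defaultdict

def mean_vec(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sign_hash(point, planes):
    """Assumed hash: one bit per random hyperplane (sign of the dot product)."""
    return tuple(int(sum(x * w for x, w in zip(point, plane)) >= 0)
                 for plane in planes)

def hash_seed_centers(points, k, n_bits=4, seed=0):
    """Pick k initial centers as the means of the k most populated hash buckets."""
    rng = random.Random(seed)
    origin = mean_vec(points)  # center the data so planes through 0 can split it
    planes = [[rng.gauss(0, 1) for _ in origin] for _ in range(n_bits)]
    buckets = defaultdict(list)
    for pt in points:
        centered = [x - m for x, m in zip(pt, origin)]
        buckets[sign_hash(centered, planes)].append(pt)
    largest = sorted(buckets.values(), key=len, reverse=True)[:k]
    centers = [mean_vec(b) for b in largest]
    while len(centers) < k:  # fewer non-empty buckets than k: pad with random points
        centers.append(list(rng.choice(points)))
    return centers

def nearest(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])))

def map_phase(split, centers):
    """Mapper: emit (center_index, (point, 1)) for each point of one input split."""
    return [(nearest(p, centers), (list(p), 1)) for p in split]

def combine_phase(pairs):
    """Combiner: sum vectors and counts per center locally, before the shuffle."""
    partial = {}
    for idx, (vec, count) in pairs:
        if idx in partial:
            acc, n = partial[idx]
            partial[idx] = ([a + b for a, b in zip(acc, vec)], n + count)
        else:
            partial[idx] = (list(vec), count)
    return list(partial.items())

def reduce_phase(pairs, centers):
    """Reducer: merge partial sums; a center with no points keeps its old value."""
    merged = dict(combine_phase(pairs))
    return [[s / merged[i][1] for s in merged[i][0]] if i in merged else list(c)
            for i, c in enumerate(centers)]

def kmeans_iteration(splits, centers):
    """One iteration: map and combine each split, then reduce the merged output."""
    shuffled = [pair for split in splits
                for pair in combine_phase(map_phase(split, centers))]
    return reduce_phase(shuffled, centers)
```

A driver would call `hash_seed_centers` once on a sample of the data and then repeat `kmeans_iteration` until the centers stop moving. The combiner sends one partial sum per center per split instead of one record per point, which is the usual reason for using a Combine step in a MapReduce k-means.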