• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science


A k-means clustering algorithm parallelization design based on Hash

ZHANG Bo,XU Weihong,CHEN Yuantao,ZHU Ling   


  1. (School of Computer & Communication Engineering, Changsha University of Science & Technology, Changsha 410114, China)
  • Received: 2015-07-07 Revised: 2015-09-23 Online: 2016-10-25 Published: 2016-10-26

Abstract:

As the traditional kmeans algorithm has poor clustering effect when dealing with massive volume and high dimensional data, and the existing optimization algorithms are not conductive to parallelization, we propose a parallel optimization scheme based on Hash algorithm. We firstly map the massive volume and high dimensional data to a compressed identifier space, then mine the clustering relationship and select the initial clustering center. These steps avoid the sensitivity of the kmeans algorithm to the random selection of the initial clustering center, and reduce the number of iterations. Finally, combined with the MapReduce, the Partition and Combine mechanisms are applied to optimize the parallelization of this algorithm, thus the degree of parallelization and execution efficiency are more strengthened. Experimental results show that the proposed algorithm can improve the clustering accuracy and stability, and has good processing performance as well.

Key words: massive data, Hadoop, Hash, parallel k-means clustering, center selection