• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (09): 1621-1626.

• Articles •

An improved hybrid clustering algorithm based on large data sets

ZHANG Xiao, WANG Hong

  (1. School of Information Science and Engineering, Shandong Normal University, Jinan 250014;
   2. Key Laboratory of Distributed Computer Software in Shandong Province, Jinan 250014, China)
  • Received: 2014-09-28 Revised: 2014-12-16 Online: 2015-09-25 Published: 2015-09-25

Abstract:

To address three problems of the k-means algorithm (excessive dependence on the initial cluster centers, slow convergence, and insufficient memory when processing very large amounts of data), we present a new hybrid clustering algorithm for large data sets called superkmeans. The algorithm combines k-means with an improved high-dimensional data clustering algorithm based on the supernetwork. We parallelize it with MapReduce and run it on a Hadoop cluster, where it achieves good clustering results. Experimental results show that the algorithm not only improves convergence and clustering accuracy but also achieves high speedup and good scalability.

Key words: k-means; super network; frequent itemsets; hypergraph partitioning; MapReduce
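
As a rough illustration of how a k-means iteration can be split into map and reduce steps, the following is a minimal single-process sketch. It is not the authors' superkmeans implementation and does not use Hadoop: the function names (`map_assign`, `reduce_update`, `kmeans`) are hypothetical, and the initial centers are simply passed in as an argument, standing in for whatever the supernetwork-based step in the paper would supply.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(points, centers):
    """Map step: emit (index of nearest center, point) for every point."""
    for p in points:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(p, centers[i]))
        yield nearest, p


def reduce_update(pairs, centers):
    """Reduce step: average the points grouped under each center index.

    A center that received no points is kept unchanged (an assumption of
    this sketch, not something stated in the abstract).
    """
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)

    new_centers = []
    for i, old in enumerate(centers):
        members = groups.get(i)
        if not members:
            new_centers.append(tuple(old))
            continue
        dim = len(old)
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        )
    return new_centers


def kmeans(points, initial_centers, max_iter=100, tol=1e-9):
    """Repeat map and reduce until no center moves by more than tol."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        updated = reduce_update(map_assign(points, centers), centers)
        shift = max(squared_distance(a, b) for a, b in zip(centers, updated))
        centers = updated
        if shift <= tol:
            break
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

In a real MapReduce job the map step would run on separate data splits and the framework would group the emitted pairs by center index before the reduce step. Because each iteration starts from the previous centers, a better choice of initial centers reduces the number of map and reduce rounds needed, which is the dependence on initialization that the abstract refers to.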