• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (09): 1621-1626.

• Papers •

An improved hybrid clustering algorithm based on large data sets

张晓,王红   

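  1. (1. School of Information Science and Engineering, Shandong Normal University, Jinan 250014, Shandong; 2. Key Laboratory of Distributed Computer Software in Shandong Province, Jinan 250014, Shandong)
  • Received: 2014-09-28 Revised: 2014-12-16 Online: 2015-09-25 Published: 2015-09-25
  • Supported by:

    National Natural Science Foundation of China (61373149, 61472233); Science and Technology Plan Project of Shandong Province (2012GGX10118, 2014GGX101026)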

An improved hybrid clustering algorithm based on large data sets

ZHANG Xiao, WANG Hong

  1. (1. School of Information Science and Engineering, Shandong Normal University, Jinan 250014;
    2. Key Laboratory of Distributed Computer Software in Shandong Province, Jinan 250014, China)
  • Received: 2014-09-28 Revised: 2014-12-16 Online: 2015-09-25 Published: 2015-09-25

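Abstract:

The k-means algorithm depends heavily on its initial cluster centers, converges slowly, and runs short of memory when processing massive data. To address these limitations, we propose superkmeans, a new hybrid clustering algorithm for large data sets. It combines an improved supernetwork-based clustering algorithm for high-dimensional data with k-means, and is parallelized with MapReduce and deployed on a Hadoop cluster. Experiments show that the algorithm not only improves convergence and clustering accuracy, but also substantially improves speedup and scalability.

Key words: k-means, supernetwork, frequent itemsets, hypergraph partitioning, MapReduce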

Abstract:

To address three problems of the k-means algorithm (excessive dependence on the initial cluster centers, slow convergence, and insufficient memory when handling very large amounts of data), we present a new hybrid clustering algorithm for large data sets called superkmeans. The algorithm combines k-means with an improved supernetwork-based clustering algorithm for high-dimensional data. We parallelize it with MapReduce and run it on a Hadoop cluster, where it achieves good clustering results. Experimental results show that the algorithm not only improves convergence and clustering accuracy but also achieves high speedup and scalability.

Key words: k-means; supernetwork; frequent itemsets; hypergraph partitioning; MapReduce
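
The abstract does not give the details of superkmeans, so the following is only a minimal sketch of the general pattern it names: k-means iterations expressed as map and reduce steps, started from externally supplied centers. All function names (`map_assign`, `reduce_mean`, `kmeans_iteration`, `kmeans`) are hypothetical, and the `initial_centers` argument merely stands in for whatever seeding the supernetwork-based stage would provide; none of this is taken from the paper.

```python
from collections import defaultdict
from math import dist


def map_assign(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce step: average all points that share one center index."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans_iteration(points, centers):
    """One map/shuffle/reduce round; returns the updated centers."""
    groups = defaultdict(list)  # shuffle: group points by center index
    for p in points:
        key, value = map_assign(p, centers)
        groups[key].append(value)
    # A center with no assigned points keeps its previous position.
    return [reduce_mean(groups[i]) if groups[i] else centers[i]
            for i in range(len(centers))]


def kmeans(points, initial_centers, max_iter=100, tol=1e-9):
    """Repeat rounds until no center moves by more than tol."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centers)
        if all(dist(a, b) <= tol for a, b in zip(centers, updated)):
            return updated
        centers = updated
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    # Assumed seeds; in the paper these would come from the first stage.
    print(kmeans(data, initial_centers=[(1, 1), (9, 9)]))
```

The sketch runs in a single process. On a Hadoop cluster the map and reduce steps would run as distributed tasks over data splits, with the current centers shared with every mapper at the start of each round. Good initial centers matter here because each extra round costs a full pass over the data.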