An improved hybrid clustering algorithm based on large data sets

J4 ›› 2015, Vol. 37 ›› Issue (09): 1621-1626.

ZHANG Xiao (张晓), WANG Hong (王红)

(1. School of Information Science and Engineering, Shandong Normal University, Jinan 250014, China; 2. Key Laboratory of Distributed Computer Software in Shandong Province, Jinan 250014, China)

Received: 2014-09-28 Revised: 2014-12-16 Online: 2015-09-25 Published: 2015-09-25
Supported by:
National Natural Science Foundation of China (61373149, 61472233); Shandong Province Science and Technology Plan Project (2012GGX10118, 2014GGX101026)

Abstract:

The k-means algorithm depends heavily on its initial cluster centers, converges slowly, and runs out of memory when processing massive data. To address these three problems, we propose superkmeans, a new hybrid clustering algorithm for large data sets. It combines k-means with an improved supernetwork-based clustering algorithm for high-dimensional data, and it is parallelized with MapReduce and deployed on a Hadoop cluster. Experimental results show that the algorithm improves both convergence and clustering accuracy, and that its speedup and scalability are also substantially better.

Key words: k-means; supernetwork; frequent itemsets; hypergraph partitioning; MapReduce
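
As a rough illustration of how a single k-means iteration can be phrased in map/reduce terms, the following Python sketch may help. It is not the superkmeans implementation from the paper: the abstract does not specify how the supernetwork stage selects initial centers, so the sketch simply takes `centers` as an input, and it runs in one process rather than on Hadoop. All function names (`map_assign`, `reduce_mean`, `kmeans_iteration`, `kmeans`) are hypothetical and chosen for this example only.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_assign(point, centers):
    """Map step: emit (index of the nearest center, (point, 1))."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, (point, 1)


def reduce_mean(pairs):
    """Reduce step: average all points that were assigned to one center."""
    total, count = None, 0
    for point, n in pairs:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(x / count for x in total)


def kmeans_iteration(points, centers):
    """Run one map / shuffle / reduce round and return the updated centers."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = map_assign(p, centers)
        groups[key].append(value)
    # A center that attracted no points keeps its previous position.
    return [reduce_mean(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))]


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Repeat iterations until no center moves by more than tol."""
    for _ in range(max_iter):
        new_centers = kmeans_iteration(points, centers)
        if all(dist(a, b) <= tol for a, b in zip(centers, new_centers)):
            return new_centers
        centers = new_centers
    return centers
```

In this formulation the quality of the initial `centers` determines how many rounds are needed, which is presumably why the paper pairs k-means with a separate stage for choosing them. On a real cluster each round would be one MapReduce job, with the current centers distributed to every mapper.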

ZHANG Xiao, WANG Hong. An improved hybrid clustering algorithm based on large data sets [J]. J4, 2015, 37(09): 1621-1626.
