• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2013, Vol. 35 ›› Issue (2): 127-132.

• Articles •

A dual clustering method for mixed-attribute data sets

Chen Xinquan

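  1. (School of Computer Science and Engineering, Chongqing Three Gorges University, Chongqing 404000, China)
  • Received: 2012-05-19 Revised: 2012-08-23 Published: 2013-02-25 Released: 2013-02-25
  • Funding:

    Supported by the Scientific Research Project Program of Chongqing Three Gorges University (11ZZ058)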

Dual clustering method for mixed data sets

Chen Xinquan     

  1. (School of Computer Science and Engineering, Chongqing Three Gorges University, Chongqing 404000, China)
  • Received: 2012-05-19 Revised: 2012-08-23 Online: 2013-02-25 Published: 2013-02-25

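Abstract:

To meet the data preprocessing needs of complex information environments, a dual clustering method that can handle mixed-attribute data sets is proposed. The method is realized by a construction algorithm for a dual near-neighbor undirected graph (or its improved version), combined with a dual near-neighbor graph clustering algorithm based on disjoint-set merging, on breadth-first search, or on depth-first search. Simulation experiments on artificial data sets and UCI standard data sets verify that, although the three clustering algorithms use different search strategies, their final results are identical. The experiments also show that, for data sets with an evident cluster structure and no near-neighbor noise, the method often achieves better clustering accuracy than the K-means algorithm and the AP algorithm, which indicates that the dual clustering method is effective to a certain degree. To further promote the method and uncover its practical value, some research prospects are given at the end.

Key words: mixed data set, disjoint set, breadth-first search, depth-first search, dual clustering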

Abstract:

To preprocess mixed data sets from complex information environments effectively, this paper proposes a dual clustering method. The method is implemented by a construction algorithm for a dual near-neighbor undirected graph (or its improved variant), together with one of three clustering algorithms on that graph: one based on merging disjoint sets, one based on breadth-first search, or one based on depth-first search. Simulation experiments on several artificial data sets and UCI standard data sets verify that the three clustering algorithms obtain the same final results, although they use different search strategies. The experimental results also show that the method can often obtain better clustering quality than the k-means algorithm and the AP algorithm on data sets with apparent clusters and without near-neighbor noise. This indicates that the dual clustering method is comparatively effective and practical. Finally, some research prospects are given for further promoting the method and uncovering its practical value.

Key words: mixed data set; disjoint set; breadth-first search; depth-first search; dual clustering
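
The abstract's claim that the disjoint-set, breadth-first, and depth-first variants yield the same clusters can be illustrated with a minimal Python sketch. The abstract does not define the dual near-neighbor graph, so everything below is an assumption made for illustration: the distance `mixed_distance`, the threshold `delta`, and the rule that two points are joined when their distance is at most `delta` are hypothetical and are not the paper's construction. Under that assumption, each clustering variant simply recovers the connected components of the same undirected graph, which is why the three partitions coincide.

```python
from collections import deque

def mixed_distance(a, b):
    """Hypothetical mixed-attribute distance (an assumption, not the paper's):
    absolute difference for numeric fields, 0/1 mismatch for categorical ones."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += abs(x - y)
        else:
            total += 0.0 if x == y else 1.0
    return total

def build_graph(points, delta):
    """Assumed stand-in for the near-neighbor graph: join i and j if distance <= delta."""
    n = len(points)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if mixed_distance(points[i], points[j]) <= delta:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def cluster_disjoint_set(adj):
    """Merge the two endpoints of every edge with a union-find structure."""
    parent = list(range(len(adj)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, neighbors in enumerate(adj):
        for j in neighbors:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri
    return [find(i) for i in range(len(adj))]

def cluster_bfs(adj):
    """Label each connected component by breadth-first search from its first node."""
    labels = [-1] * len(adj)
    for start in range(len(adj)):
        if labels[start] != -1:
            continue
        labels[start] = start
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if labels[nbr] == -1:
                    labels[nbr] = start
                    queue.append(nbr)
    return labels

def cluster_dfs(adj):
    """Label each connected component by iterative depth-first search."""
    labels = [-1] * len(adj)
    for start in range(len(adj)):
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if labels[node] != -1:
                continue
            labels[node] = start
            stack.extend(nbr for nbr in adj[node] if labels[nbr] == -1)
    return labels

def as_partition(labels):
    """Convert per-point labels into a sorted list of clusters (lists of indices)."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    return sorted(groups.values())

# Toy mixed data: one numeric and one categorical attribute per point.
points = [(1.0, "red"), (1.2, "red"), (1.4, "red"),
          (5.0, "blue"), (5.1, "blue"), (9.0, "red")]
adj = build_graph(points, delta=0.5)
partition_ds = as_partition(cluster_disjoint_set(adj))
partition_bfs = as_partition(cluster_bfs(adj))
partition_dfs = as_partition(cluster_dfs(adj))
```

On this toy input all three functions return the partition `[[0, 1, 2], [3, 4], [5]]`. The search strategy only changes the order in which points are visited, not which points end up together.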