• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2010, Vol. 32 ›› Issue (8): 94-97. doi: 10.3969/j.issn.1007130X.2010.

• Paper •

A Genetic Algorithm for High-Dimensional Data Clustering

SUN Haojun (孙浩军), XIONG Langhuan (熊琅环)

  1. (Department of Computer Science, Shantou University, Shantou 515063, Guangdong, China)
  • Received: 2009-03-31 Revised: 2009-10-21 Online: 2010-07-25 Published: 2010-07-28
  • About the authors: SUN Haojun (1963), male, from Hengshui, Hebei; Ph.D., professor; research interests include pattern recognition and data mining. XIONG Langhuan, master's student; research interest: data mining.
  • Supported by:

    Natural Science Foundation of Guangdong Province (8151503101000016)

Abstract:

Cluster analysis is an important research topic in data mining. In many real applications, the data to be clustered are high-dimensional: document data and gene microarray data, for example, can have several hundred or even more than a thousand dimensions, and data in such spaces are sparsely distributed. As a result, many classical clustering algorithms that work well on low-dimensional data often fail on high-dimensional data. To address this problem, this paper proposes a new high-dimensional data clustering method based on a genetic algorithm. The method uses the global search capability of the genetic algorithm to search the feature space for effective feature subspaces for clustering. To capture the role that each feature dimension plays in subspace clustering, a fitness function based on the contribution rate of the feature dimensions to subspace clustering is designed. Experimental results on artificial and real-world data sets, together with a comparison against the k-means algorithm, indicate the feasibility and effectiveness of the proposed method.
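The abstract does not state the chromosome encoding, the genetic operators, or the exact fitness formula, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes a binary mask over feature dimensions as the chromosome, runs a small k-means inside the selected subspace, and scores the mask by the average fraction of each selected dimension's variance that the clustering explains. That score is a stand-in for the paper's contribution-rate fitness. All function names and parameters (`kmeans_labels`, `fitness`, `evolve`, `p_mut`, and so on) are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_labels(points, k, iters=10):
    """Small k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def fitness(data, mask, k):
    """ASSUMED stand-in fitness: mean, over the selected dimensions, of the
    fraction of that dimension's variance explained by the subspace clustering."""
    dims = [d for d, bit in enumerate(mask) if bit]
    if not dims:
        return 0.0
    projected = [[row[d] for d in dims] for row in data]
    labels = kmeans_labels(projected, k)
    ratios = []
    for j in range(len(dims)):
        col = [p[j] for p in projected]
        mean = sum(col) / len(col)
        total = sum((x - mean) ** 2 for x in col)
        within = 0.0
        for c in range(k):
            vals = [x for x, lab in zip(col, labels) if lab == c]
            if vals:
                m = sum(vals) / len(vals)
                within += sum((x - m) ** 2 for x in vals)
        ratios.append(1.0 - within / total if total > 0 else 0.0)
    return sum(ratios) / len(ratios)

def evolve(data, k, pop_size=12, generations=15, p_mut=0.1, seed=0):
    """Generic GA over feature masks: elitism, tournament selection,
    one-point crossover, bit-flip mutation (all assumed choices)."""
    rng = random.Random(seed)
    n = len(data[0])
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    score = lambda m: fitness(data, m, k)
    for _ in range(generations):
        ranked = sorted(pop, key=score, reverse=True)
        nxt = ranked[:2]  # carry the two best masks over unchanged
        while len(nxt) < pop_size:
            a, b = (max(rng.sample(ranked, 3), key=score) for _ in range(2))
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            nxt.append(child)
        pop = nxt
    best = max(pop, key=score)
    return best, score(best)

# Toy data (made up for illustration): 2 informative dimensions, 4 noise dimensions.
rng = random.Random(1)
data = []
for i in range(60):
    centre = 0.0 if i < 30 else 10.0
    informative = [rng.gauss(centre, 1.0) for _ in range(2)]
    noise = [rng.uniform(0.0, 10.0) for _ in range(4)]
    data.append(informative + noise)

best_mask, best_score = evolve(data, k=2)
print(best_mask, round(best_score, 3))
```

On data like this, masks limited to the two informative dimensions should score well above masks made of noise dimensions, because the clusters explain most of the variance only along the informative axes. The stand-in score does not reward larger subspaces, so a practical fitness function would likely need an additional term for subspace size.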

Key words: highdimensional data clustering;genetic algorithm;feature subspace

CLC Number: