• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2008, Vol. 30 ›› Issue (8): 92-96.

• Paper •

A Hybrid Web Text Clustering Algorithm Based on Frequent Term Sets and k-Means

王乐 田李 贾焰 韩伟红   

  • Publication date: 2008-08-01  Release date: 2010-05-19

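Abstract (translated from the Chinese):

Web text clustering currently faces three main challenges: the massive scale of the data, the complexity of processing high-dimensional spaces, and the understandability of the clustering results. To address these challenges, this paper proposes topHDC, a hybrid clustering algorithm based on top-k frequent term sets and k-means. The algorithm avoids high-dimensional vector processing when generating the initial clusters, and the k frequent term sets provide an understandable explanation of the clustering results. topHDC also avoids a problem of existing algorithms, in which clustering results are distorted by document length. Experiments on two public datasets show that topHDC clearly outperforms two other representative clustering algorithms in both clustering quality and running efficiency.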

Keywords: text mining; clustering; frequent term set; k-means

Abstract:

To address the major challenges of current web document clustering, i.e., the huge volume of documents, high-dimensional processing, and the understandability of the clustering results, we propose a simple hybrid algorithm called topHDC based on top-k frequent term sets and k-means. The top-k frequent term sets are used to produce k initial clusters, which are further refined by k-means. The k frequent term sets provide an understandable description of the clusters. Experimental results on two public datasets indicate that topHDC outperforms two other representative clustering algorithms in both efficiency and effectiveness.

Key words: text mining; document clustering; frequent term set; k-means
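
The abstract gives only the outline of topHDC: top-k frequent term sets seed k initial clusters, and k-means then refines them. The Python sketch below illustrates that outline and is not the authors' implementation. Several details are assumptions made for the example and are not taken from the paper: term sets have a fixed size of 2 and are ranked by document support, each document is initially assigned to the first term set it fully contains, and refinement uses binary unit-length term vectors with cosine similarity. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def top_k_term_sets(docs, k, size=2):
    """Return up to k term sets of the given size with the highest document support.

    Assumption: fixed-size term sets ranked by support; the paper may select differently.
    """
    support = Counter()
    for terms in docs:
        for combo in combinations(sorted(set(terms)), size):
            support[combo] += 1
    # Highest support first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(support.items(), key=lambda item: (-item[1], item[0]))
    return [set(combo) for combo, _ in ranked[:k]]


def unit_vector(terms):
    """Binary term vector scaled to unit length (assumed way to limit document-length effects)."""
    vocab = set(terms)
    if not vocab:
        return {}
    weight = 1.0 / math.sqrt(len(vocab))
    return {term: weight for term in vocab}


def centroid(vectors):
    """Sum the sparse vectors and rescale the result to unit length."""
    total = Counter()
    for vector in vectors:
        total.update(vector)  # adds the weights term by term
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {term: w / norm for term, w in total.items()} if norm else {}


def cosine(a, b):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def hybrid_cluster(docs, k, iterations=10):
    """Seed clusters from frequent term sets, then refine them with k-means steps."""
    term_sets = top_k_term_sets(docs, k)
    if not term_sets:
        raise ValueError("no term sets found; documents need at least two distinct terms")
    vectors = [unit_vector(terms) for terms in docs]

    # Initial clusters: no vector arithmetic, only set containment. -1 means unassigned.
    labels = []
    for terms in docs:
        doc_terms = set(terms)
        labels.append(next((i for i, ts in enumerate(term_sets) if ts <= doc_terms), -1))

    # k-means refinement: recompute centroids, reassign every document, stop when stable.
    for _ in range(iterations):
        centroids = [
            centroid([v for v, label in zip(vectors, labels) if label == i])
            for i in range(len(term_sets))
        ]
        new_labels = [
            max(range(len(centroids)), key=lambda i: cosine(v, centroids[i]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, term_sets
```

Calling `hybrid_cluster(docs, k)` with `docs` as a list of token lists returns one cluster label per document together with the seeding term sets, which can serve as human-readable cluster descriptions in the spirit of the abstract.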