A Hybrid Web Document Clustering Algorithm Based on Frequent Term Sets and k-Means

J4 ›› 2008, Vol. 30 ›› Issue (8): 92-96.

王乐, 田李, 贾焰, 韩伟红

Online: 2008-08-01 Published: 2010-05-19

Abstract

Web document clustering currently faces three main challenges: the huge volume of documents, the complexity of processing high-dimensional spaces, and the understandability of the clustering results. To address these challenges, we propose topHDC, a simple hybrid clustering algorithm based on top-k frequent term sets and k-means. The top-k frequent term sets are used to produce k initial clusters, which are further refined by k-means. Generating the initial clusters in this way avoids high-dimensional vector processing, and the k frequent term sets provide an understandable description of the clustering results. topHDC also avoids a problem of existing algorithms, in which document length interferes with the clustering results. Experimental results on two public datasets indicate that topHDC clearly outperforms two other representative clustering algorithms in both clustering quality and running efficiency.

Key words: text mining; document clustering; frequent term set; k-means
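
The abstract does not give the details of topHDC, so the following is only a minimal sketch of the general idea it describes: frequent term sets seed k initial clusters, and k-means then refines them. The Apriori-style mining routine, the greedy coverage rule used to pick the k term sets, the unit-length term-frequency vectors, the cosine similarity measure, and all function names (`frequent_term_sets`, `top_k_term_sets`, `top_hdc_sketch`) are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Map each frequent term set to the set of document indices containing it."""
    cover = {}
    for i, doc in enumerate(docs):
        for term in set(doc):
            cover.setdefault(frozenset([term]), set()).add(i)
    level = {s: c for s, c in cover.items() if len(c) >= min_support}
    result = dict(level)
    for size in range(2, max_size + 1):
        nxt = {}
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == size and union not in nxt:
                docs_with_both = level[a] & level[b]
                if len(docs_with_both) >= min_support:
                    nxt[union] = docs_with_both
        result.update(nxt)
        level = nxt
    return result


def top_k_term_sets(fts, k):
    """Assumed selection rule: greedily pick the set covering most uncovered docs."""
    chosen, covered = [], set()
    while len(chosen) < k:
        best = max(fts, default=None,
                   key=lambda s: (len(fts[s] - covered), len(s), sorted(s)))
        if best is None or not fts[best] - covered:
            break
        chosen.append(best)
        covered |= fts[best]
    return chosen


def unit_vector(doc):
    """Term-frequency vector scaled to unit length (sparse dict)."""
    tf = Counter(doc)
    norm = math.sqrt(sum(v * v for v in tf.values())) or 1.0
    return {t: v / norm for t, v in tf.items()}


def centroid(vectors):
    """Unit-length mean direction of the given sparse vectors ({} if none)."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def top_hdc_sketch(docs, k, min_support=2, max_iter=20):
    fts = frequent_term_sets(docs, min_support)
    seeds = top_k_term_sets(fts, k)
    if not seeds:
        raise ValueError("no frequent term sets found; lower min_support")
    # Initial clusters come from set membership only, no vectors needed yet.
    labels = [-1] * len(docs)
    for i in range(len(docs)):
        for c, seed in enumerate(seeds):
            if i in fts[seed]:
                labels[i] = c
                break
    # Refinement: k-means with cosine similarity on unit-length vectors.
    vecs = [unit_vector(d) for d in docs]
    for _ in range(max_iter):
        cents = [centroid([vecs[i] for i, l in enumerate(labels) if l == c])
                 for c in range(len(seeds))]
        new = [max(range(len(seeds)), key=lambda c: dot(v, cents[c]))
               for v in vecs]
        if new == labels:
            break
        labels = new
    return labels, seeds


docs = [
    "apple banana fruit juice".split(),
    "apple fruit pie".split(),
    "banana fruit smoothie".split(),
    "python code bug".split(),
    "python code test".split(),
    "code bug fix".split(),
]
labels, seeds = top_hdc_sketch(docs, k=2)
print(labels, [sorted(s) for s in seeds])
```

In this sketch the selected term sets double as human-readable cluster labels, and the initial assignment uses only set membership, so vector arithmetic is deferred to the refinement step.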

王乐, 田李, 贾焰, 韩伟红. A Hybrid Web Document Clustering Algorithm Based on Frequent Term Sets and k-Means[J]. J4, 2008, 30(8): 92-96.
