• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2008, Vol. 30 ›› Issue (8): 92-96.

• Papers •

  

  • Online: 2008-08-01 Published: 2010-05-19

Abstract:

To address the major challenges of current web document clustering, i.e., the huge volume of documents, high-dimensional processing, and the understandability of the clustering results, we propose a simple hybrid algorithm called topHDC, based on top-k frequent term sets and k-means. The top-k frequent term sets are used to produce k initial clusters, which are further refined by k-means. An understandable description of the clustering is provided by the k frequent term sets. Experimental results on two public datasets indicate that topHDC outperforms two other representative clustering algorithms in both efficiency and effectiveness.
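
The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' topHDC implementation: the abstract does not say how the top-k term sets are chosen or how documents are represented, so everything below is an assumption. Term sets are picked greedily by document support among documents not yet covered, documents are raw term-count vectors, k-means uses cosine similarity, and the names `initial_clusters`, `hybrid_cluster`, `max_size` and `max_iter` are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def initial_clusters(docs, k, max_size=2):
    """Greedily pick up to k frequent term sets (assumed selection rule).

    Each pick is the term set with the highest support among documents that
    are still uncovered; it then claims every uncovered document containing it.
    """
    labels = [None] * len(docs)
    term_sets = []
    for cluster_id in range(k):
        support = Counter()
        for doc, label in zip(docs, labels):
            if label is None:
                terms = sorted(set(doc))
                for size in range(1, max_size + 1):
                    support.update(combinations(terms, size))
        if not support:
            break
        # Highest support first; on ties prefer larger sets, then alphabetical.
        best = min(support, key=lambda ts: (-support[ts], -len(ts), ts))
        term_sets.append(frozenset(best))
        for i, doc in enumerate(docs):
            if labels[i] is None and term_sets[-1] <= set(doc):
                labels[i] = cluster_id
    return labels, term_sets


def cosine(u, v):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Mean of a non-empty list of sparse term-count vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}


def hybrid_cluster(docs, k, max_iter=20):
    """Seed clusters from frequent term sets, then refine with k-means."""
    labels, term_sets = initial_clusters(docs, k)
    if not term_sets:
        return labels, term_sets
    vectors = [Counter(doc) for doc in docs]
    for _ in range(max_iter):
        centroids = []
        for i, ts in enumerate(term_sets):
            members = [v for v, lab in zip(vectors, labels) if lab == i]
            # An empty cluster falls back to its term set as a pseudo-centroid.
            centroids.append(centroid(members) if members else dict.fromkeys(ts, 1.0))
        new_labels = [
            max(range(len(centroids)), key=lambda i: cosine(v, centroids[i]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, term_sets
```

In this sketch the returned term sets double as the human-readable cluster descriptions, and documents that no term set covers are placed by the first k-means reassignment.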

Key words: text mining; document clustering; frequent term set; k-means