• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2013, Vol. 35 ›› Issue (7): 164-168.

• Articles •

Topic text clustering algorithm based on word co-occurrence

BAI Qiuchan (白秋产)1, JIN Chunxia (金春霞)2, ZHANG Hui (章慧)2, ZHOU Haiyan (周海岩)2

  1. School of Electronic and Electrical Engineering, Huaiyin Institute of Technology, Huai’an 223003, Jiangsu, China
  2. School of Computer Engineering, Huaiyin Institute of Technology, Huai’an 223003, Jiangsu, China
  • Received: 2012-04-09 Revised: 2012-08-13 Online: 2013-07-25 Published: 2013-07-25
  • Supported by:

    Huai’an Science and Technology Support Project (HGA0906); Jiangsu Provincial Department of Education funded project (2012SJD870001)

Abstract:

The topic of a text is the key to text clustering, and the co-occurring word pairs in a document express its topic very strongly. Building on a study of existing text topic mining and co-occurring word pair extraction algorithms, this paper proposes a text topic clustering algorithm based on association rules and word co-occurrence (TCABARWC). First, the algorithm extracts the co-occurring word pairs of each document with an association rule mining algorithm and uses word co-occurrence to extract the topic information of the text. Second, it models documents by their co-occurring word pairs and implements a similarity measure over those pairs. Finally, it applies a hierarchical clustering algorithm to cluster the documents. Experimental results show that, compared with other clustering algorithms, hierarchical clustering based on association-rule co-occurring word pairs greatly reduces both the dimensionality of the text vectors and the complexity of the algorithm, significantly improves the efficiency and accuracy of clustering, and produces good clustering results.

Key words: word co-occurrence; association rules; data mining; hierarchical clustering
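The three stages named in the abstract (co-occurring pair extraction, pair-based similarity, hierarchical clustering) can be illustrated with a small sketch. It is not the authors' TCABARWC implementation: the abstract does not give the similarity measure, the support threshold, or the linkage criterion, so the Jaccard similarity, `min_support=2`, average linkage, the function names, and the toy corpus are all assumptions made here for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(docs, min_support=2):
    """Return word pairs that co-occur in at least `min_support` documents.

    This is the frequent 2-itemset step of Apriori-style association rule
    mining, with each document treated as one transaction of distinct words.
    """
    counts = Counter()
    for words in docs:
        counts.update(combinations(sorted(set(words)), 2))
    return {pair for pair, n in counts.items() if n >= min_support}


def pair_features(words, pairs):
    """Represent a document by the frequent pairs it fully contains."""
    present = set(words)
    return {p for p in pairs if p[0] in present and p[1] in present}


def jaccard(a, b):
    """Jaccard similarity of two pair sets (an assumed measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(features, k):
    """Average-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(features))]

    def linkage(c1, c2):
        sims = [jaccard(features[i], features[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > k:
        # Find and merge the two most similar clusters (a < b always holds).
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical, already-tokenized toy corpus.
docs = [
    ["stock", "market", "price", "rise"],
    ["stock", "market", "price", "fall"],
    ["football", "match", "goal", "win"],
    ["football", "match", "goal", "lose"],
]
pairs = frequent_pairs(docs, min_support=2)
features = [pair_features(d, pairs) for d in docs]
result = cluster(features, k=2)
print(result)  # [[0, 1], [2, 3]]
```

In this toy corpus, 6 frequent pairs stand in for a 10-word vocabulary, which gives a rough feel for the dimensionality reduction the abstract describes. Real corpora would need word segmentation, stop-word removal, and tuned thresholds.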