• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2011, Vol. 33 ›› Issue (6): 154-158.

• Papers •

Position-Weighted Text Clustering Algorithm

JIN Chunxia, ZHOU Haiyan

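  1. (School of Computer Engineering, Huaiyin Institute of Technology, Huai'an, Jiangsu 223003)
  • Received: 2010-09-15 Revised: 2011-12-28 Published: 2011-06-25 Released: 2011-06-25
  • About the authors: JIN Chunxia (born 1973), female, from Xingping, Shaanxi, associate professor; research interests: computer applications, information processing, and data mining. ZHOU Haiyan (born 1957), male, from Henan, professor, CCF member (E200011783S); research interests: information security, data mining, artificial intelligence, and intelligent decision-making.
  • Supported by:

    Jiangsu Province Key Science and Technology Project (BE2006357)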

A Text Clustering Algorithm Based on Position Weighting

JIN Chunxia,ZHOU Haiyan   

  1. (School of Computer Engineering,Huaiyin Institute of Technology,Huaian 223003,China)
  • Received:2010-09-15 Revised:2011-12-28 Online:2011-06-25 Published:2011-06-25

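Abstract (translated from the Chinese):

Text clustering is an important research topic in natural language processing, and text clustering techniques are widely used in information retrieval, Web mining, digital libraries, and other fields. Observing that a feature term contributes differently to a document depending on its position in that document, this paper proposes TCABPW, an improved text clustering algorithm based on position weighting of feature terms. A new text feature vector is constructed from the top L highest-weighted feature terms that reflect the document's topic, and clustering is carried out by an improved algorithm that combines hierarchical clustering with K-means. Experimental results show that the improved algorithm greatly reduces the dimensionality of text clustering without affecting clustering quality, improves significantly in both stability and purity, and achieves good clustering results.

Keywords: text clustering, text vector, feature selection, position weighting, inter-cluster similarity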

Abstract:

Document clustering is an important research topic in natural language processing and is widely applied in areas such as information retrieval, Web mining, and digital libraries. Because a feature term contributes differently to a document depending on where it appears, this paper proposes TCABPW, a text clustering algorithm based on position weighting. A new text vector is constructed by selecting the L highest-weighted feature terms that reflect the topic of the document, and clustering is then performed by combining hierarchical clustering with the K-means method. The results show that the algorithm greatly reduces the dimensionality of text clustering without degrading clustering quality, and that it significantly improves the stability and purity of the resulting clusters.

Key words: text clustering; text vector; feature selection; position weighting; inter-cluster similarity
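
The pipeline summarized in the abstract (position weighting, selection of the top L feature terms, then hierarchical clustering combined with K-means) can be illustrated with a minimal Python sketch. It is not the authors' TCABPW implementation: the abstract gives neither the weighting formula nor how the two clustering methods are combined. The position weights in `POSITION_WEIGHTS`, the use of cosine similarity, the use of hierarchical merging to seed K-means, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical position weights; the abstract only states that position matters.
POSITION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}

def position_weighted_vector(doc, top_l):
    """doc maps a position name to its tokens; keep the top_l heaviest terms."""
    weights = Counter()
    for position, tokens in doc.items():
        for token in tokens:
            weights[token] += POSITION_WEIGHTS[position]
    # Sort by weight (descending), then by term so ties break deterministically.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:top_l])

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # adds the weights term by term
    return {t: w / len(vectors) for t, w in total.items()}

def hierarchical_seeds(vectors, k):
    """Agglomerative merging by centroid similarity until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cosine(centroid([vectors[i] for i in clusters[a]]),
                             centroid([vectors[i] for i in clusters[b]]))
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return [centroid([vectors[i] for i in c]) for c in clusters]

def kmeans(vectors, seeds, max_iter=20):
    """K-means with cosine similarity, started from the given seed centroids."""
    centroids = list(seeds)
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(len(centroids)),
                          key=lambda c: cosine(vec, centroids[c]))
                      for vec in vectors]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [vectors[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
    return labels

def cluster(docs, k, top_l):
    vectors = [position_weighted_vector(d, top_l) for d in docs]
    return kmeans(vectors, hierarchical_seeds(vectors, k))

docs = [
    {"title": ["text", "clustering"], "body": ["kmeans", "text", "vector"]},
    {"title": ["clustering", "text"], "body": ["hierarchical", "clustering", "text"]},
    {"title": ["image", "segmentation"], "body": ["pixel", "image", "edge"]},
    {"title": ["image", "edge"], "body": ["pixel", "segmentation", "image"]},
]
labels = cluster(docs, k=2, top_l=3)
print(labels)
```

On this toy input the two documents about text clustering receive one label and the two about images receive the other. Keeping only the `top_l` terms per document is what limits the dimensionality of each vector; seeding K-means from the hierarchical result removes the dependence on random initial centroids, which is one plausible reading of the stability claim in the abstract.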