• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2012, Vol. 34 ›› Issue (6): 140-145.

• Papers

A Text Feature Selection Algorithm Based on Analysing the Relationship Between Words

WU Shuang, ZHANG Wensheng, XU Hairui

  1. (Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China)
  • Received: 2011-04-29 Revised: 2011-07-15 Online: 2012-06-25 Published: 2012-06-25
  • Supported by:

    National Natural Science Foundation of China (90924026)

Abstract:

Traditional feature selection methods typically use a feature evaluation function to pick, from the original word set, the features that best discriminate between categories. These methods rest on a vector space model that treats individual words as independent semantic units. They ignore the associations between words and therefore have difficulty bringing out the key features of a text. To address these shortcomings, this paper proposes a new text feature selection algorithm based on the relationships between words. The method considers the words that play a key role in representing the text content, uses an association rule mining algorithm to discover associations between words, and filters the strong association rules through correlation analysis, finally producing a feature space closely related to the category attributes. Experimental results show that the method represents the semantic content of texts better and that its classification performance exceeds that of traditional algorithms.

Key words: relationship between words; feature selection; association rule; text categorization
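
The pipeline outlined in the abstract (mine association rules between words, then filter the strong rules with a correlation check) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not name the correlation measure or any thresholds, so the use of lift as the correlation test, the restriction to word pairs, and the names `mine_word_rules`, `select_features`, `min_support`, `min_confidence`, and `min_lift` are all assumptions made for illustration. The sketch also ignores category labels; one plausible way to tie the features to categories would be to run it separately on the documents of each category.

```python
from collections import Counter
from itertools import combinations


def mine_word_rules(docs, min_support=0.3, min_confidence=0.6, min_lift=1.0):
    """Mine pairwise word rules x -> y from tokenized documents.

    Assumption: lift stands in for the paper's "correlation analysis".
    Returns tuples (x, y, support, confidence, lift).
    """
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    # Document frequency of each word and of each unordered word pair.
    word_count = Counter(w for s in doc_sets for w in s)
    pair_count = Counter(
        pair for s in doc_sets for pair in combinations(sorted(s), 2)
    )
    rules = []
    for (a, b), c in pair_count.items():
        support = c / n
        if support < min_support:
            continue  # the pair is not frequent enough
        for x, y in ((a, b), (b, a)):
            confidence = c / word_count[x]
            # lift > 1 means x and y co-occur more often than if independent.
            lift = confidence / (word_count[y] / n)
            if confidence >= min_confidence and lift > min_lift:
                rules.append((x, y, support, confidence, lift))
    return rules


def select_features(rules):
    """Collect every word that takes part in a surviving rule."""
    return sorted({w for x, y, *_ in rules for w in (x, y)})


if __name__ == "__main__":
    # Toy corpus (hypothetical data): "the" occurs everywhere, so it is
    # uncorrelated with every other word and is filtered out by the lift test.
    docs = [
        ["stock", "market", "the"],
        ["stock", "market", "the"],
        ["stock", "market", "the"],
        ["goal", "team", "the"],
        ["goal", "team", "the"],
        ["goal", "team", "the"],
    ]
    print(select_features(mine_word_rules(docs)))
```

In this toy corpus, a word that appears in every document yields rules with a lift of exactly 1, so the correlation filter discards it even though its support and confidence are high. That is the kind of rule a support/confidence threshold alone would not remove.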