• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2007, Vol. 29 ›› Issue (11): 152-156.

• Articles •

A Feature Selection Method for Text Classification that Integrates NER

施德明 林洋港 陈恩红   

  • Publication date: 2007-11-01 Release date: 2010-05-30

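Abstract (translated from the Chinese):

Text classification is the task of automatically assigning free text to a set of predefined categories, and it plays an important role in information retrieval and related fields. Selecting effective text features is an important step that affects the performance of a text classifier. In many applications, the text to be processed contains many named entities, such as well-known figures in a particular industry, which can often influence the category of a text to a large extent. However, current feature selection methods use only the statistical properties of keywords and do not consider the classification information that a keyword carries as a named entity. To address this problem, this paper proposes a method that integrates named entity recognition (NER) into feature selection for text classification: in addition to retaining the statistical features of keywords, it also retains the classification features of words as named entities. Experimental results show that, compared with other feature selection methods, the proposed method improves text classification accuracy to some extent.

Keywords: named entity recognition, named entity, feature selection, text classification, hidden Markov model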

Abstract:

Text classification (TC) is the process of automatically assigning predefined categories to free-text documents, and it is very important to information retrieval and other areas. A key step in TC is selecting features that effectively represent the class information of the original documents. In some TC applications, documents contain many named entities, e.g., organization names in specific areas, which may significantly influence the classification of the documents. In recent research, however, feature selection has focused mainly on the orthography of words, disregarding the information a word carries as a named entity. To solve this problem, this paper proposes a feature selection method for text classification based on named entity recognition (NER). The method makes use of the category information of a word as a named entity, as well as its orthographic characteristics. According to the experiments, the method improves classification accuracy to some extent compared with classic feature selection methods.

Key words: named entity recognition (NER), named entity, feature selection, text classification, hidden Markov model (HMM)
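
The idea summarized in the abstract, keeping statistically selected keywords while also keeping named-entity information, can be illustrated with a minimal sketch. The abstract does not specify the scoring statistic, the recognizer interface, or how the two kinds of features are combined, so everything below is an assumption made for illustration: chi-square is used as the statistical score, a small hypothetical dictionary `GAZETTEER` stands in for a real NER system (the keywords suggest an HMM-based recognizer, which is not reproduced here), and the names `select_features` and `featurize` are invented for this example.

```python
from collections import Counter

# Hypothetical gazetteer standing in for a real NER system; the entries
# and entity types are illustrative only.
GAZETTEER = {"yao_ming": "PERSON", "nba": "ORG", "lenovo": "ORG", "intel": "ORG"}


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the top-k terms by chi-square, plus every recognized named entity."""
    n_docs = len(docs)
    class_counts = Counter(labels)
    df = Counter()           # document frequency per term
    df_in_class = Counter()  # document frequency per (term, class)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[(term, label)] += 1

    scores = {}
    for term in df:
        best = 0.0
        for c, n_c in class_counts.items():
            n11 = df_in_class[(term, c)]       # docs in c containing term
            n10 = df[term] - n11               # docs outside c containing term
            n01 = n_c - n11                    # docs in c without term
            n00 = n_docs - n11 - n10 - n01     # docs outside c without term
            best = max(best, chi_square(n11, n10, n01, n00))
        scores[term] = best

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    selected = set(ranked[:k])
    # Assumed integration rule: named entities are retained even when their
    # statistical score alone would not place them in the top k.
    selected |= {t for t in df if t in GAZETTEER}
    return selected


def featurize(tokens, selected):
    """Bag-of-features dict: selected terms plus entity-type indicators."""
    feats = Counter(t for t in tokens if t in selected)
    for t in tokens:
        if t in GAZETTEER:
            feats["NE_TYPE=" + GAZETTEER[t]] += 1
    return dict(feats)
```

In this sketch, a rare entity mention such as a person's name survives selection even when it appears in a single training document, and its entity type is added as an extra feature. Whether the paper combines the two sources of evidence in this way is not stated in the abstract.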