• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

计算机工程与科学 (Computer Engineering & Science), 2021, Vol. 43, Issue (09): 1645-1652.

• Artificial Intelligence and Data Mining •

Text feature selection based on improved CHI and PCA

Wen Wu1,2,3, Wan Yuhui1,2, Zhang Xuhong1,2, Wen Zhiyun1,2


  



  (1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065;
   2. Research Center of New Telecommunication Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065;
   3. Chongqing Information Technology Designing Co. Ltd., Chongqing 401121, China)

  • Received: 2020-05-26  Revised: 2020-07-17  Accepted: 2021-09-25  Online: 2021-09-25  Published: 2021-09-27


Abstract: Text data contain a large amount of noise and many redundant features. To obtain a more representative feature set, a feature selection algorithm (ICHIPCA) that combines an improved chi-square statistic (ICHI) with principal component analysis (PCA) is proposed. First, because the CHI algorithm ignores term frequency, document length, category distribution, and negative correlation, corresponding adjustment factors are introduced to refine the CHI calculation model. Second, the improved CHI model is used to score the features, and the top-ranked features are selected as the preliminary feature set. Finally, PCA extracts the principal components while largely retaining the original information, thereby reducing dimensionality. Experiments with a KNN classifier show that, compared with traditional feature selection algorithms of the same type such as IG and CHI, ICHIPCA improves classification performance across multiple feature dimensions and multiple numbers of categories.


Key words: text classification, principal component analysis (PCA), chi-square statistic (CHI), dimensionality reduction, feature selection
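
The scoring and preliminary-selection stages described in the abstract can be illustrated with a minimal sketch. It uses the classic chi-square score for a term/category pair, χ²(t, c) = N(AD − BC)² / ((A + C)(B + D)(A + B)(C + D)), and not the paper's ICHI: the adjustment factors for term frequency, document length, category distribution, and negative correlation are not given on this page, so they are omitted. The function names `chi_square` and `select_top_k`, the choice of taking the maximum score over categories, and the toy corpus are assumptions made for illustration only.

```python
from collections import Counter


def chi_square(n_docs, a, b, c, d):
    """Classic CHI score of a term/category pair from its 2x2 contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or category carries no signal
    return n_docs * (a * d - b * c) ** 2 / denom


def select_top_k(docs, labels, k):
    """Score each term by its maximum CHI over all categories; keep the top k.

    Taking the maximum over categories is one common convention (an
    assumption here, not necessarily the paper's choice).
    """
    n_docs = len(docs)
    doc_sets = [set(doc) for doc in docs]  # CHI uses document presence only
    vocab = sorted(set().union(*doc_sets))
    class_sizes = Counter(labels)
    scores = {}
    for term in vocab:
        containing = [lab for s, lab in zip(doc_sets, labels) if term in s]
        df = len(containing)
        per_class = Counter(containing)
        best = 0.0
        for cat, size in class_sizes.items():
            a = per_class[cat]
            b = df - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(n_docs, a, b, c, d))
        scores[term] = best
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))
    return ranked[:k], scores


# Hypothetical toy corpus: tokenized documents with one label each.
docs = [
    ["goal", "match", "team"],
    ["team", "coach", "goal"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]

top_terms, scores = select_top_k(docs, labels, k=3)
print(top_terms)
```

In the pipeline the abstract describes, the document vectors restricted to the selected terms would then be passed to PCA for dimensionality reduction and finally to a KNN classifier; those stages are left out here to keep the sketch dependency-free.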