• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (09): 1645-1652.


Text feature selection based on improved CHI and PCA

  

  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  2. Research Center of New Telecommunication Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  3. Chongqing Information Technology Designing Co., Ltd., Chongqing 401121, China

  • Received: 2020-05-26 Revised: 2020-07-17 Accepted: 2021-09-25 Online: 2021-09-25 Published: 2021-09-27

Abstract: Text data contain a large amount of noise and many redundant features. To obtain a more representative feature set, a feature selection algorithm (ICHIPCA) is proposed that combines improved CHI-square statistics (ICHI) with principal component analysis (PCA). First, because the traditional CHI algorithm ignores word frequency, document length, category distribution, and negatively correlated features, corresponding adjustment factors are introduced to improve the CHI calculation model. Second, the improved CHI model is used to score the features, and the top-ranked features are kept as the preliminary feature set. Finally, PCA is applied to extract the principal components, reducing dimensionality while largely retaining the original information. Experiments with a KNN classifier show that, compared with traditional feature selection algorithms such as information gain (IG) and CHI and with other algorithms of the same type, ICHIPCA improves classification performance across multiple feature dimensions and multiple categories.


Key words: text classification, principal component analysis, CHI-square statistics, dimensionality reduction, feature selection
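
The scoring and top-feature selection steps described in the abstract can be illustrated with a minimal sketch. It uses only the classic CHI-square statistic, chi2(t, c) = N(AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D)); the ICHI adjustment factors are not specified in the abstract and are not implemented here, and the PCA step is omitted. The function names `chi_square` and `select_top_k`, the rule of ranking by the maximum score over categories, and the toy corpus are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def chi_square(n_docs, a, b, c, d):
    """Classic CHI-square score for one (term, category) pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n_docs * (a * d - b * c) ** 2 / denom


def select_top_k(docs, labels, k):
    """Rank terms by their maximum CHI score over all categories (an
    assumed aggregation rule) and return the k best terms plus all scores."""
    n = len(docs)
    doc_sets = [set(doc) for doc in docs]
    vocab = set().union(*doc_sets)
    cat_sizes = Counter(labels)
    scores = {}
    for term in vocab:
        df_total = sum(term in s for s in doc_sets)
        best = 0.0
        for cat, size in cat_sizes.items():
            a = sum(term in s for s, y in zip(doc_sets, labels) if y == cat)
            b = df_total - a
            c = size - a
            d = n - size - b
            best = max(best, chi_square(n, a, b, c, d))
        scores[term] = best
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k], scores


# Hypothetical toy corpus: tokenized documents and their category labels.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "score"],
    ["stock", "market", "price"],
    ["market", "price", "team"],
]
labels = ["sports", "sports", "finance", "finance"]

top_terms, scores = select_top_k(docs, labels, k=3)
print(top_terms)
```

In this toy corpus, terms that appear in every document of one category and in no document of the other receive the highest score, while a term such as "team", which appears in both categories, scores lower. In the pipeline the abstract describes, the selected terms would then be vectorized and passed to PCA before classification.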