• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (9): 1761-1767.

• Articles •

A novel algorithm for feature selection on micro-blog short texts

HUANG Xianying,CHEN Hongyang,LIU Yingtao,XIONG Liyuan   

  1. (College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China)
  • Received: 2014-10-28 Revised: 2014-12-18 Online: 2015-09-25 Published: 2015-09-25

Abstract:

Valid features in microblog short texts are sparse and difficult to extract, which reduces the accuracy of text representation, classification and clustering. We propose a novel feature selection algorithm for microblog short texts that combines statistical and semantic information. The evaluation function is built from Term Frequency-Inverse Document Frequency (TF-IDF), part of speech (POS) and term length. Combined with the semantic relevance between a term and the microblog short text, it drives feature selection so that the selected terms represent the meaning of the text more accurately. We integrate the new feature selection algorithm with a Naive Bayesian classification algorithm. Experiments on an open microblog corpus show that the proposed algorithm achieves a higher text categorization precision than traditional strategies, indicating that the terms it selects represent the topic of microblog short texts more accurately.

Key words: micro-blog short text; feature selection; statistics and semantic information; POS grouping; Naive Bayesian classification algorithm
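
The abstract names the ingredients of the evaluation function (TF-IDF, POS and term length, plus a semantic relevance term) but not how they are combined. The following is only a minimal sketch of one plausible combination, not the authors' formula: the multiplicative form, the POS weights, the length normalization and every function name here are assumptions made for illustration. The semantic relevance measure is not specified in the abstract, so it appears as an optional hypothetical callback that defaults to a constant.

```python
import math
from collections import Counter

# Hypothetical POS weights; the abstract does not give the actual values.
POS_WEIGHT = {"noun": 1.0, "verb": 0.6, "adj": 0.4}
DEFAULT_POS_WEIGHT = 0.2


def idf(term, corpus):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    df = sum(1 for doc in corpus if any(t == term for t, _ in doc))
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0


def score_terms(doc, corpus, relevance=None):
    """Score each term of `doc`, a list of (term, pos) pairs.

    Assumed form: score = TF-IDF * POS weight * normalized length * relevance.
    `relevance(term, doc)` is a hypothetical hook for a semantic measure.
    """
    if not doc:
        return {}
    counts = Counter(t for t, _ in doc)
    pos_of = {t: p for t, p in doc}
    max_len = max(len(t) for t in counts)
    scores = {}
    for term, n in counts.items():
        tf_idf = (n / len(doc)) * idf(term, corpus)
        pos_w = POS_WEIGHT.get(pos_of[term], DEFAULT_POS_WEIGHT)
        len_w = len(term) / max_len
        rel = relevance(term, doc) if relevance else 1.0
        scores[term] = tf_idf * pos_w * len_w * rel
    return scores


def select_features(doc, corpus, k, relevance=None):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = score_terms(doc, corpus, relevance)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy, hand-tagged example data (not from the paper's corpus).
doc = [("microblog", "noun"), ("feature", "noun"), ("is", "verb"), ("sparse", "adj")]
corpus = [doc, [("weather", "noun"), ("is", "verb"), ("nice", "adj")]]
print(select_features(doc, corpus, k=2))  # ['microblog', 'feature']
```

In a pipeline like the one described, the selected terms would then serve as the feature vocabulary for a Naive Bayesian classifier. A real implementation would also need a word segmenter and POS tagger for the microblog text, which this sketch takes as given.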