• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (09): 1761-1767.

• Articles •



  1. (College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054)
  • Received: 2014-10-28 Revised: 2014-12-18 Online: 2015-09-25 Published: 2015-09-25


A novel algorithm for feature selection on micro-blog short texts

HUANG Xianying,CHEN Hongyang,LIU Yingtao,XIONG Liyuan   

  1. (College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China)
  • Received: 2014-10-28 Revised: 2014-12-18 Online: 2015-09-25 Published: 2015-09-25



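Keywords: microblog short text, feature term selection, statistical and semantic information, part-of-speech combination, Naive Bayes classification algorithm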


Valid features in microblog short texts are sparse and difficult to extract, which reduces the accuracy of text representation, classification, and clustering. We propose a novel feature selection algorithm for microblog short texts that combines statistical and semantic information. An evaluation function is constructed from term frequency-inverse document frequency (TF-IDF), part of speech (POS), and term length; it is then combined with the semantic relevance between each term and the microblog short text to select features, so that the selected terms represent the meaning of the text more accurately. We integrate the new feature selection algorithm with the Naive Bayes classification algorithm. Experiments on an open microblog corpus show that the proposed algorithm achieves higher text classification precision than traditional strategies, indicating that the terms it selects represent the topic of microblog short texts more accurately.

Key words: micro-blog short text; feature selection; statistics and semantic information; POS grouping; Naive Bayesian classification algorithm
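
The abstract names the ingredients of the evaluation function (TF-IDF, POS, term length, and a semantic relevance term) but does not give its formula. The following is a minimal Python sketch of one way such a score could be assembled, not the authors' method. The multiplicative combination, the `POS_WEIGHT` values, the length normalization, and the `relevance` callback are all assumptions made for illustration.

```python
import math

# Hypothetical POS weights; the abstract does not state the actual values.
POS_WEIGHT = {"noun": 1.0, "verb": 0.6, "adj": 0.4}
DEFAULT_POS_WEIGHT = 0.1


def tf_idf(term, doc_terms, corpus_terms):
    """Smoothed TF-IDF of `term` in one document (lists of term strings)."""
    tf = doc_terms.count(term) / len(doc_terms)
    df = sum(1 for d in corpus_terms if term in d)
    idf = math.log(len(corpus_terms) / (1 + df)) + 1.0
    return tf * idf


def select_features(doc, corpus, k=3, relevance=None):
    """Return the k highest-scoring terms of `doc`.

    `doc` and each corpus entry are lists of (term, pos) pairs.
    `relevance(term, doc_terms)` is an assumed hook for the semantic
    relevance between a term and the text; it defaults to 1.0.
    """
    terms = [t for t, _ in doc]
    corpus_terms = [[t for t, _ in d] for d in corpus]
    max_len = max(len(t) for t in terms)
    scores = {}
    for term, pos in set(doc):
        # Statistical part: TF-IDF scaled by POS weight and relative length.
        stat = tf_idf(term, terms, corpus_terms)
        stat *= POS_WEIGHT.get(pos, DEFAULT_POS_WEIGHT)
        stat *= len(term) / max_len
        # Semantic part: assumed to be a multiplicative factor.
        sem = relevance(term, terms) if relevance else 1.0
        scores[term] = max(scores.get(term, 0.0), stat * sem)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

In this sketch the selected terms would then serve as the vocabulary passed to a Naive Bayes classifier. A real system would replace the default `relevance` with an actual semantic similarity measure and tune the weights on labeled data.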