• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (09): 1761-1767.

• Paper •

A New Feature-Word Selection Algorithm for Microblog Short Texts

黄贤英,陈红阳,刘英涛,熊李媛   

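  1. (College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054)
  • Received: 2014-10-28  Revised: 2014-12-18  Published: 2015-09-25  Released: 2015-09-25
  • Funding:

    National Natural Science Foundation of China (61173184); Science and Technology Program of the Chongqing Municipal Education Commission (KJ100821); Natural Science Foundation of the Chongqing Science and Technology Commission (CSTC2012jjA40030)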

A novel algorithm for feature selection on micro-blog short texts

HUANG Xianying,CHEN Hongyang,LIU Yingtao,XIONG Liyuan   

  1. (College of Computer Science and Engineering,Chongqing University of Technology,Chongqing 400054,China)
  • Received:2014-10-28 Revised:2014-12-18 Online:2015-09-25 Published:2015-09-25

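Abstract (Chinese version):

Effective features in microblog short texts are sparse and hard to extract, which degrades the accuracy of microblog text representation, classification, and clustering. To address this problem, this paper proposes a feature-word selection algorithm for microblog short texts that combines statistical and semantic information. Based on part-of-speech (POS) combination matching rules, the algorithm builds a comprehensive evaluation function from each term's TFIDF, POS, and word-length factors, and combines it with the semantic relevance between the term and the text content to select feature words, so that the selected feature words accurately represent the topic of the microblog short text. The new feature-word selection algorithm is combined with the naive Bayes classification algorithm and evaluated on a microblog classification corpus. The results show that, compared with other traditional algorithms, the new algorithm yields higher classification accuracy on microblog short texts, indicating that the feature words it selects represent the topic of microblog short texts more accurately.

Keywords: microblog short text, feature-word selection, statistical and semantic information, POS combination, naive Bayes classification algorithm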
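The scoring idea described in the abstract can be sketched roughly as follows. This is a minimal Python illustration, not the authors' implementation: the abstract does not state the evaluation function, so the POS weights, the `ALPHA` balance, the way the factors are multiplied and summed, and the character-overlap `semantic_relevance` placeholder are all assumptions made for the example. A POS whitelist stands in for the POS combination matching rules.

```python
import math
from collections import Counter

# Hypothetical values: the abstract does not give the actual weights.
POS_WEIGHT = {"n": 1.0, "v": 0.6, "a": 0.4}  # noun, verb, adjective
ALPHA = 0.7  # assumed balance between statistical score and semantic relevance


def tfidf(term, doc, corpus):
    """Plain TF-IDF; doc and corpus entries are lists of (word, pos) pairs."""
    tf = Counter(w for w, _ in doc)[term] / len(doc)
    df = sum(1 for d in corpus if any(w == term for w, _ in d))
    return tf * math.log((1 + len(corpus)) / (1 + df))


def length_factor(term, doc):
    """Term length normalized by the longest term in the document."""
    return len(term) / max(len(w) for w, _ in doc)


def semantic_relevance(term, doc):
    """Placeholder only: share of other distinct terms that overlap in characters.

    A real system would use a semantic lexicon or embeddings instead.
    """
    others = {w for w, _ in doc if w != term}
    if not others:
        return 0.0
    return sum(1 for w in others if set(w) & set(term)) / len(others)


def select_features(doc, corpus, k=2):
    """Return the k highest-scoring terms whose POS is in POS_WEIGHT."""
    scores = {}
    for term, pos in set(doc):
        if pos not in POS_WEIGHT:  # stand-in for the POS matching rules
            continue
        stat = tfidf(term, doc, corpus) * POS_WEIGHT[pos] * length_factor(term, doc)
        scores[term] = ALPHA * stat + (1 - ALPHA) * semantic_relevance(term, doc)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Calling `select_features(doc, corpus, k)` on a POS-tagged document returns its top `k` candidate feature words. Terms that occur in every document receive a zero statistical score, so only the placeholder relevance term contributes for them.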

Abstract:

The valid features of microblog short texts are sparse and difficult to extract, which reduces the accuracy of text representation, classification, and clustering. We propose a novel feature selection algorithm for microblog short texts that combines statistical and semantic information. We use Term Frequency-Inverse Document Frequency (TFIDF), part of speech (POS), and term length to construct an evaluation function, and combine it with the semantic relevance between each term and the microblog short text to select features, so that the selected terms represent the meaning of the text more accurately. We integrate the new feature selection algorithm with the Naive Bayesian categorization algorithm. Experiments on an open microblog corpus show that the proposed algorithm achieves a higher text categorization precision rate than traditional strategies, indicating that the terms it selects represent the topic of microblog short texts more accurately.

Key words: micro-blog short text; feature selection; statistics and semantic information; POS grouping; Naive Bayesian classification algorithm
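To illustrate how selected feature words could feed a Naive Bayesian classifier, here is a minimal multinomial naive Bayes sketch in Python with add-one (Laplace) smoothing. It is a generic textbook formulation, not the configuration used in the paper; the function names `train_nb` and `predict_nb` and the toy labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """samples: list of (feature_words, label) pairs. Returns the model as a tuple."""
    label_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for words, label in samples:
        word_counts[label].update(words)
    vocab = {w for words, _ in samples for w in words}
    return label_counts, word_counts, vocab


def predict_nb(model, words):
    """Return the label with the highest log posterior for the given feature words."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(label_counts):
        logp = math.log(label_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            if w in vocab:  # ignore words never seen in training
                logp += math.log((word_counts[label][w] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

In this setup each microblog is represented only by its selected feature words, so the quality of the selection step directly determines what evidence the classifier sees.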