• An official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2012, Vol. 34 ›› Issue (9): 160-165.

• Articles

Classification of Microblog Sentiment Based on Naïve Bayesian

LIN Jianghao1, YANG Aimin2, ZHOU Yongmei2, CHEN Jin3, CAI Zejian2

  (1. School of Management, Guangdong University of Foreign Studies, Guangzhou 510006, China;
    2. Cisco School of Informatics, Guangdong University of Foreign Studies, Guangzhou 510006, China;
    3. School of English Language and Culture, Guangdong University of Foreign Studies, Guangzhou 510006, China)
  • Received: 2012-04-13 Revised: 2012-06-25 Online: 2012-09-25 Published: 2012-09-25
  • Supported by:

    National Social Science Foundation of China project (12BYY045); Ministry of Education Humanities and Social Sciences Research Youth Project (10YJCZH247); Guangdong Provincial Science and Technology Plan project (2010B031000014); Guangdong University of Foreign Studies Graduate Research Innovation project; Guangdong University of Foreign Studies Undergraduate Innovation Experiment project

Abstract:

This paper uses a two-stage ("twice") sentiment feature extraction algorithm: the first stage extracts sentiment features from the text using syntactic dependency relations, and the second stage refines them using a sentiment lexicon. A naïve Bayes classifier is then built to classify the sentiment polarity of collected hot-topic microblog posts and hotel reviews. The experiments compare the classification performance of different combinations of emoticons, punctuation marks, lexicon-based feature extraction, and two-stage sentiment feature extraction, in order to find a better preprocessing method for microblog sentiment classification. The microblog results are also compared with those for hotel reviews to identify the factors that affect microblog sentiment classification performance. The results show that two-stage feature extraction achieves a higher F1 score, and that the best preprocessing configuration is "emoticons + punctuation marks + two-stage sentiment feature extraction + BOOL values". The naïve Bayes classifier also performs better on hotel reviews than on microblog posts, mainly because the objects being evaluated in microblog posts are more diverse.

Key words: microblog; text sentiment classification; twice sentiment feature extraction; naïve Bayesian
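
As a rough illustration of the kind of classifier the abstract describes, the sketch below trains a naive Bayes model on Boolean (present/absent) features and computes F1 for one class. It is not the authors' implementation: the Bernoulli formulation, the Laplace smoothing constant `alpha`, the function names, and the toy feature sets and labels are all assumptions made for illustration, and the dependency-based and lexicon-based extraction stages are not reproduced.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(samples, labels, alpha=1.0):
    """Fit a Bernoulli naive Bayes model on Boolean features.

    samples: list of sets, each holding the features present in one text.
    labels:  list of class labels, one per sample.
    alpha:   Laplace smoothing constant (an assumption, not from the paper).
    """
    vocab = set().union(*samples)
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: defaultdict(int))
    for feats, y in zip(samples, labels):
        class_count[y] += 1
        for f in feats:
            feat_count[y][f] += 1
    n = len(samples)
    model = {"vocab": vocab, "prior": {}, "p": {}}
    for y, cnt in class_count.items():
        model["prior"][y] = math.log(cnt / n)
        # P(feature present | class), smoothed over the two outcomes.
        model["p"][y] = {
            f: (feat_count[y][f] + alpha) / (cnt + 2 * alpha) for f in vocab
        }
    return model


def predict(model, feats):
    """Return the class with the highest posterior log-probability."""
    best, best_score = None, -math.inf
    for y, log_prior in model["prior"].items():
        score = log_prior
        for f in model["vocab"]:
            p = model["p"][y][f]
            # Boolean feature: use p if present, 1 - p if absent.
            score += math.log(p if f in feats else 1.0 - p)
        if score > best_score:
            best, best_score = y, score
    return best


def f1_score(gold, pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (invented): each post is reduced to the set of features it contains,
# e.g. sentiment words, emoticons, and punctuation marks.
train_x = [{"happy", ":)", "!"}, {"great", ":)"}, {"awful", ":(", "!"}, {"terrible", ":("}]
train_y = ["pos", "pos", "neg", "neg"]
model = train_bernoulli_nb(train_x, train_y)
print(predict(model, {"great", ":)"}))
```

Representing each text as a set makes the "BOOL" weighting explicit: a feature either occurs or it does not, and repeated occurrences carry no extra weight.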