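• Official journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)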


Analysis on topic evolution of news comments by
combining word vector and clustering algorithm

LIN Jianghao1, ZHOU Yongmei1,2, YANG Aimin1,2, WANG Wei2

  1. Laboratory for Language Engineering and Computing, Guangdong University of Foreign
    Studies, Guangzhou 510006, China;
  2. Cisco School of Informatics, Guangdong University of Foreign Studies, Guangzhou
    510006, China
  • Received: 2016-07-01 Revised: 2016-09-05 Online: 2016-11-25 Published: 2016-11-25
  • Funding:

    National Social Science Fund of China (12BYY045); Guangdong Province Philosophy and
    Social Sciences "Twelfth Five-Year" Planning Project (GD15YTS01)

Abstract:

Topic evolution analysis mines how the content of a topic changes over time. Working from
the assumption that topic content can be represented by keywords, this article trains
word2vec on 750,000 news and microblog texts to obtain a word vector model. The text
stream is preprocessed and fed to the model to obtain the word vectors of all words along
the time series. Kmeans is then used to cluster these word vectors so that topic keywords
can be extracted and the evolution of topics visualized. Experiments compare this approach
with topic extraction based on the PLSA and LDA topic models and find that it gives better
topic analysis results than either of them. In addition, collecting a sufficiently large
and varied corpus makes it possible to train a word vector model with stronger
generalization ability, which benefits research on real-time topic evolution analysis.
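
The clustering step described in the abstract can be illustrated with a minimal sketch. It
is not the authors' implementation: the two-dimensional vectors in TOY_VECTORS are made-up
stand-ins for the output of a trained word2vec model, the time slices in SLICES are
hypothetical, and a plain-Python k-means replaces whatever Kmeans implementation the
article used. Ranking keywords by distance to the cluster centroid is likewise an
assumption made for illustration.

```python
import math
import random

# Hypothetical 2-D "word vectors"; a real word2vec model would supply
# vectors with hundreds of dimensions learned from the corpus.
TOY_VECTORS = {
    "earthquake": (0.90, 0.10), "rescue": (0.80, 0.20), "aftershock": (0.85, 0.15),
    "rebuild": (0.75, 0.30),
    "election": (0.10, 0.90), "vote": (0.20, 0.80), "candidate": (0.15, 0.85),
    "ballot": (0.25, 0.75),
}

# Hypothetical time slices: the words observed in the text stream per period.
SLICES = {
    "week1": ["earthquake", "rescue", "aftershock", "election", "vote", "candidate"],
    "week2": ["rescue", "rebuild", "aftershock", "vote", "ballot", "candidate"],
}


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's k-means on a list of equal-length tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for step in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if step > 0 and new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def topic_keywords(words, vectors, k=2, top_n=3, seed=0):
    """Cluster the words of one time slice; return up to top_n keywords per cluster."""
    points = [vectors[w] for w in words]
    labels, centroids = kmeans(points, k, seed=seed)
    topics = []
    for c in range(k):
        members = [w for w, lab in zip(words, labels) if lab == c]
        # Words closest to the centroid are treated as most representative.
        members.sort(key=lambda w: math.dist(vectors[w], centroids[c]))
        topics.append(members[:top_n])
    return topics


if __name__ == "__main__":
    for name, words in SLICES.items():
        print(name, topic_keywords(words, TOY_VECTORS))
```

Comparing the keyword groups of successive slices (here, "earthquake" giving way to
"rebuild" in one cluster) is one simple way to read off how a topic drifts over time.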

Key words: topic evolution, word2vec, PLSA, LDA