• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science


A microblog hotspot mining method based on semantic clustering of word vectors

刘培磊,唐晋韬,王挺,谢松县,岳大鹏,刘海池   

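  1. (College of Computer, National University of Defense Technology, Changsha, Hunan 410073)
  • Received: 2016-03-29 Revised: 2016-06-07 Online: 2018-02-25 Published: 2018-02-25
  • Funding:

    National Natural Science Foundation of China (61532001, 61472436)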

A Twitter hotspot mining method based on semantic clustering of word vectors
 

LIU Pei-lei,TANG Jin-tao,WANG Ting,XIE Song-xian,YUE Da-peng,LIU Hai-chi   

  1. (College of Computer,National University of Defense Technology,Changsha 410073,China)
  • Received:2016-03-29 Revised:2016-06-07 Online:2018-02-25 Published:2018-02-25

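Abstract:

With the rapid development of social media, information overload has become increasingly severe, so discovering and mining hot topics or hot events from massive, short, and noisy social media data has become an important problem. Exploiting the real-time and geographic nature of social media data and its rich metadata, this paper proposes a hotspot mining method that combines user behavior analysis with text content analysis. In the content analysis stage, clustering is performed at the finer granularity of words, replacing the classical approach of clustering at the message level. To improve topic keyword extraction, word vector techniques are introduced and hotspots are mined through semantic clustering. Experimental results on real datasets show that the keywords extracted by this method are strongly semantically related and the topics are well separated, and that the method outperforms traditional hotspot mining methods on the main metrics.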

 

Keywords: hotspot mining, social media, word vectors, semantic clustering

Abstract:

With the rapid development of social media, information overload has become a challenge. As a result, automatically mining hotspots from large volumes of short, noisy data is an important problem. Social media data are real-time and geographic, and usually carry plenty of meta-information. Based on these characteristics, this paper proposes a hotspot mining method that combines analysis of user behavior patterns with text content analysis. In the content analysis stage, we cluster text at the word level rather than the message level. In addition, semantic clustering of word vectors is used to improve keyword extraction. Experimental results on real datasets show that this method outperforms traditional hotspot mining methods on the main metrics. Specifically, the keywords it extracts have strong semantic relevance, and the resulting topics are well separated.
 

Key words: hotspot mining, Twitter, word embedding, semantic clustering
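
The word-level clustering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, the similarity measure, or how the word vectors are trained, so everything below is an assumption made for illustration: the vectors in `TOY_VECTORS` are hand-written toy values rather than trained embeddings, and the sketch uses single-link grouping on cosine similarity with a hypothetical `threshold` parameter. It is not the authors' implementation.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_words(vectors, threshold=0.9):
    """Group words so that any pair with cosine similarity >= threshold
    ends up in the same cluster (single-link grouping via union-find).
    This is an illustrative stand-in, not the paper's algorithm."""
    words = sorted(vectors)
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the cluster root, compressing the path.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for w in words:
        groups[find(w)].append(w)
    return sorted(sorted(g) for g in groups.values())


# Toy 3-dimensional vectors (assumed values, not real embeddings).
TOY_VECTORS = {
    "earthquake": [0.9, 0.1, 0.0],
    "tremor": [0.8, 0.2, 0.1],
    "rescue": [0.7, 0.3, 0.0],
    "concert": [0.0, 0.9, 0.3],
    "tickets": [0.1, 0.8, 0.4],
}

topics = cluster_words(TOY_VECTORS, threshold=0.9)
print(topics)
```

Each resulting group of words plays the role of one candidate topic's keyword set. In practice the vectors would come from an embedding model trained on the message corpus, and the threshold would need tuning against real data.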