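• Official journal of the China Computer Federation
• Chinese Science and Technology Core Journal
• Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)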

• Artificial Intelligence and Data Mining

Individual microblog clustering by semantic correlation based on HowNet
(基于知网的个人微博语义相关度的聚类研究)

GAO Yongbing (高永兵) 1; SONG Tianshu (宋添树) 1,2; LI Jiangyu (李江宇) 1; MA Zhanfei (马占飞) 3

  1. School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, China
  2. School of Computer Science and Engineering, Guilin University of Aerospace Technology, Guilin 541004, China
  3. Department of Computer, Baotou Teachers' College, Baotou 014030, China

  • Received: 2018-03-06  Revised: 2018-10-17  Online: 2019-06-25
  • Funding:

    National Natural Science Foundation of China (61762071); Natural Science Foundation of Inner Mongolia Autonomous Region (2015MS0621)

Abstract:

Clustering an individual's highly correlated microblog posts helps readers quickly grasp the blogger's professional interests and experiences. Existing short-text clustering methods do not adequately account for semantics or sentence-level correlation. We propose a method for clustering individual microblogs by semantic correlation based on HowNet. Its main steps are as follows: (1) train a Skip-gram model on a large corpus of microblog text to generate word vectors; (2) disambiguate the words within each sentence according to their sememes; (3) compute the word-level and the sentence-level similarity between an individual's microblogs and combine the two into a post correlation score; (4) cluster the individual's microblogs according to this correlation. Experimental results show that the proposed method is clearly more accurate than hierarchical clustering and density-based clustering.
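Steps (3) and (4) can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration rather than the authors' algorithm: the toy three-dimensional vectors stand in for Skip-gram embeddings, `word_level` and `sentence_level` are hypothetical similarity measures, `alpha` is a hypothetical mixing weight, and posts are grouped by a simple threshold on the combined score. The sememe-based disambiguation of step (2) is omitted entirely.

```python
import math
from itertools import combinations

# Toy 3-d vectors standing in for Skip-gram embeddings (made-up values).
VECTORS = {
    "python": (0.9, 0.1, 0.0),
    "code":   (0.8, 0.2, 0.1),
    "bug":    (0.7, 0.3, 0.0),
    "hiking": (0.0, 0.9, 0.2),
    "trail":  (0.1, 0.8, 0.3),
    "summit": (0.0, 0.7, 0.4),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_level(p, q):
    """Assumed word-level score: mean best-match cosine, symmetrised."""
    def one_way(a, b):
        return sum(max(cosine(VECTORS[x], VECTORS[y]) for y in b) for x in a) / len(a)
    return 0.5 * (one_way(p, q) + one_way(q, p))


def sentence_level(p, q):
    """Assumed sentence-level score: cosine of the mean word vectors."""
    def mean(words):
        return tuple(sum(c) / len(words) for c in zip(*(VECTORS[w] for w in words)))
    return cosine(mean(p), mean(q))


def relatedness(p, q, alpha=0.5):
    """Combine both scores; alpha is a hypothetical mixing weight."""
    return alpha * word_level(p, q) + (1 - alpha) * sentence_level(p, q)


def cluster(posts, threshold=0.8):
    """Group posts linked by relatedness >= threshold (connected components)."""
    parent = list(range(len(posts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(posts)), 2):
        if relatedness(posts[i], posts[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(posts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Four tokenised toy posts: two about programming, two about hiking.
POSTS = [
    ["python", "code"],
    ["code", "bug"],
    ["hiking", "trail"],
    ["trail", "summit"],
]

print(cluster(POSTS))
```

On this toy data the two programming posts and the two hiking posts end up in separate groups. With real data, the threshold and the weight `alpha` would need tuning, and the grouping step could be replaced by whichever clustering procedure the full paper specifies.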
 
 

Key words: individual microblog, HowNet, semantics, clustering, disambiguation