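• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Artificial Intelligence and Data Mining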

A microblog retrieval model combining BTM and graph theory

CAI Chen1,2, LUO Ke1,2

  1. School of Computer & Communication Engineering, Changsha University of Science and Technology, Changsha 410114, Hunan, China
  2. Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation, Changsha University of Science and Technology, Changsha 410114, Hunan, China
  • Received: 2018-09-27 Revised: 2018-11-30 Online: 2019-08-25 Published: 2019-08-25
  • Supported by:

    National Natural Science Foundation of China (11671125, 71371065)

Abstract:

Microblog data are voluminous, yet individual posts contain few characters and their features are sparse. To improve retrieval precision, we propose a microblog retrieval model that combines the bi-term topic model (BTM) with graph theory. The model computes three correlations. First, lexical semantic relatedness is used to calculate the correlation between tagged features in microblog text. Second, a bi-term topic model is constructed, and the Jensen-Shannon divergence (JSD) is used to calculate the correlation of the word pairs of short texts mapped into that model. Third, entities and their graph structure are extracted from CN-DBpedia, and the SimRank algorithm is used to calculate the correlation between entities in the graph. The three correlations are combined into the final correlation of the model. Retrieval experiments on a Sina Weibo data set show that, compared with a retrieval model combining latent Dirichlet allocation (LDA) with graph theory and with a system model based on linked open data and graph theory, the new model performs clearly better in MAP, accuracy, and recall, indicating that it has better retrieval performance.
 

Key words: microblog, short text, similarity calculation, BTM, graph theory, topic model
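
Two of the measures named in the abstract, the JSD between topic distributions and SimRank over an entity graph, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the decay factor of 0.8, the iteration count, and the toy inputs are all assumptions chosen for illustration. How the paper weights and combines the three correlations is not stated in the abstract, so that step is left out.

```python
import math
from itertools import product


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions.

    With base-2 logarithms the value lies in [0, 1], so 1 - JSD can serve as a
    similarity score (an assumption here, not necessarily the paper's choice).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution M = (P + Q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive iterative SimRank.

    in_neighbors maps every node to the list of nodes pointing at it; every
    node must appear as a key, even if its list is empty.
    """
    nodes = list(in_neighbors)
    # Initial scores: a node is fully similar to itself, 0 to everything else.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # A node without in-neighbors is similar only to itself.
                new_sim[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, then decay.
            total = sum(sim[(x, y)] for x in ia for y in ib)
            new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Hypothetical topic distributions of two short texts over three topics.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]
topic_similarity = 1.0 - js_divergence(doc_a, doc_b)

# Hypothetical entity graph: entities "a" and "b" are both linked from "c".
entity_graph = {"a": ["c"], "b": ["c"], "c": []}
entity_similarity = simrank(entity_graph)[("a", "b")]
```

The SimRank loop above visits every node pair on every iteration, which is fine for a toy graph but would need pruning or a matrix formulation for a knowledge graph of realistic size.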