• Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Artificial Intelligence and Data Mining

Research on the application of topic model in short text

HAN Xiao-yun, HOU Zai-en, SUN Mian

  1. (School of Arts and Science, Shaanxi University of Science and Technology, Xi’an 710021, China)
  • Received: 2019-03-19 Revised: 2019-06-13 Online: 2020-01-25 Published: 2020-01-25
  • Funding:

    National Natural Science Foundation of China (11771259)

Abstract:

Traditional topic models for short texts, chiefly those based on Latent Dirichlet Allocation (LDA), are susceptible to feature sparsity, noise, and redundancy. To address this problem, the paper first reviews how text feature representations have changed and how topic models for short texts have developed. It then systematically summarizes the generative processes of the LDA model and the Dirichlet Multinomial Mixture (DMM) model, together with the corresponding Gibbs sampling parameter derivations. For choosing the optimal number of topics, four common optimization indicators are compared in detail. Finally, extensions of topic models from the past two years and a simple application to online public opinion are analyzed, and on that basis the directions and priorities for future topic model research are pointed out.

Key words: latent Dirichlet allocation (LDA) model, Dirichlet multinomial mixture (DMM) model, short text, topic model, internet public opinion, Gibbs sampling
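
The abstract contrasts the generative processes of LDA and DMM. As a rough illustration of that contrast, and not a reproduction of the paper's own derivation, the sketch below draws one synthetic document under each model using only the Python standard library. The helper names (`sample_dirichlet`, `sample_index`, `generate_lda_doc`, `generate_dmm_doc`) and all hyperparameter values are assumptions made for this example. The point it tries to show is the textbook difference: LDA assigns a topic to every word from a per-document mixture, whereas DMM assigns a single topic to the whole document, which is why DMM is often considered for short texts.

```python
import random


def sample_dirichlet(rng, concentration, dim):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(rng, probs):
    """Draw one index from a categorical distribution given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_lda_doc(rng, phi, alpha, doc_len):
    """LDA: per-document topic mixture theta, then one topic per word."""
    theta = sample_dirichlet(rng, alpha, len(phi))
    topics = [sample_index(rng, theta) for _ in range(doc_len)]
    words = [sample_index(rng, phi[z]) for z in topics]
    return topics, words


def generate_dmm_doc(rng, phi, theta, doc_len):
    """DMM: one topic per document, drawn from a corpus-level mixture theta."""
    z = sample_index(rng, theta)
    words = [sample_index(rng, phi[z]) for _ in range(doc_len)]
    return z, words


# Hypothetical sizes and hyperparameters, chosen only for illustration.
rng = random.Random(0)
num_topics, vocab_size, alpha, beta = 3, 8, 0.5, 0.5

# Topic-word distributions phi[k], shared by both models.
phi = [sample_dirichlet(rng, beta, vocab_size) for _ in range(num_topics)]

lda_topics, lda_words = generate_lda_doc(rng, phi, alpha, doc_len=6)
corpus_theta = sample_dirichlet(rng, alpha, num_topics)
dmm_topic, dmm_words = generate_dmm_doc(rng, phi, corpus_theta, doc_len=6)

print("LDA topics per word:", lda_topics, "words:", lda_words)
print("DMM document topic:", dmm_topic, "words:", dmm_words)
```

In this sketch, words are integer vocabulary indices. Gibbs sampling, which the paper derives for both models, works in the opposite direction: it infers the hidden topic assignments from observed words, and is not implemented here.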