• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Artificial Intelligence and Data Mining

A phrase-based Khmer-Chinese bilingual LDA topic model construction method

XIE Qing1, YAN Xin1, NUO Yu1, XU Guang-yi2, ZHOU Feng1, GUO Jian-yi1

  1. (1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650504, China;
    2. Yunnan Nantian Electronics Information Co., Ltd., Kunming, Yunnan 650041, China)
  • Received: 2018-07-03 Revised: 2018-11-08 Online: 2019-08-25 Published: 2019-08-25
  • Supported by:

    National Natural Science Foundation of China (61462055, 61562049)

Abstract:

To obtain the topic distribution of bilingual documents effectively, we propose a phrase-based Khmer-Chinese bilingual LDA topic model. We replace the bag-of-words assumption of the traditional LDA topic model with a phrase (N-gram) representation, so that word order and context are taken into account during topic prediction, and we apply the resulting model to a bilingual setting built on comparable corpora. The model is based on a three-layer Bayesian network. Within this framework, we first collect comparable Chinese and Khmer corpora, in which each pair of comparable bilingual documents shares a single topic distribution. We then introduce a topic model that discovers both topics and topical phrases: for each word, a topic is sampled first; next, the word's status as part of a phrase is sampled; finally, the word itself is sampled from a topic-specific phrase distribution. Experimental results show that the phrase-based bilingual LDA topic model captures the topics of documents better than a general bilingual LDA model and has better topic prediction ability.

Key words: Khmer-Chinese bilingual, phrase, topic model
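
The three sampling steps summarized in the abstract (topic, then phrase status, then word) can be illustrated with a small generative sketch. This is not the authors' implementation: the parameter names (`theta`, `phi`, `sigma`, `psi`), the hyperparameter values, and the choice to condition the phrase indicator on the previous word only are all assumptions made for illustration, and inference (for example Gibbs sampling) is omitted entirely. Words are represented as integer vocabulary indices.

```python
import random

def dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension `size`."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def categorical(probs, rng):
    """Draw an index according to the probability vector `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def make_language_params(num_topics, vocab_size, rng, beta=0.5, delta=0.5):
    """Hypothetical per-language parameters (assumed names, not from the paper)."""
    return {
        # phi[z]: topic-specific unigram distribution over the vocabulary
        "phi": [dirichlet(beta, vocab_size, rng) for _ in range(num_topics)],
        # sigma[z][prev]: topic-specific distribution over the word that follows `prev`
        "sigma": [[dirichlet(delta, vocab_size, rng) for _ in range(vocab_size)]
                  for _ in range(num_topics)],
        # psi[prev]: probability that the next word continues a phrase after `prev`
        "psi": [rng.betavariate(1.0, 1.0) for _ in range(vocab_size)],
    }

def generate_document(theta, params, length, rng):
    """Generate one document; returns (words, topics, phrase_flags)."""
    words, topics, phrase_flags = [], [], []
    prev = None
    for _ in range(length):
        # Step 1: sample the topic of this word from the shared topic distribution.
        z = categorical(theta, rng)
        # Step 2: sample whether this word continues a phrase with the previous word.
        x = 0 if prev is None else int(rng.random() < params["psi"][prev])
        # Step 3: sample the word from the topic-specific phrase (bigram) distribution
        # if it continues a phrase, otherwise from the topic's unigram distribution.
        if x == 1:
            w = categorical(params["sigma"][z][prev], rng)
        else:
            w = categorical(params["phi"][z], rng)
        words.append(w)
        topics.append(z)
        phrase_flags.append(x)
        prev = w
    return words, topics, phrase_flags

def generate_pair(num_topics, params_km, params_zh, len_km, len_zh, rng, alpha=0.5):
    """Generate a comparable Khmer/Chinese document pair sharing one topic distribution."""
    theta = dirichlet(alpha, num_topics, rng)
    doc_km = generate_document(theta, params_km, len_km, rng)
    doc_zh = generate_document(theta, params_zh, len_zh, rng)
    return theta, doc_km, doc_zh

if __name__ == "__main__":
    rng = random.Random(0)
    km = make_language_params(num_topics=3, vocab_size=5, rng=rng)
    zh = make_language_params(num_topics=3, vocab_size=7, rng=rng)
    theta, doc_km, doc_zh = generate_pair(3, km, zh, len_km=8, len_zh=10, rng=rng)
    print("shared theta:", theta)
    print("Khmer words / topics / phrase flags:", doc_km)
    print("Chinese words / topics / phrase flags:", doc_zh)
```

The point of the sketch is the coupling: both documents in a pair draw their topics from the same `theta`, while each language keeps its own word and phrase distributions. A run of consecutive positions whose phrase flag is 1 would be read as one multi-word phrase.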