• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science


A phrase-based Khmer-Chinese bilingual LDA topic model construction method
 

XIE Qing1, YAN Xin1, NUO Yu1, XU Guang-yi2, ZHOU Feng1, GUO Jian-yi1

  (1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504;
   2. Yunnan Nantian Electronics Information Co. Ltd., Kunming 650041, China)
     
  • Received: 2018-07-03  Revised: 2018-11-08  Online: 2019-08-25  Published: 2019-08-25

Abstract:

To obtain the topic distribution of bilingual documents effectively, we propose a phrase-based Khmer-Chinese bilingual LDA topic model. We replace the bag-of-words assumption of the traditional LDA topic model with a phrase (N-gram) representation, so that word order and context are taken into account during topic prediction, and we apply the resulting model to a bilingual comparable corpus. The model is a three-layer Bayesian network. Under this framework, we first collect a comparable Chinese and Khmer corpus in which each pair of comparable bilingual documents shares a common topic distribution. We then introduce a topic model that discovers both topics and topical phrases: for each word, its topic is sampled first; then its phrase status (whether it forms a phrase) is sampled; finally, the word is sampled from the corresponding topic-specific phrase distribution. Experimental results show that the phrase-based bilingual LDA topic model captures the topics of articles better than general bilingual LDA models and has better topic prediction ability.
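
The three sampling steps described in the abstract can be illustrated with a minimal generative sketch. It is not the authors' implementation. It assumes a bigram-style phrase indicator (a word either starts a new unit or continues the previous word), symmetric Dirichlet priors, and a fixed continuation probability `phrase_prob`. All function names, parameters, and the language keys `"zh"` and `"km"` are hypothetical, and the topic-word distributions are drawn per pair only to keep the example short (in a full model they would be shared across the corpus).

```python
import random

def sample_categorical(probs, rng):
    """Draw an index from a discrete distribution given as a list of weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(theta, unigram, bigram, phrase_prob, length, rng):
    """Generate one document from the topic mixture theta.

    unigram[k]       : word distribution of topic k (word starts a new unit)
    bigram[k][prev]  : word distribution of topic k given the previous word
    phrase_prob      : assumed probability that a word continues a phrase
    """
    words, topics, flags = [], [], []
    prev = None
    for _ in range(length):
        # Step 1: sample the topic of the word.
        z = sample_categorical(theta, rng)
        # Step 2: sample the phrase status (the first word cannot continue).
        x = prev is not None and rng.random() < phrase_prob
        # Step 3: sample the word from the topic-specific distribution.
        dist = bigram[z][prev] if x else unigram[z]
        w = sample_categorical(dist, rng)
        words.append(w)
        topics.append(z)
        flags.append(x)
        prev = w
    return words, topics, flags

def generate_pair(num_topics, vocab_zh, vocab_km, length, rng,
                  alpha=0.5, beta=0.5, phrase_prob=0.3):
    """Generate a comparable Chinese/Khmer document pair sharing one theta."""
    # The pair shares a single topic distribution, as stated in the abstract.
    theta = sample_dirichlet(alpha, num_topics, rng)
    docs = {}
    for lang, vocab_size in (("zh", vocab_zh), ("km", vocab_km)):
        # Each language has its own vocabulary and topic-word distributions.
        unigram = [sample_dirichlet(beta, vocab_size, rng)
                   for _ in range(num_topics)]
        bigram = [[sample_dirichlet(beta, vocab_size, rng)
                   for _ in range(vocab_size)]
                  for _ in range(num_topics)]
        docs[lang] = generate_document(theta, unigram, bigram,
                                       phrase_prob, length, rng)
    return theta, docs
```

Calling `generate_pair(3, 6, 5, 8, random.Random(0))` returns the shared topic mixture and, for each language, the word indices, their topics, and their phrase flags. Inference (for example by Gibbs sampling) would invert this process and is not shown here.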
 

Key words: Khmer-Chinese bilingual, phrase, topic model