• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

Construction of topic word embeddings based on HDP: Khmer as an example

LI Chao (1,2), YAN Xin (1,2), XIE Jun (1,2), XU Guang-yi (3), ZHOU Feng (1,2), MO Yuan-yuan (4,5)

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China
  2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650504, China
  3. Yunnan Nantian Electronic Information Industry Co., Ltd., Kunming 650400, China
  4. School of Southeast & South Asia Languages and Culture, Yunnan Minzu University, Kunming 650500, China
  5. Institute of Linguistics, Shanghai Normal University, Shanghai 200234, China

     
  • Received: 2019-07-13 Revised: 2019-12-11 Online: 2020-06-25 Published: 2020-06-25

Abstract:

A single embedding per word cannot distinguish the senses of a polysemous word. To address this problem, a method for constructing topic word embeddings based on the Hierarchical Dirichlet Process (HDP) is proposed, with Khmer as the example language. The method adds topic information to conventional single-vector word embeddings. First, the topic tag of each word is obtained through HDP. The topic tag is then treated as a pseudo word and fed into the Skip-Gram model together with the word, so that embeddings for the topics and embeddings for the words are both trained. Finally, the embedding that carries the topic information of the text is concatenated with the trained word embedding, which gives the topic word embedding of each word in the text. Compared with a word embedding model that does not incorporate topic information, the method achieves better results on word similarity and text classification. The topic word embeddings obtained in this paper therefore carry more semantic information.
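
The pseudo-word and concatenation steps can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the pairs are formed, so the pairing scheme, the window size, the helper names (`topic_pseudo_word`, `skipgram_pairs`, `concat_embedding`), the hard-coded topic tags (which would come from HDP in practice), and the toy vectors (which would come from Skip-Gram training) are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def topic_pseudo_word(topic_id: int) -> str:
    """Map a topic id to a pseudo-word token (hypothetical naming scheme)."""
    return f"<topic_{topic_id}>"


def skipgram_pairs(
    tagged: Sequence[Tuple[str, int]], window: int = 2
) -> List[Tuple[str, str]]:
    """Build (target, context) pairs for Skip-Gram training.

    Each token contributes two targets: the word itself and the pseudo
    word of its topic tag. Both are paired with the neighbouring words.
    """
    pairs: List[Tuple[str, str]] = []
    for i, (word, topic) in enumerate(tagged):
        start = max(0, i - window)
        stop = min(len(tagged), i + window + 1)
        for j in range(start, stop):
            if j == i:
                continue  # a token is not its own context
            context = tagged[j][0]
            pairs.append((word, context))
            pairs.append((topic_pseudo_word(topic), context))
    return pairs


def concat_embedding(
    word: str,
    topic: int,
    word_vecs: Dict[str, List[float]],
    topic_vecs: Dict[str, List[float]],
) -> List[float]:
    """Concatenate a word vector with the vector of its topic pseudo word."""
    return list(word_vecs[word]) + list(topic_vecs[topic_pseudo_word(topic)])


# Toy input: (word, topic tag) pairs. The tags are assumed, not HDP output.
tagged_text = [("bank", 0), ("river", 0), ("water", 0)]
pairs = skipgram_pairs(tagged_text, window=1)

# Toy vectors standing in for trained Skip-Gram embeddings.
word_vecs = {"bank": [0.1, 0.2]}
topic_vecs = {"<topic_0>": [0.3]}
bank_topic_vec = concat_embedding("bank", 0, word_vecs, topic_vecs)
```

Under this sketch the same word receives a different final vector for each topic it is tagged with, which is how the concatenation separates the senses of a polysemous word.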
 

Key words: HDP topic model, topic word embeddings, Skip-Gram model