• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)

• Artificial Intelligence and Data Mining

基于HDP的主题词向量构造——以柬语为例

李超1,2,严馨1,2,谢俊1,2,徐广义3,周枫1,2,莫源源4,5   

  1. (1.昆明理工大学信息工程与自动化学院,云南 昆明 650504;2.昆明理工大学云南省人工智能重点实验室,云南 昆明 650504;
    3.云南南天电子信息产业股份有限公司,云南 昆明 650400;4.云南民族大学东南亚南亚语言文化学院,云南 昆明 650500;
    5.上海师范大学语言研究所,上海 200234)
  • 收稿日期:2019-07-13 修回日期:2019-12-11 出版日期:2020-06-25 发布日期:2020-06-25
  • Funding:

    National Natural Science Foundation of China (61462055, 61562049)

Construction of topic word embeddings based on HDP: Khmer as an example

LI Chao1,2, YAN Xin1,2, XIE Jun1,2, XU Guang-yi3, ZHOU Feng1,2, MO Yuan-yuan4,5

  (1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504;
    2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650504;
    3. Yunnan Nantian Electronic Information Industry Co., Ltd., Kunming 650400;
    4. School of Southeast & South Asia Languages and Culture, Yunnan Minzu University, Kunming 650500;
    5. Institute of Linguistics, Shanghai Normal University, Shanghai 200234, China)

     
  • Received: 2019-07-13  Revised: 2019-12-11  Online: 2020-06-25  Published: 2020-06-25

摘要:

针对单一词向量中存在的一词多义和一义多词的问题,以柬语为例提出了一种基于HDP主题模型的主题词向量的构造方法。在单一词向量基础上融入了主题信息,首先通过HDP主题模型得到单词主题标签,然后将其视为伪单词与单词一起输入Skip-Gram模型,同时训练出主题向量和词向量,最后将文本主题信息的主题向量与单词训练后得到的词向量进行级联,获得文本中每个词的主题词向量。与未融入主题信息的词向量模型相比,该方法在单词相似度和文本分类方面均取得了更好的效果,获取的主题词向量具有更多的语义信息。

关键词: HDP主题模型, 主题词向量, Skip-Gram模型

Abstract:

To address polysemy and synonymy in single word embeddings, this paper proposes a method for constructing topic word embeddings based on the HDP (Hierarchical Dirichlet Process) topic model, taking Khmer as an example. The method adds topic information to a single word embedding. First, a topic label for each word is obtained with the HDP topic model. Each label is then treated as a pseudo word and fed into the Skip-Gram model together with the words, so that topic embeddings and word embeddings are trained simultaneously. Finally, the topic embedding, which carries the topic information of the text, is concatenated with the trained word embedding to give the topic word embedding of each word in the text. Compared with a word embedding model that does not incorporate topic information, the method achieves better results on word similarity and text classification, and the resulting topic word embeddings carry more semantic information.
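
The pseudo-word and concatenation steps can be illustrated with a minimal Python sketch. It assumes topic labels from an HDP model are already available for each token, and it does not train anything: the Skip-Gram training itself is omitted, and the vectors in the demo are made-up toy values. The names `topic_token`, `skipgram_pairs`, and `topic_word_embedding`, the `<topic_k>` naming scheme, and the choice to let each topic pseudo word predict the same context words as its word are all illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def topic_token(topic_id: int) -> str:
    # Hypothetical naming scheme for the pseudo word that stands for a topic.
    return f"<topic_{topic_id}>"


def skipgram_pairs(
    words: List[str], topics: List[int], window: int = 2
) -> List[Tuple[str, str]]:
    """Build (center, context) pairs for Skip-Gram training.

    Assumption: at every position, both the word and its topic pseudo word
    act as centers that predict the surrounding context words.
    """
    if len(words) != len(topics):
        raise ValueError("words and topics must have the same length")
    pairs: List[Tuple[str, str]] = []
    for i, (word, topic) in enumerate(zip(words, topics)):
        start = max(0, i - window)
        stop = min(len(words), i + window + 1)
        for j in range(start, stop):
            if j == i:
                continue
            pairs.append((word, words[j]))
            pairs.append((topic_token(topic), words[j]))
    return pairs


def topic_word_embedding(
    word: str,
    topic: int,
    word_vecs: Dict[str, Vector],
    topic_vecs: Dict[str, Vector],
) -> Vector:
    # Concatenate the word vector with the vector of its topic pseudo word.
    return word_vecs[word] + topic_vecs[topic_token(topic)]


# Toy data (English stand-ins, not Khmer; topic labels are invented).
words = ["bank", "river", "water"]
topics = [1, 1, 0]
pairs = skipgram_pairs(words, topics, window=1)

# Toy vectors standing in for the output of Skip-Gram training.
word_vecs = {"bank": [0.1, 0.2]}
topic_vecs = {"<topic_0>": [0.9, 0.8], "<topic_1>": [0.3, 0.4]}
print(topic_word_embedding("bank", 1, word_vecs, topic_vecs))  # [0.1, 0.2, 0.3, 0.4]
print(topic_word_embedding("bank", 0, word_vecs, topic_vecs))  # [0.1, 0.2, 0.9, 0.8]
```

In this sketch the same word receives a different final vector under each topic label, which is how concatenation lets one surface form carry several senses.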
 

Key words: HDP topic model, topic word embeddings, Skip-Gram model