Distributed representation of Chinese and Thai words based on cross-lingual corpus

J4 ›› 2015, Vol. 37 ›› Issue (12): 2358-2365.

ZHANG Jinpeng1,2, ZHOU Lanjiang1,2, XIAN Yantuan1,2, YU Zhengtao1,2, HE Silan3

(1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500;
2. The Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500;
3. School of Science, Kunming University of Science and Technology, Kunming 650500, China)

Received: 2015-08-20 Revised: 2015-10-17 Online: 2015-12-25 Published: 2015-12-25

Funding:
Supported by the National Natural Science Foundation of China (61363044)

Abstract

Word representation is a fundamental research problem in natural language processing. Distributed representations of monolingual words already perform well on a number of natural language processing tasks, but distributed representations of cross-lingual words have received little attention either in China or abroad. To address this gap, we exploit the similarity between the distributions of nouns and verbs in Chinese and Thai: using weakly supervised learning extension and related techniques, we embed Thai translation equivalents, same-category words, hypernyms, and similar items into a Chinese corpus, and thereby learn the distribution of Thai words in a Chinese-Thai cross-lingual environment. We apply the learned cross-lingual word representations to bilingual text similarity computation and to text classification on a mixed Chinese-Thai corpus, and obtain good results on both tasks.

Key words: weakly supervised learning extension; cross-lingual corpus; cross-lingual distributed word representation; neural probabilistic language model
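
The corpus-embedding idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' actual procedure, so everything here is an assumption made for illustration: the function name `mix_corpus`, the `replace_prob` parameter, the toy sentences, and the toy Chinese-to-Thai dictionary are all hypothetical. The sketch only shows the general step of substituting Thai counterparts into tokenized Chinese sentences so that Thai words appear in Chinese contexts.

```python
import random


def mix_corpus(sentences, zh_to_th, replace_prob=0.5, seed=0):
    """Return a copy of tokenized Chinese sentences in which words found in
    zh_to_th are swapped for one of their Thai counterparts with probability
    replace_prob. (Hypothetical helper, not the paper's implementation.)"""
    rng = random.Random(seed)
    mixed = []
    for tokens in sentences:
        out = []
        for tok in tokens:
            candidates = zh_to_th.get(tok)
            # Only words with a dictionary entry are eligible for replacement.
            if candidates and rng.random() < replace_prob:
                out.append(rng.choice(candidates))
            else:
                out.append(tok)
        mixed.append(out)
    return mixed


# Toy, made-up data: two tokenized Chinese sentences and a small dictionary
# mapping Chinese words to Thai translation equivalents.
sentences = [["我", "喜欢", "猫"], ["狗", "喜欢", "水"]]
zh_to_th = {"猫": ["แมว"], "狗": ["สุนัข", "หมา"], "水": ["น้ำ"]}

# With replace_prob=1.0 every dictionary word is replaced.
mixed = mix_corpus(sentences, zh_to_th, replace_prob=1.0)
for tokens in mixed:
    print(" ".join(tokens))
```

A mixed corpus of this kind could then be passed to any word-embedding trainer, so that Chinese and Thai words sharing contexts receive nearby vectors. How the paper selects same-category words and hypernyms, and which model it trains, is not stated in the abstract and is not reproduced here.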

张金鹏，周兰江，线岩团，余正涛，何思兰. 基于跨语言语料的汉泰词分布表示[J]. J4, 2015, 37(12): 2358-2365.

ZHANG Jinpeng1,2,ZHOU Lanjiang1,2,XIAN Yantuan1,2,YU Zhengtao1,2,HE Silan3. Distributed representation of Chinese and Thai
words based on cross-lingual corpus [J]. J4, 2015, 37(12): 2358-2365.