• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (12): 2358-2365.

• Articles •

Distributed representation of Chinese and Thai words based on cross-lingual corpus

ZHANG Jinpeng1,2,ZHOU Lanjiang1,2,XIAN Yantuan1,2,YU Zhengtao1,2,HE Silan3   

  1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
  2. The Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, China
  3. School of Science, Kunming University of Science and Technology, Kunming 650500, China
  • Received: 2015-08-20 Revised: 2015-10-17 Online: 2015-12-25 Published: 2015-12-25

Abstract:

Word representation is a fundamental research topic in natural language processing. Distributed representations of monolingual words have performed well in neural probabilistic language (NPL) model research, but distributed representations of cross-lingual words have received little attention, either in China or abroad. To address this gap, and given that nouns and verbs are distributed similarly in Chinese and Thai, we embed mutually translated words, synonyms, and superordinates into a Chinese corpus using a weakly supervised learning extension approach, among other methods, and thereby learn distributed representations of Thai words in a Chinese-Thai cross-lingual setting. We then apply the learned cross-lingual word representations to two tasks: computing the similarity of bilingual texts and classifying a mixed Chinese-Thai text corpus. Experimental results show that the proposed method performs satisfactorily on both tasks.

Key words: weakly supervised learning extension; cross-lingual corpus; cross-lingual word distribution representations; neural probabilistic language model
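
The abstract does not spell out how words are embedded into the Chinese corpus or how bilingual text similarity is computed, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes (1) that Thai translations are substituted for some Chinese tokens so that both words occur in the same contexts, and (2) that text similarity is the cosine of averaged word vectors. The function names, the `rate` parameter, and the toy vectors are all hypothetical.

```python
import math
import random


def mix_corpus(sentences, zh_to_th, rate=0.5, seed=0):
    """Replace some Chinese tokens with their Thai translation.

    Assumption: sharing contexts this way lets a language model place
    translation pairs close together. `rate` is a hypothetical knob.
    """
    rng = random.Random(seed)
    return [
        [zh_to_th[tok] if tok in zh_to_th and rng.random() < rate else tok
         for tok in sent]
        for sent in sentences
    ]


def average_vector(tokens, vectors):
    """Mean of the vectors of all known tokens, or None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def text_similarity(tokens_a, tokens_b, vectors):
    """Similarity of two tokenized texts in a shared vector space."""
    vec_a = average_vector(tokens_a, vectors)
    vec_b = average_vector(tokens_b, vectors)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


# Toy data, invented for illustration only.
zh_to_th = {"猫": "แมว", "狗": "หมา"}
toy_vectors = {"猫": [1.0, 0.0], "แมว": [1.0, 0.0], "跑": [0.0, 1.0]}
mixed = mix_corpus([["猫", "跑"], ["狗", "跑"]], zh_to_th, rate=1.0)
score = text_similarity(["猫"], ["แมว"], toy_vectors)
```

In practice the vectors would come from a model trained on the mixed corpus; the toy vectors here only show how a Chinese text and a Thai text can be compared once both languages share one vector space.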