• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

Bilingual lexicon extraction based on
word vector and comparable corpus

LIU Lu-fang¹, LI Bo¹, CHEN Peng¹, ZHOU Ling-han¹, WANG Bing²

  (1. School of Computer Science, Central China Normal University, Wuhan 430079;
   2. Beijing GEOWAY Software Co., Ltd., Beijing 100043, China)

     
  • Received: 2017-08-10 Revised: 2017-10-11 Online: 2018-02-25 Published: 2018-02-25

Abstract:

Bilingual lexicons are an important resource in natural language processing applications such as cross-language information retrieval and machine translation. Existing algorithms for extracting bilingual lexicons from comparable corpora are not yet mature, their extraction performance needs improvement, and most research focuses on extracting technical terms in specific domains. To address these shortcomings, this paper proposes a bilingual lexicon extraction algorithm based on word vectors and comparable corpora. First, the basic assumptions of the algorithm and the related research methods are presented. Second, the concrete steps for extracting a bilingual lexicon from the corpus using word vectors are described. Finally, the method is compared with the traditional vector space model, and the effects of context window size, seed dictionary size, word frequency, and other factors on the extraction performance of the two models are analyzed experimentally. The experimental results show that, compared with the method based on the vector space model, the proposed algorithm clearly improves extraction performance, especially for high-frequency words.
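
The abstract does not spell out the extraction steps, so the following is only a minimal sketch of one common way to combine word vectors with a seed dictionary, not the authors' actual algorithm. It assumes that each language has its own independently trained embedding space, and that a word can be described by its cosine similarity to every seed word in its own language; source and target words are then matched by comparing these similarity profiles. The function names (`seed_profile`, `extract_lexicon`), the romanized toy words, and the two-dimensional vectors are all hypothetical and chosen purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def seed_profile(word, vectors, seeds):
    """Describe a word by its similarity to each seed word of the same language."""
    return [cosine(vectors[word], vectors[s]) for s in seeds]


def extract_lexicon(src_vectors, tgt_vectors, seed_pairs, top_k=1):
    """Rank target-language candidates for every non-seed source word.

    seed_pairs is a list of (source_word, target_word) translations, so the
    i-th entry of a source profile and of a target profile refer to the same
    concept and the two profiles can be compared directly.
    """
    src_seeds = [s for s, _ in seed_pairs]
    tgt_seeds = [t for _, t in seed_pairs]
    src_seed_set, tgt_seed_set = set(src_seeds), set(tgt_seeds)

    # Precompute the profile of every target word that is not itself a seed.
    tgt_profiles = {
        w: seed_profile(w, tgt_vectors, tgt_seeds)
        for w in tgt_vectors
        if w not in tgt_seed_set
    }

    lexicon = {}
    for word in src_vectors:
        if word in src_seed_set:
            continue
        profile = seed_profile(word, src_vectors, src_seeds)
        ranked = sorted(
            ((t, cosine(profile, q)) for t, q in tgt_profiles.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )
        lexicon[word] = ranked[:top_k]
    return lexicon


# Toy data (assumed): the two spaces use different axes on purpose, because
# independently trained embeddings are not directly comparable.
src_vectors = {
    "gou": [1.0, 0.1],  # "dog" (seed)
    "shu": [0.1, 1.0],  # "book" (seed)
    "mao": [0.9, 0.2],  # "cat"
    "zhi": [0.2, 0.9],  # "paper"
}
tgt_vectors = {
    "dog": [0.1, 1.0],
    "book": [1.0, 0.1],
    "cat": [0.2, 0.9],
    "paper": [0.9, 0.2],
}
seed_pairs = [("gou", "dog"), ("shu", "book")]

lexicon = extract_lexicon(src_vectors, tgt_vectors, seed_pairs)
for source_word, candidates in lexicon.items():
    print(source_word, "->", candidates[0][0])
```

In this sketch the seed dictionary size sets the length of each profile, which is one way to see why the abstract treats it as a factor affecting extraction performance; the context window size would come into play earlier, when the word vectors themselves are trained.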
 

Key words: bilingual lexicon, word vector, words' correlation, comparable corpus