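• Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science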

• Article •

Bilingual lexicon extraction based on word vector and comparable corpus

LIU Lu-fang1, LI Bo1, CHEN Peng1, ZHOU Ling-han1, WANG Bing2

  1. School of Computer Science, Central China Normal University, Wuhan 430079, Hubei, China
  2. Beijing GEOWAY Software Co., Ltd., Beijing 100043, China

  • Received: 2017-08-10 Revised: 2017-10-11 Online: 2018-02-25 Published: 2018-02-25
  • Supported by:

    Twelfth Five-Year Plan Project of the State Language Commission (YB125-132); Fundamental Research Funds for the Central Universities (CCNU15A05062, CCNU17GF0005, CCNU16A06015)

Abstract:

Bilingual lexicons are an important resource for natural language processing applications such as cross-language information retrieval and machine translation. Existing algorithms for extracting bilingual lexicons from comparable corpora are not yet mature: their extraction quality leaves room for improvement, and most studies focus on domain-specific terminology. To address these shortcomings, this paper proposes a bilingual lexicon extraction algorithm based on word vectors and comparable corpora. It first presents the basic assumptions of the algorithm and the related research methods, then describes the concrete steps for extracting a bilingual lexicon from a comparable corpus using a word-relation matrix built on word vectors. Finally, the method is compared with the classical vector space model, and experiments analyze how context window size, seed dictionary size, word frequency, and other factors affect the extraction performance of the two models. The results show that the proposed algorithm clearly outperforms the approach based on the vector space model, with the largest accuracy gain on high-frequency words.

Key words: bilingual lexicon, word vector, word correlation, comparable corpus
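
The abstract only names the ingredients of the method (word vectors, a seed dictionary, a word-relation matrix), so the following is a minimal illustrative sketch of that general idea and not the authors' implementation. It assumes that each word is described by its cosine similarities to the seed words of its own language, and that translation candidates are ranked by the cosine similarity of those relation vectors. The function names, the toy vectors, and the rotation used to simulate a second embedding space are all hypothetical.

```python
import numpy as np


def unit_rows(m):
    """Scale each row to unit length so that dot products are cosine similarities."""
    m = np.asarray(m, dtype=float)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.clip(norms, 1e-12, None)


def relation_matrix(word_vecs, seed_vecs):
    """Row i holds the cosine similarity of word i to every seed word (same language)."""
    return unit_rows(word_vecs) @ unit_rows(seed_vecs).T


def rank_translations(src_vecs, tgt_vecs, src_seed_vecs, tgt_seed_vecs, top_k=1):
    """Return, for each source word, the indices of the top_k target candidates.

    Assumption: column j of both relation matrices refers to the same seed
    dictionary entry (source seed j translates to target seed j), so relation
    vectors are comparable across languages even though the raw embeddings are not.
    """
    src_rel = unit_rows(relation_matrix(src_vecs, src_seed_vecs))
    tgt_rel = unit_rows(relation_matrix(tgt_vecs, tgt_seed_vecs))
    scores = src_rel @ tgt_rel.T  # cosine similarity between relation vectors
    return np.argsort(-scores, axis=1)[:, :top_k]


# Toy data (hypothetical): the target space is a 90-degree rotation of the
# source space, standing in for two independently trained monolingual models.
rotation = np.array([[0.0, -1.0], [1.0, 0.0]])
src_seeds = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt_seeds = src_seeds @ rotation.T
src_words = np.array([[1.0, 0.1], [0.1, 1.0]])
# Target words are the rotated source words, listed in swapped order.
tgt_words = src_words[::-1] @ rotation.T

best = rank_translations(src_words, tgt_words, src_seeds, tgt_seeds)
print(best.tolist())
```

Comparing relation vectors rather than raw embeddings is what lets the two monolingual spaces differ arbitrarily; only the seed dictionary ties them together, which is consistent with the abstract's interest in how seed dictionary size affects extraction quality.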