Bilingual lexicon extraction based on word vector and comparable corpus

Computer Engineering & Science

LIU Lu-fang1, LI Bo1, CHEN Peng1, ZHOU Ling-han1, WANG Bing2

(1. School of Computer Science, Central China Normal University, Wuhan, Hubei 430079, China; 2. Beijing GEOWAY Software Co., Ltd., Beijing 100043, China)

Received: 2017-08-10 Revised: 2017-10-11 Online: 2018-02-25 Published: 2018-02-25

Supported by: State Language Commission 12th Five-Year Plan Project (YB125-132); Fundamental Research Funds for the Central Universities (CCNU15A05062, CCNU17GF0005, CCNU16A06015)

Abstract:

Bilingual lexicons are an important resource in natural language processing applications such as cross-language information retrieval and machine translation. Existing algorithms for extracting bilingual lexicons from comparable corpora are not yet mature, their extraction quality leaves room for improvement, and most studies concentrate on extracting domain-specific technical terms. To address these shortcomings, this paper proposes a bilingual lexicon extraction algorithm based on word vectors and comparable corpora. It first states the basic assumption of the algorithm and the related research methods, then describes the concrete steps for extracting a bilingual lexicon from a comparable corpus using word vectors and an inter-word relation matrix. Finally, the method is compared with the classical vector space model, and experiments analyze how context window size, seed dictionary size, word frequency, and other factors affect the extraction performance of the two models. The results show that, compared with the method based on the vector space model, the proposed algorithm clearly improves extraction performance, with the largest gain in accuracy on high-frequency words.

Key words: bilingual lexicon, word vector, words' correlation, comparable corpus
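
The abstract does not spell out how the inter-word relation matrix is built, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each word is described by the cosine similarities between its word vector and the vectors of the seed-dictionary words in the same language, and that a source word and a target word are likely translations when these relation rows are similar. All function names, the toy words, and the toy vectors are hypothetical and do not come from the paper.

```python
import numpy as np

def unit_rows(m):
    """Scale each row to unit L2 norm (all-zero rows are left as zeros)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.where(norms == 0, 1.0, norms)

def relation_matrix(word_vecs, seed_vecs):
    """Cosine similarity of every word (rows) to every seed word (columns)."""
    return unit_rows(word_vecs) @ unit_rows(seed_vecs).T

def extract_lexicon(src_words, src_vecs, tgt_words, tgt_vecs, seed_pairs, top_k=1):
    """Rank target-language candidates for each source word.

    seed_pairs is a list of (source_index, target_index) seed translations.
    """
    src_seed = src_vecs[[s for s, _ in seed_pairs]]
    tgt_seed = tgt_vecs[[t for _, t in seed_pairs]]
    # Column j of both relation matrices refers to the same seed entry,
    # so rows are comparable across the two languages.
    src_rel = unit_rows(relation_matrix(src_vecs, src_seed))
    tgt_rel = unit_rows(relation_matrix(tgt_vecs, tgt_seed))
    scores = src_rel @ tgt_rel.T  # cosine between relation rows
    ranked = np.argsort(-scores, axis=1)[:, :top_k]
    return {src_words[i]: [tgt_words[j] for j in ranked[i]]
            for i in range(len(src_words))}

# Toy data (invented): the target space is a rotated, reordered copy of the
# source space, standing in for two independently trained embedding spaces.
src_words = ["mao", "gou", "qiche", "xiaomao"]
src_vecs = np.array([[1.0, 0.1, 0.0],
                     [0.8, 0.6, 0.0],
                     [0.0, 0.1, 1.0],
                     [0.9, 0.0, 0.2]])
c, s = np.cos(0.7), np.sin(0.7)
rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
tgt_words = ["car", "kitten", "cat", "dog"]
tgt_vecs = src_vecs[[2, 3, 0, 1]] @ rotation

# Seed dictionary: mao -> cat and qiche -> car.
seed_pairs = [(0, 2), (2, 0)]
lexicon = extract_lexicon(src_words, src_vecs, tgt_words, tgt_vecs, seed_pairs)
print(lexicon)
```

Because only similarities to seed words are compared, the two embedding spaces never need to be aligned directly; in this reading, the seed dictionary size studied in the experiments corresponds to the number of columns in the relation matrix.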

LIU Lu-fang, LI Bo, CHEN Peng, ZHOU Ling-han, WANG Bing. Bilingual lexicon extraction based on word vector and comparable corpus[J]. Computer Engineering & Science.
