Computer Engineering & Science (计算机工程与科学) ›› 2021, Vol. 43 ›› Issue (08): 1497-1502.

• Artificial Intelligence and Data Mining •

A Chinese-Vietnamese neural machine translation method based on synonym data augmentation

YOU Cong-cong (1,2), GAO Sheng-xiang (1,2), YU Zheng-tao (1,2), MAO Cun-li (1,2), PAN Run-hai (1,2)

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
  2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China

  • Received: 2020-02-18 Revised: 2020-07-12 Accepted: 2021-08-25 Online: 2021-08-25 Published: 2021-08-24
  • Supported by:
    National Key Research and Development Program of China (2019QY1801, 2019QY1802, 2019QY1800); National Natural Science Foundation of China (61761026, 61972186, 61732005, 61672271, 61762056); Yunnan High-Tech Industry Special Project (201606); Natural Science Foundation of Yunnan Province (2018FB104); Kunming University of Science and Technology provincial-level talent training project (KKSY201703005)

Abstract: The scarcity of Chinese-Vietnamese parallel corpora significantly limits the quality of Chinese-Vietnamese machine translation. Data augmentation is an effective way to improve it, and lexical replacement based on bilingual dictionaries is currently one of the more popular augmentation methods. Because Chinese-Vietnamese is a low-resource language pair, bilingual dictionaries are hard to obtain, whereas synonyms of low-frequency words are comparatively easy to obtain from monolingual word vectors. We therefore propose a data augmentation method based on synonym replacement of low-frequency words. Starting from a small parallel corpus, the method first learns monolingual word vectors to obtain a synonym list for the low-frequency words in the language on one side. It then replaces the low-frequency words with their synonyms and uses a language model to filter the resulting sentences. Finally, the filtered sentences are paired with the corresponding sentences in the language on the other side to obtain an expanded parallel corpus. In Chinese-Vietnamese translation experiments, the model trained on the expanded corpus improves BLEU by 1.8 over the baseline and by 1.1 over back-translation.
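The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the embedding model, the language model, or any thresholds, so the cosine-similarity lookup, the add-one-smoothed bigram model, and the names `synonym_table`, `BigramLM`, `augment`, `max_freq`, `min_sim`, and `margin` are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def synonym_table(vectors, rare_words, top_k=1, min_sim=0.8):
    """Map each rare word to its nearest neighbours in embedding space.

    `vectors` is a plain dict word -> list of floats (assumed to come from
    monolingual word-vector training, which is not shown here).
    """
    table = {}
    for word in rare_words:
        if word not in vectors:
            continue
        scored = sorted(
            ((cosine(vectors[word], vec), other)
             for other, vec in vectors.items() if other != word),
            reverse=True,
        )
        table[word] = [other for sim, other in scored[:top_k] if sim >= min_sim]
    return table


class BigramLM:
    """Add-one-smoothed bigram model, a stand-in for the filtering language model."""

    def __init__(self, sentences):
        self.contexts = Counter()
        self.bigrams = Counter()
        vocab = {"<s>"}
        for sent in sentences:
            tokens = ["<s>"] + list(sent)
            vocab.update(tokens)
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab)

    def avg_logprob(self, sent):
        """Average per-token log-probability of a tokenized sentence."""
        tokens = ["<s>"] + list(sent)
        total = sum(
            math.log((self.bigrams[(p, c)] + 1) / (self.contexts[p] + self.vocab_size))
            for p, c in zip(tokens, tokens[1:])
        )
        return total / max(len(sent), 1)


def augment(pairs, vectors, lm, max_freq=1, margin=0.5):
    """Return new (source, target) pairs built by synonym replacement.

    A source word is treated as low-frequency if it occurs at most `max_freq`
    times. A candidate is kept only if its language-model score is no more
    than `margin` below that of the original sentence (an assumed criterion).
    """
    counts = Counter(tok for src, _ in pairs for tok in src)
    rare = {w for w, c in counts.items() if c <= max_freq}
    synonyms = synonym_table(vectors, rare)
    new_pairs = []
    for src, tgt in pairs:
        base = lm.avg_logprob(src)
        for i, tok in enumerate(src):
            for syn in synonyms.get(tok, []):
                candidate = src[:i] + [syn] + src[i + 1:]
                if lm.avg_logprob(candidate) >= base - margin:
                    # The target sentence is reused unchanged.
                    new_pairs.append((candidate, tgt))
    return new_pairs
```

Each accepted candidate is paired with the original target sentence, which mirrors the final matching step in the abstract. In practice the toy bigram model and dictionary of vectors would be replaced by models trained on much larger monolingual data.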


Key words: Chinese-Vietnamese, data augmentation, synonym substitution, neural machine translation