• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (10): 1861-1868.

• Artificial Intelligence and Data Mining •

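A Chinese-Vietnamese parallel corpus expansion method based on back translation and proportional extraction siamese network screening

WANG Kechao1, GUO Junjun1,2, ZHANG Ya-fei1,2, GAO Sheng-xiang1,2, YU Zheng-tao1,2

  (1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China;
   2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China)
  • Received: 2020-12-07 Revised: 2021-02-23 Accepted: 2022-10-25 Online: 2022-10-25 Published: 2022-10-28
  • Funding:
    National Natural Science Foundation of China (61732005, 61761026, 61866020, 61672271, 61762056, 61972186); National Key Research and Development Program of China (2019QY1801, 2019QY1802, 2019QY1800)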

A ChineseVietnamese parallel corpus expansion method based on back translation and proportional extraction siamese network screening

WANG Kechao1,GUO Junjun1,2,ZHANG Ya-fei1,2,GAO Sheng-xiang1,2,YU Zheng-tao1,2   

  1. (1.Faculty of Information Engineering and Automation,Kunming University of Science and Technology,Kunming 650500;
    2.Yunnan Key Laboratory of Artificial Intelligence,Kunming University of Science and Technology,Kunming 650500,China)Abstract:As an important data enhancement method in translation, back translation has attracted more and more researchers attentions. The basic idea is to first train a basic translation model based on parallel corpus, then use the model to translate monolingual corpus into the target language, and combine it into a new corpus for model training. However, in the Chinese-Vietnamese low-resource scenario, the performance of the basic translation model obtained by training is poor, which results in the parallel corpus obtained by applying the back translation method on it contains more noise and is difficult to use for downstream tasks. In response to this problem, a siamese network screening model based on proportional extraction is constructed. Through training, the model can identify parallel sentence pairs and pseudo-parallel sentence pairs, and filter and denoise the pseudo-parallel corpus obtained by back translation in the same semantic space, thereby obtaining a better parallel corpus. The test results on the Chinese-Vietnamese data set show that the proposed method significantly outperforms the baseline system.
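The general flow described in the abstract (back-translate a monolingual corpus, then screen the resulting pseudo-parallel pairs in a shared semantic space) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `make_pseudo_pairs`, `screen`, and `expand_corpus`, the `translate` and `encode` callables, and the cosine-similarity threshold are all assumptions made for illustration. In particular, `encode` stands in for a trained siamese encoder applied with the same weights to both sides of a pair, and the proportional extraction step is not modeled because the abstract does not specify it.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def make_pseudo_pairs(mono: List[str], translate: Callable[[str], str]) -> List[Pair]:
    """Pair each monolingual sentence with its machine translation.

    `translate` is a hypothetical stand-in for the base translation model.
    """
    return [(sentence, translate(sentence)) for sentence in mono]


def screen(pairs: List[Pair], encode: Callable[[str], Vector],
           threshold: float = 0.8) -> List[Pair]:
    """Keep pairs whose two sides lie close together in the shared space.

    The same `encode` function is applied to both sides, mirroring the
    shared-weight idea of a siamese network. The threshold is an assumed
    hyperparameter, not a value from the paper.
    """
    return [(src, tgt) for src, tgt in pairs
            if cosine(encode(src), encode(tgt)) >= threshold]


def expand_corpus(parallel: List[Pair], mono: List[str],
                  translate: Callable[[str], str],
                  encode: Callable[[str], Vector],
                  threshold: float = 0.8) -> List[Pair]:
    """Original parallel corpus plus the screened pseudo-parallel pairs."""
    pseudo = make_pseudo_pairs(mono, translate)
    return list(parallel) + screen(pseudo, encode, threshold)
```

In practice `translate` would wrap a trained translation model and `encode` a trained sentence encoder; passing them in as plain callables keeps the sketch independent of any particular toolkit.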




Key words: Chinese-Vietnamese parallel corpus expansion; back translation; data augmentation; proportional extraction; siamese network
