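A Chinese-Vietnamese parallel corpus expansion method based on back translation and proportional extraction Siamese network screening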

Computer Engineering & Science, 2022, Vol. 44, Issue 10: 1861-1868.


WANG Kechao1, GUO Junjun1,2, ZHANG Ya-fei1,2, GAO Sheng-xiang1,2, YU Zheng-tao1,2

(1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China;
2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China)

Received: 2020-12-07 Revised: 2021-02-23 Online: 2022-10-25 Published: 2022-10-28
Funding:
National Natural Science Foundation of China (61732005, 61761026, 61866020, 61672271, 61762056, 61972186); National Key Research and Development Program of China (2019QY1801, 2019QY1802, 2019QY1800)

Abstract

Abstract: As an important data augmentation method in machine translation, back translation has attracted growing attention from researchers. The basic idea is to first train a base translation model on a parallel corpus, then use that model to translate a monolingual corpus into the target language, and finally combine the results into a new corpus for model training. In the Chinese-Vietnamese low-resource scenario, however, the trained base translation model performs poorly, so the parallel corpus obtained by applying back translation with it contains considerable noise and is difficult to use in downstream tasks. To address this problem, a Siamese network screening model based on proportional extraction is constructed. Through training, the model learns to distinguish parallel sentence pairs from pseudo-parallel sentence pairs, and it filters and denoises the back-translated pseudo-parallel corpus in a shared semantic space, thereby yielding a better parallel corpus. Experimental results on a Chinese-Vietnamese dataset show that models trained with the proposed method significantly outperform the baseline model.
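The pipeline described in the abstract (back-translate a monolingual corpus, then screen the resulting pairs in a shared semantic space) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `translate` stands in for the base translation model, `encode` stands in for a single encoder applied to both sides of a pair (the weight sharing that makes a network Siamese), and `keep_ratio` is a simple top-fraction cutoff. It is not the paper's trained screening model, nor its definition of proportional extraction, which the abstract does not specify.

```python
import math
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def back_translate(mono_sentences: List[str],
                   translate: Callable[[str], str]) -> List[Pair]:
    """Pair each monolingual sentence with its machine translation.

    `translate` is a hypothetical stand-in for the base translation model.
    """
    return [(s, translate(s)) for s in mono_sentences]


def screen_pairs(pairs: List[Pair],
                 encode: Callable[[str], Sequence[float]],
                 keep_ratio: float) -> List[Pair]:
    """Keep the top `keep_ratio` fraction of pairs by cosine similarity.

    The same `encode` function embeds both sentences of a pair, so both
    land in one vector space. `keep_ratio` is an assumption of this sketch.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not pairs:
        return []
    ranked = sorted(pairs,
                    key=lambda p: cosine(encode(p[0]), encode(p[1])),
                    reverse=True)
    # Number of pairs kept is rounded down, but at least one is retained.
    n_keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:n_keep]


if __name__ == "__main__":
    # Toy embeddings: a lookup table plays the role of a trained encoder.
    toy_vectors = {
        "src_good": [1.0, 0.0], "tgt_good": [0.9, 0.1],
        "src_bad": [1.0, 0.0], "tgt_bad": [0.0, 1.0],
    }
    candidates = [("src_bad", "tgt_bad"), ("src_good", "tgt_good")]
    print(screen_pairs(candidates, toy_vectors.__getitem__, keep_ratio=0.5))
```

In practice the lookup table would be replaced by a trained sentence encoder, and the cutoff would be tuned on held-out parallel data. The sketch only shows where the screening step sits relative to back translation.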

Key words: Chinese-Vietnamese parallel corpus expansion; back translation; data augmentation; proportional extraction; Siamese network

WANG Kechao, GUO Junjun, ZHANG Ya-fei, GAO Sheng-xiang, YU Zheng-tao. A Chinese-Vietnamese parallel corpus expansion method based on back translation and proportional extraction Siamese network screening[J]. Computer Engineering & Science, 2022, 44(10): 1861-1868.
