Unsupervised domain-adapted machine translation based on improving the quality of pseudo-parallel sentence pairs

Computer Engineering & Science, 2022, Vol. 44, Issue (12): 2230-2237.

XIAO Ni-ni, JIN Chang, DUAN Xiang-yu

(Natural Language Processing Laboratory, School of Computer Science and Technology, Soochow University, Suzhou 215006, China)

Received: 2021-04-26 Revised: 2021-09-13 Online: 2022-12-25 Published: 2023-01-05

Abstract

Abstract: Neural machine translation systems perform well only when large-scale in-domain bilingual parallel data is available. When data for a specific domain is sparse or non-existent, domain adaptation is an effective remedy. Unsupervised domain adaptation methods fine-tune a pre-trained translation model on a constructed pseudo-parallel corpus. However, existing methods do not sufficiently consider linguistic properties such as semantics and sentiment, so the target-domain translations contain many errors and much noise, which degrades the cross-domain performance of the model. To alleviate this problem, this paper improves the quality of pseudo-parallel sentence pairs from both the model side and the data side, so as to strengthen the domain adaptation ability of the model. First, a more reasonable pre-training strategy is proposed to learn more general text representations from out-of-domain data, which enhances the generalization ability of the model and improves the accuracy of in-domain translations. Then, sentence-level sentiment information is incorporated for posterior filtering, which further improves the quality of the pseudo-parallel corpus. Experiments show that, compared with a strong back-translation baseline, the method improves BLEU by 1.25 and 1.38 on Chinese-English and English-Chinese translation, respectively, effectively improving translation quality.

Key words: neural network, neural machine translation, domain adaptation, model optimization, sentiment information
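The posterior filtering step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the sentiment model or the filtering criterion, so the toy lexicons, the names `polarity` and `filter_pairs`, and the "keep a pair only if both sides have the same polarity" rule are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy sentiment lexicons (assumption); a real system would
# more plausibly use a trained sentiment classifier for each language.
EN_LEXICON: Dict[str, int] = {"good": 1, "great": 1, "love": 1,
                              "bad": -1, "terrible": -1, "hate": -1}
ZH_LEXICON: Dict[str, int] = {"好": 1, "喜欢": 1, "棒": 1,
                              "差": -1, "糟糕": -1, "讨厌": -1}

Pair = Tuple[List[str], List[str]]  # (Chinese tokens, English tokens)


def polarity(tokens: Sequence[str], lexicon: Dict[str, int]) -> int:
    """Return -1, 0, or +1 from the summed lexicon scores of the tokens."""
    score = sum(lexicon.get(tok, 0) for tok in tokens)
    return (score > 0) - (score < 0)


def filter_pairs(pairs: Sequence[Pair]) -> List[Pair]:
    """Keep only pseudo-parallel pairs whose two sides agree in polarity.

    The agreement rule is an assumed criterion, not one taken from the paper.
    """
    kept = []
    for zh_tokens, en_tokens in pairs:
        if polarity(zh_tokens, ZH_LEXICON) == polarity(en_tokens, EN_LEXICON):
            kept.append((zh_tokens, en_tokens))
    return kept


# Toy pseudo-parallel pairs, e.g. as produced by back-translation.
pseudo_pairs: List[Pair] = [
    (["这", "部", "电影", "很", "好"], ["this", "movie", "is", "good"]),
    (["服务", "很", "糟糕"], ["the", "service", "is", "great"]),  # polarity flipped
    (["他", "去", "了", "北京"], ["he", "went", "to", "beijing"]),  # both neutral
]

kept_pairs = filter_pairs(pseudo_pairs)
print(len(kept_pairs), "of", len(pseudo_pairs), "pairs kept")
```

In this sketch the second pair is discarded because the translation flips the sentiment of the source sentence. Under this assumed design, the pairs that survive filtering would then be used to fine-tune the pre-trained translation model.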

XIAO Ni-ni, JIN Chang, DUAN Xiang-yu. Unsupervised domain-adapted machine translation based on improving the quality of pseudo-parallel sentence pairs[J]. Computer Engineering & Science, 2022, 44(12): 2230-2237.
