• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (4): 751-760.

• Artificial Intelligence and Data Mining •

Construction of Mongolian-Chinese pseudo-parallel corpus enhanced by noisy data

TIAN Yonghong, ZHANG Junjin, SONG Zheyu

  1. (College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010080, China)
  • Received: 2023-09-14 Revised: 2024-07-30 Online: 2025-04-25 Published: 2025-04-17
  • Funding:
    National Natural Science Foundation of China (62466043)

Abstract: Neural machine translation (NMT), the mainstream approach to machine translation, performs well on general translation tasks. However, its translation quality depends on large-scale parallel corpora, and for low-resource languages the shortage of such corpora is a major obstacle. Data augmentation can effectively mitigate data scarcity, so a pseudo-parallel corpus is constructed by introducing noisy data into back translation. First, the corpus is preprocessed. Second, back translation and back translation combined with noisy data are carried out. Third, text similarity matching is performed. Finally, plain back translation is compared with back translation combined with noisy data. Results on the experimental dataset show that back translation combined with noisy data effectively improves low-resource machine translation: its BLEU score is 1.10% higher than with back translation alone and 1.96% higher than without back translation.

Key words: data augmentation, noisy data, text similarity matching, corpus preprocessing
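
The abstract does not say which kinds of noise are injected, so the following is only a minimal sketch of back translation combined with noisy data, not the authors' implementation. It assumes three commonly used perturbations (word dropout, replacement by a filler token, and a local shuffle) applied to the synthetic source side. The names `add_noise`, `build_pseudo_parallel`, and the `back_translate` callable are hypothetical, and the text similarity matching step is not shown.

```python
import random


def add_noise(tokens, rng, p_drop=0.1, p_blank=0.1, max_shift=3, blank="<blank>"):
    """Return a noised copy of `tokens` (assumed noise types, not the paper's)."""
    # 1. Word dropout: remove each token with probability p_drop.
    kept = [t for t in tokens if rng.random() >= p_drop]
    if not kept and tokens:
        kept = [tokens[0]]  # never return an empty sentence for non-empty input
    # 2. Blanking: replace each remaining token with a filler with probability p_blank.
    blanked = [blank if rng.random() < p_blank else t for t in kept]
    # 3. Local shuffle: sort by position plus a random offset in [0, max_shift],
    #    so no token moves more than max_shift positions.
    keys = [i + rng.uniform(0, max_shift) for i in range(len(blanked))]
    order = sorted(range(len(blanked)), key=lambda i: keys[i])
    return [blanked[i] for i in order]


def build_pseudo_parallel(target_sentences, back_translate, rng):
    """Pair each monolingual target sentence with a noised synthetic source.

    `back_translate` is a hypothetical callable standing in for a trained
    target-to-source model; it maps a token list to a token list.
    """
    pairs = []
    for tgt in target_sentences:
        synthetic_src = back_translate(tgt)
        pairs.append((add_noise(synthetic_src, rng), tgt))
    return pairs


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder "model": tags each token; a real system would call an NMT model.
    toy_model = lambda toks: ["src_" + t for t in toks]
    corpus = [["a", "b", "c", "d"], ["e", "f"]]
    for src, tgt in build_pseudo_parallel(corpus, toy_model, rng):
        print(src, "|||", tgt)
```

Only the synthetic source side is perturbed here, so the human-written target sentences stay clean; whether the paper follows the same design is not stated in the abstract.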