• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science, 2025, Vol. 47, Issue (4): 751-760.

• Artificial Intelligence and Data Mining •

Construction of Mongolian-Chinese pseudo-parallel corpus enhanced by noisy data

TIAN Yonghong, ZHANG Junjin, SONG Zheyu

  1. (College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010080, China)
  • Received: 2023-09-14 Revised: 2024-07-30 Online: 2025-04-25 Published: 2025-04-17

Abstract: Neural machine translation (NMT), the mainstream approach to machine translation, performs well on general translation tasks, but its translation quality relies heavily on large-scale parallel corpora. For low-resource languages, the scarcity of corpora is a major obstacle to its development. Data augmentation techniques can effectively alleviate this scarcity. This paper therefore constructs a pseudo-parallel corpus by introducing noisy data into back translation. Firstly, the corpus is pre-processed. Secondly, back translation and back translation combined with noisy data are each carried out. Thirdly, text similarity matching is performed. Finally, back translation is compared with back translation combined with noisy data. Experiments show that back translation combined with noisy data effectively improves the performance of low-resource machine translation. Specifically, its translation results achieve a 1.10% improvement in BLEU score over back translation alone and a 1.96% improvement over a system that does not use back translation at all.


Key words: data augmentation, noisy data, text similarity matching, corpus pre-processing
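
The pipeline outlined in the abstract (back translation, noise injection, similarity matching) can be illustrated with a minimal sketch. The abstract does not specify the noise type, where the noise is applied, or what the similarity matching compares, so everything below is an assumption for illustration only: the noise (word dropout, masking with a hypothetical `<BLANK>` token, local shuffling) is a common choice in noisy back translation, it is applied to the synthetic source side, `back_translate` is a hypothetical callable standing in for a trained target-to-source model, and token-level Jaccard similarity between the clean and noisy synthetic source is only a stand-in for the paper's matching step.

```python
import random
from typing import Callable, List, Tuple

Tokens = List[str]

def add_noise(tokens: Tokens, rng: random.Random, p_drop: float = 0.1,
              p_blank: float = 0.1, max_shift: float = 3,
              blank: str = "<BLANK>") -> Tokens:
    """Perturb a token list: drop words, mask words, then shuffle locally."""
    kept = [t for t in tokens if rng.random() >= p_drop]
    if not kept and tokens:
        kept = [tokens[0]]  # never return an empty sentence
    masked = [blank if rng.random() < p_blank else t for t in kept]
    # Local shuffle: sort by (index + random offset), so words only move
    # a short distance from their original position.
    keys = [i + rng.uniform(0, max_shift) for i in range(len(masked))]
    return [t for _, t in sorted(zip(keys, masked), key=lambda kt: kt[0])]

def jaccard(a: Tokens, b: Tokens) -> float:
    """Token-set overlap in [0, 1]; a stand-in similarity measure."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def build_pseudo_parallel(
    mono_target: List[Tokens],
    back_translate: Callable[[Tokens], Tokens],  # hypothetical model wrapper
    similarity: Callable[[Tokens, Tokens], float] = jaccard,
    threshold: float = 0.5,
    seed: int = 0,
) -> List[Tuple[Tokens, Tokens]]:
    """Return (noisy synthetic source, real target) pairs that pass the filter."""
    rng = random.Random(seed)
    pairs = []
    for tgt in mono_target:
        src = back_translate(tgt)        # synthetic source sentence
        noisy_src = add_noise(src, rng)  # assumed: noise on the source side
        # Assumed filter: discard pairs where noise destroyed too much content.
        if similarity(src, noisy_src) >= threshold:
            pairs.append((noisy_src, tgt))
    return pairs
```

In such a setup, the resulting pairs would be mixed with the genuine parallel corpus before training the forward model; the threshold and noise probabilities are tunable and are not taken from the paper.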