Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (4): 751-760.
• Artificial Intelligence and Data Mining •
TIAN Yonghong,ZHANG Junjin,SONG Zheyu
Abstract: Neural machine translation (NMT), the mainstream approach to machine translation, performs well on general translation tasks. However, its translation quality depends heavily on large-scale parallel corpora, and for low-resource languages the scarcity of such corpora is a significant obstacle. Data augmentation techniques can alleviate this data scarcity. This paper therefore constructs a pseudo-parallel corpus by introducing noisy data into back translation. First, the corpus is pre-processed. Second, back translation and back translation combined with noisy data are carried out. Third, text similarity matching is performed. Finally, plain back translation is compared with back translation combined with noisy data. Experiments show that back translation combined with noisy data effectively improves low-resource machine translation performance. Specifically, its BLEU score improves by 1.10% over back translation alone and by 1.96% over results obtained without any back translation.
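The pipeline summarized in the abstract (back translation, noise injection, similarity-based filtering) can be illustrated with a minimal Python sketch. The abstract does not say which noise operations or which similarity measure the authors use, so everything here is an assumption for illustration: `add_noise` applies word dropout and a local shuffle, `jaccard` is a toy token-overlap score standing in for a real cross-lingual similarity measure, and `back_translate` is a hypothetical callable wrapping a target-to-source translation model. Applying the noise to the synthetic source side is likewise an assumption.

```python
import random
from typing import Callable, List, Tuple


def add_noise(tokens: List[str], rng: random.Random,
              drop_prob: float = 0.1, max_shift: float = 3.0) -> List[str]:
    """Perturb a token sequence with word dropout and a local shuffle.

    Both operations are illustrative assumptions, not the paper's method.
    """
    # Word dropout: each token is removed with probability drop_prob.
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if not kept and tokens:
        kept = [tokens[0]]  # never return an empty sentence
    # Local shuffle: sort by position plus a random offset in [0, max_shift].
    keys = [i + rng.uniform(0.0, max_shift) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]


def jaccard(a: str, b: str) -> float:
    """Toy similarity: token-set overlap (placeholder for a real measure)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def build_pseudo_parallel(
    targets: List[str],
    back_translate: Callable[[str], str],     # hypothetical target->source model
    similarity: Callable[[str, str], float],  # hypothetical similarity scorer
    threshold: float,
    rng: random.Random,
) -> List[Tuple[str, str]]:
    """Return (noisy synthetic source, original target) pairs that pass the filter."""
    pairs = []
    for tgt in targets:
        synthetic_src = back_translate(tgt)
        # Similarity matching: discard pairs scored below the threshold.
        if similarity(synthetic_src, tgt) < threshold:
            continue
        noisy_src = " ".join(add_noise(synthetic_src.split(), rng))
        pairs.append((noisy_src, tgt))
    return pairs
```

In practice `back_translate` would call a trained translation model and `similarity` would compare sentences across languages; the seeded `random.Random` instance keeps the noise reproducible between runs.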
Key words: data augmentation, noisy data, text similarity matching, corpus pre-processing
TIAN Yonghong, ZHANG Junjin, SONG Zheyu. Construction of Mongolian-Chinese pseudo-parallel corpus enhanced by noisy data[J]. Computer Engineering & Science, 2025, 47(4): 751-760.
URL: http://joces.nudt.edu.cn/EN/
http://joces.nudt.edu.cn/EN/Y2025/V47/I4/751