Construction of Mongolian-Chinese pseudo-parallel corpus enhanced by noisy data

Computer Engineering & Science, 2025, Vol. 47, Issue (4): 751-760.

• Artificial Intelligence and Data Mining •


TIAN Yonghong, ZHANG Junjin, SONG Zheyu

(College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010080, China)

Received: 2023-09-14 Revised: 2024-07-30 Online: 2025-04-25 Published: 2025-04-17

Abstract

Neural machine translation (NMT), the mainstream approach to machine translation, performs well on general translation tasks, but its translation quality relies heavily on large-scale parallel corpora. For low-resource languages, corpus scarcity is a significant obstacle to its development. Data augmentation techniques can effectively alleviate this scarcity. This paper therefore constructs a pseudo-parallel corpus by introducing noisy data into back translation. First, the corpus is pre-processed. Second, both plain back translation and back translation combined with noisy data are carried out. Third, text similarity matching is performed. Finally, plain back translation is compared with back translation combined with noisy data. Experiments show that back translation combined with noisy data effectively improves low-resource machine translation: its BLEU score is 1.10% higher than that obtained with back translation alone and 1.96% higher than that obtained without back translation.

Key words: data augmentation, noisy data, text similarity matching, corpus pre-processing
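
The steps named in the abstract (back translation, noise injection, text similarity matching) can be illustrated with a minimal Python sketch. The abstract does not say which noise types, similarity measure, or translation model the authors use, so everything here is an assumption made for illustration: token dropout plus a local shuffle as the noise, `difflib.SequenceMatcher` as the similarity measure, a filter that compares the noised sentence with its clean back translation, and the hypothetical names `add_noise`, `similarity`, `build_pseudo_parallel`, and `back_translate`.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def add_noise(tokens: List[str], rng: random.Random,
              drop_prob: float = 0.1, max_shift: int = 2) -> List[str]:
    """Assumed noise model: random token dropout followed by a local shuffle."""
    # Drop each token with probability drop_prob; keep one token if all are dropped.
    kept = [t for t in tokens if rng.random() >= drop_prob] or tokens[:1]
    # Local shuffle: tokens more than max_shift positions apart never swap order.
    keys = [i + rng.uniform(0, max_shift) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for the paper's measure."""
    return SequenceMatcher(None, a, b).ratio()


def build_pseudo_parallel(
    mono_target: List[str],
    back_translate: Callable[[str], str],  # hypothetical target-to-source model
    rng: random.Random,
    min_sim: float = 0.5,
) -> List[Tuple[str, str]]:
    """Return (noised synthetic source, real target) sentence pairs."""
    pairs = []
    for tgt in mono_target:
        src = back_translate(tgt)  # synthetic source sentence
        noisy_src = " ".join(add_noise(src.split(), rng))
        # Assumed filter: discard pairs whose noise distorted the source too much.
        if similarity(src, noisy_src) >= min_sim:
            pairs.append((noisy_src, tgt))
    return pairs
```

In practice `back_translate` would wrap a trained target-to-source NMT model, and `min_sim`, `drop_prob`, and `max_shift` would be tuned on held-out data; the values above are placeholders, not settings reported in the paper.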

TIAN Yonghong, ZHANG Junjin, SONG Zheyu. Construction of Mongolian-Chinese pseudo-parallel corpus enhanced by noisy data[J]. Computer Engineering & Science, 2025, 47(4): 751-760.
