• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (10): 1861-1868.

• Artificial Intelligence and Data Mining •

A ChineseVietnamese parallel corpus expansion method based on back translation and proportional extraction siamese network screening

WANG Kechao1,GUO Junjun1,2,ZHANG Ya-fei1,2,GAO Sheng-xiang1,2,YU Zheng-tao1,2   

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
  2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China

Abstract: As an important data enhancement method in machine translation, back translation has attracted growing attention from researchers. The basic idea is to first train a base translation model on a parallel corpus, then use that model to translate a monolingual corpus into the target language, and finally combine the result into a new corpus for model training. However, in the low-resource Chinese-Vietnamese scenario, the base translation model performs poorly, so the parallel corpus it produces through back translation contains considerable noise and is difficult to use for downstream tasks. To address this problem, a siamese network screening model based on proportional extraction is constructed. Through training, the model learns to distinguish parallel sentence pairs from pseudo-parallel sentence pairs, and it filters and denoises the back-translated pseudo-parallel corpus in a shared semantic space, thereby yielding a higher-quality parallel corpus. Experimental results on a Chinese-Vietnamese data set show that the proposed method significantly outperforms the baseline system.
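The screening step summarized in the abstract (scoring each back-translated sentence pair in a shared semantic space and discarding the noisy ones) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' model: the names `cosine` and `screen_pairs`, the `encode` callable standing in for a trained shared-weight (siamese) encoder, the use of cosine similarity, and the `threshold` value are all hypothetical. The proportional extraction strategy and the training procedure are not described in the abstract and are not modeled here.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def screen_pairs(
    pairs: List[Tuple[str, str]],
    encode: Callable[[str], Vector],
    threshold: float = 0.8,
) -> List[Tuple[str, str]]:
    """Keep only the sentence pairs whose two sides lie close together.

    `encode` is a hypothetical stand-in for a trained siamese encoder:
    the SAME function is applied to both sides of a pair, so both
    sentences are mapped into one shared vector space.
    `threshold` is an assumed cut-off, not a value from the paper.
    """
    kept = []
    for src, tgt in pairs:
        if cosine(encode(src), encode(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept
```

In use, `pairs` would hold the pseudo-parallel sentence pairs produced by back translation, and the returned list would be the filtered corpus added to the training data.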


  • Received:2020-12-07 Revised:2021-02-23 Accepted:2022-10-25 Online:2022-10-25 Published:2022-10-28

Key words: Chinese-Vietnamese parallel corpus expansion; back translation; data enhancement; proportional extraction; siamese network