• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (12): 2261-2268.

• Artificial Intelligence and Data Mining •

Multimodal end-to-end Mongolian speech translation based on multi-task learning and knowledge distillation

ZANG Richeng, GAO Guanglai, FEI Long

  (1. College of Computer Science (College of Software), Inner Mongolia University, Hohhot 010021;
    2. National & Local Joint Engineering Research Center of Intelligent Information Processing Technology
    for Mongolian, Inner Mongolia University, Hohhot 010021;
    3. Inner Mongolia Key Laboratory of Multilingual Artificial Intelligence Technology, Hohhot 010021, China)
  • Received: 2025-03-31 Revised: 2025-06-11 Online: 2025-12-25 Published: 2026-01-06

Abstract: End-to-end speech translation aims to convert source-language speech automatically into the target language, and it has made significant progress in many fields in recent years. Its performance on Mongolian speech translation, however, still needs improvement. The main obstacle is the scarcity of Mongolian-Chinese speech translation datasets, which limits how well existing models handle this task. To overcome these difficulties, this study takes the following measures. First, a large-scale Mongolian-Chinese parallel speech translation dataset is collected and constructed to support the training of translation models. Second, a joint learning strategy is introduced: through parameter sharing between the encoder and decoder, knowledge transfer between the speech translation and machine translation tasks is promoted. Third, to narrow the modality gap between speech and text, a cross-attention regularization method is adopted to strengthen the model's ability to understand and use inputs of different modalities. In addition, the machine translation model is dynamically updated through knowledge distillation, which further improves the performance of the speech translation model. Finally, a speech synthesis module is integrated to realize speech-to-speech translation. Experimental results show that the proposed method significantly improves translation accuracy: compared with a directly trained speech translation model, its BLEU score increases by nearly 2.00.

Key words: Mongolian; speech translation; knowledge distillation; multi-task learning
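
The abstract does not state the training objective, so the following is only a minimal sketch of how a multi-task loss with knowledge distillation is commonly assembled, not the authors' formulation. It assumes a per-token speech translation (ST) cross-entropy, a machine translation (MT) cross-entropy, and a word-level KL term that pulls the ST output distribution toward the MT one. The function names, the weights `alpha` and `beta`, and the `temperature` parameter are all hypothetical, and the cross-attention regularization term is left out.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def cross_entropy(logits, target_id):
    """Negative log-likelihood of the reference token."""
    return -log_softmax(logits)[target_id]


def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def joint_loss(st_logits, mt_logits, target_ids,
               alpha=1.0, beta=1.0, temperature=1.0):
    """Average per-token loss: ST CE + alpha * MT CE + beta * KD.

    st_logits and mt_logits are lists (one entry per target token) of
    vocabulary logits from the ST student and the MT teacher.
    alpha, beta and temperature are assumed hyperparameters.
    """
    n = len(target_ids)
    st_ce = sum(cross_entropy(s, t) for s, t in zip(st_logits, target_ids)) / n
    mt_ce = sum(cross_entropy(m, t) for m, t in zip(mt_logits, target_ids)) / n
    kd = sum(kl_divergence(m, s, temperature)
             for m, s in zip(mt_logits, st_logits)) / n
    return st_ce + alpha * mt_ce + beta * kd


# Toy example: two target tokens over a four-word vocabulary.
st = [[2.0, 0.5, 0.1, 0.1], [0.2, 1.5, 0.3, 0.1]]
mt = [[3.0, 0.2, 0.1, 0.1], [0.1, 2.5, 0.2, 0.1]]
print(joint_loss(st, mt, target_ids=[0, 1], alpha=1.0, beta=0.5))
```

Because the MT cross-entropy is part of the same sum, the teacher keeps training alongside the student in this sketch, which is one plausible reading of "dynamically updated"; the paper itself may schedule or weight these terms differently.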