• A journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

计算机工程与科学 (Computer Engineering & Science), 2025, Vol. 47, Issue (12): 2261-2268.

• Artificial Intelligence and Data Mining •


Multimodal end-to-end Mongolian speech translation based on multi-task learning and knowledge distillation

ZANG Richeng, GAO Guanglai, FEI Long

  (1. College of Computer Science (College of Software), Inner Mongolia University, Hohhot 010021;
    2. National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, Inner Mongolia University, Hohhot 010021;
    3. Inner Mongolia Key Laboratory of Multilingual Artificial Intelligence Technology, Hohhot 010021, China)
  • Received: 2025-03-31  Revised: 2025-06-11  Online: 2025-12-25  Published: 2026-01-06
  • Supported by: National Natural Science Foundation of China (62566045); Science and Technology Program of Inner Mongolia Autonomous Region (2025KYPT0041); Scientific Research Special Project for First-Class Disciplines of Inner Mongolia Autonomous Region (YLXKZX-ND-036)


Abstract: End-to-end speech translation aims to automatically convert source-language speech into the target language, and has made significant progress in many domains in recent years. Its performance on Mongolian, however, still leaves room for improvement, mainly because Mongolian-Chinese speech translation datasets are scarce, so existing models handle Mongolian speech translation poorly. To overcome these difficulties, this study takes the following measures. First, a large-scale Mongolian-Chinese parallel speech translation dataset is collected and constructed to support the training of translation models. Second, a joint learning strategy is introduced: through parameter sharing between the encoder and the decoder, knowledge transfer between the speech translation and machine translation tasks is promoted. In addition, to narrow the modality gap between speech and text, a cross-attention regularization method is adopted to strengthen the model's ability to understand and exploit inputs from different modalities. The machine translation model is dynamically updated through knowledge distillation, which further improves the performance of the speech translation model. Finally, a speech synthesis module is integrated to realize Mongolian-to-Chinese speech-to-speech translation. Experimental results show that the proposed model achieves a significant improvement in translation accuracy: compared with a directly trained speech translation model, its BLEU score increases by nearly 2.00 points.

Key words: Mongolian; speech translation; knowledge distillation; multi-task learning
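
For readers who want a concrete picture of the training objective sketched in the abstract, the following minimal PyTorch-style sketch assembles the three ingredients it names: cross-entropy on the speech translation task, knowledge distillation from the machine translation model, and cross-attention regularization across modalities. All function names, tensor shapes, loss weights (alpha, beta, temperature), and the pad index are illustrative assumptions, not the authors' implementation.

import torch.nn.functional as F

# Hypothetical sketch only. Assumed shapes: logits are (batch, target_len,
# vocab); attention maps are (batch, target_len, source_len), already aligned
# to a common source length across modalities; targets are (batch, target_len)
# token ids.

def distillation_loss(st_logits, mt_logits, temperature=2.0):
    # Soften both distributions and match the ST student to the MT teacher
    # with KL divergence (standard Hinton-style distillation).
    student = F.log_softmax(st_logits / temperature, dim=-1)
    teacher = F.softmax(mt_logits / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2

def cross_attention_regularization(speech_attn, text_attn):
    # Pull the decoder's cross-attention over speech features toward its
    # cross-attention over text features, narrowing the modality gap.
    return F.mse_loss(speech_attn, text_attn)

def combined_loss(st_logits, mt_logits, speech_attn, text_attn, targets,
                  alpha=0.5, beta=0.1, pad_id=0):
    # Ordinary cross-entropy on gold target tokens (speech translation task);
    # F.cross_entropy expects (batch, vocab, target_len), hence the transpose.
    ce = F.cross_entropy(st_logits.transpose(1, 2), targets,
                         ignore_index=pad_id)
    # Distill from the machine translation teacher's soft predictions.
    kd = distillation_loss(st_logits, mt_logits.detach())
    # Align cross-attention distributions across the two input modalities.
    reg = cross_attention_regularization(speech_attn, text_attn)
    return ce + alpha * kd + beta * reg

Note that in the paper's setting the teacher is not frozen: the machine translation model is itself updated during training, and encoder/decoder parameters are shared between the two tasks, so a loss like the one above would be computed inside a joint training loop rather than against a fixed mt_logits.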