Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (03): 546-553.

• Artificial Intelligence and Data Mining •

A Chinese-Vietnamese neural machine translation method using the dual representation of BERT and word embedding

ZHANG Ying-chen1,2, GAO Sheng-xiang1,2, YU Zheng-tao1,2, WANG Zhen-han1,2, MAO Cun-li1,2

  (1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China;
   2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China)
  • Received: 2021-03-28  Revised: 2021-07-22  Accepted: 2023-03-25  Online: 2023-03-25  Published: 2023-03-23
  • Supported by:
    National Natural Science Foundation of China (61972186, U21B2027, 61732005, 61761026, 61672271, 61762056); National Key Research and Development Program of China (2019QY1802, 2019QY1801, 2019QY1800); Yunnan High-Tech Talent Project (201606); Major Science and Technology Special Project of Yunnan Province (202103AA080015, 202002AD080001); Basic Research Program of Yunnan Province (202001AS070014, 2018FB104); Provincial Personnel Training Project of Kunming University of Science and Technology (KKSY201703005)

Abstract: Neural machine translation is the current mainstream machine translation method. However, in low-resource translation tasks such as Chinese-Vietnamese, its performance is far from ideal because the bilingual parallel corpus is small. Considering that pre-trained language models contain rich linguistic information, incorporating their representations into a neural machine translation system may benefit low-resource machine translation. Therefore, this paper proposes a low-resource neural machine translation method that combines the dual representation of the BERT pre-trained language model and word embedding. First, the pre-trained language model and the word embedding are used separately to learn representations of the source language sequence; an attention mechanism establishes the connection between the two representations, and a concatenation operation yields the dual representation vector. Then, through a linear transformation and a self-attention mechanism, the word embedding representation and the pre-trained language model representation are fully and adaptively fused, producing a sufficient representation of the input text and thereby improving the performance of the neural machine translation model. Translation experiments on the Chinese-Vietnamese language pair show that, compared with the baseline system, the proposed method improves BLEU by 1.99 on Chinese-Vietnamese training data of 127,000 parallel sentence pairs and by 4.34 on training data of 70,000 parallel sentence pairs, demonstrating that fusing the dual representation of the BERT pre-trained language model and word embedding can effectively improve the performance of Chinese-Vietnamese machine translation.
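
The fusion step described in the abstract can be illustrated with a short sketch. The following PyTorch code is a minimal, hypothetical reading of that description (module names, dimensions, and the choice of multi-head attention are assumptions, not the authors' released implementation): the word-embedding encoding attends to the BERT encoding, the two are concatenated into a dual representation vector, and a linear transformation followed by self-attention fuses them adaptively.

import torch
import torch.nn as nn

class DualRepresentationFusion(nn.Module):
    """Illustrative fusion of a BERT representation and a word-embedding
    representation of the same source sentence (assumed shapes only)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Attention that relates the two representations (word embedding as query).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Linear transformation applied to the concatenated dual representation.
        self.proj = nn.Linear(2 * d_model, d_model)
        # Self-attention for the fully adaptive fusion described in the abstract.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, h_emb: torch.Tensor, h_bert: torch.Tensor) -> torch.Tensor:
        # h_emb, h_bert: (batch, src_len, d_model)
        attended, _ = self.cross_attn(query=h_emb, key=h_bert, value=h_bert)
        dual = torch.cat([h_emb, attended], dim=-1)   # (batch, src_len, 2*d_model)
        fused = self.proj(dual)                       # (batch, src_len, d_model)
        fused, _ = self.self_attn(fused, fused, fused)
        return fused

# Usage with random tensors standing in for the two encoder outputs:
fusion = DualRepresentationFusion()
h_emb = torch.randn(2, 10, 512)     # word-embedding encoding of the source
h_bert = torch.randn(2, 10, 512)    # BERT encoding, assumed projected to d_model
print(fusion(h_emb, h_bert).shape)  # torch.Size([2, 10, 512])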

Key words: neural machine translation, pre-trained language model, word embedding, Chinese-Vietnamese