
Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (01): 170-178.

• Artificial Intelligence and Data Mining •


Cross-lingual AMR parsing based on unsupervised pre-training

FAN Lin-yu, LI Jun-hui, KONG Fang

  1. (School of Computer Science & Technology, Soochow University, Suzhou 215006, China)
  • Received: 2022-10-21  Revised: 2022-12-05  Accepted: 2024-01-25  Online: 2024-01-25  Published: 2024-01-15


Abstract: Abstract Meaning Representation (AMR) abstracts the semantic features of a given text into a single-rooted directed acyclic graph. Because AMR datasets are scarce for non-English languages, cross-lingual AMR parsing typically refers to parsing text in a non-English target language into the AMR graph of its English translation. Existing work on cross-lingual AMR parsing relies on large-scale English-target-language parallel corpora or high-performance English-target-language translation models to build (English, target language, AMR) triple parallel corpora for target-language AMR parsing. Departing from this assumption, this paper explores cross-lingual AMR parsing when only large-scale monolingual English and monolingual target-language corpora are available. To this end, we propose a cross-lingual AMR parsing method based on unsupervised pre-training. Specifically, during pre-training we jointly train an unsupervised neural machine translation task with English and target-language AMR parsing tasks; during fine-tuning, we perform single-task fine-tuning on a target-language AMR dataset converted from English AMR 2.0. Experimental results on AMR 2.0 and a multilingual AMR test set show that the proposed method achieves Smatch F1 scores of 67.89%, 68.04%, and 67.99% on German, Spanish, and Italian, respectively.
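
The pre-training recipe above mixes three seq2seq tasks into one model. As a rough, hypothetical sketch (not the authors' code), the snippet below shows one way the task-tagged (source, target) training pairs could be assembled; the task tags, the token-dropping noise function, and the use of translated "silver" target-language sentences for the target-language AMR parsing task are all assumptions of this illustration.

```python
# Hypothetical sketch: assembling task-tagged training pairs for the three
# pre-training tasks named in the abstract. Tags, field names, and the
# "silver" target-language data are assumptions, not the paper's code.
import random

def denoise(tokens, drop_prob=0.15):
    """Noising for the denoising-autoencoder leg of unsupervised NMT:
    drop random tokens so the model must reconstruct the original."""
    kept = [t for t in tokens if random.random() > drop_prob]
    return kept or tokens  # never emit an empty source

def make_pretraining_pairs(en_sents, tgt_sents, silver_amr):
    """Build (source, target) pairs mixing the three pre-training tasks."""
    pairs = []
    # 1) Unsupervised NMT, represented here only by its denoising
    #    objective; the on-the-fly back-translation leg is omitted.
    for s in en_sents:
        pairs.append((["<denoise_en>"] + denoise(s), s))
    for s in tgt_sents:
        pairs.append((["<denoise_tgt>"] + denoise(s), s))
    # 2) English AMR parsing, and 3) target-language AMR parsing, the
    #    latter on sentences assumed to be machine-translated ("silver").
    for sent_en, sent_tgt, amr in silver_amr:
        pairs.append((["<parse_en>"] + sent_en, amr))
        pairs.append((["<parse_tgt>"] + sent_tgt, amr))
    random.shuffle(pairs)
    return pairs

if __name__ == "__main__":
    en = [["the", "boy", "wants", "to", "go"]]
    de = [["der", "junge", "will", "gehen"]]
    silver = [(en[0], de[0], ["(w", "/", "want-01)"])]
    for src, tgt in make_pretraining_pairs(en, de, silver):
        print(src, "->", tgt)
```

In the actual method, these mixed pairs would all feed a single sequence-to-sequence model; only the data plumbing is sketched here.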


Key words: cross-lingual AMR parsing, seq2seq model, pre-trained model
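
For readers unfamiliar with the reported metric: Smatch scores an AMR parse by the F1 of matched (relation, head, tail) triples under the best variable alignment. The toy function below, a simplification and not the real Smatch implementation, illustrates only the F1 step by assuming the variables are already aligned; actual Smatch searches over alignments with hill-climbing.

```python
# Toy Smatch-style F1 over already-aligned AMR triples (an assumption:
# real Smatch must first search for the best variable alignment).
def smatch_f1(pred_triples, gold_triples):
    matched = len(set(pred_triples) & set(gold_triples))
    p = matched / max(len(pred_triples), 1)  # precision
    r = matched / max(len(gold_triples), 1)  # recall
    return 2 * p * r / (p + r) if p + r else 0.0

pred = {("instance", "w", "want-01"), ("ARG0", "w", "b"), ("instance", "b", "boy")}
gold = {("instance", "w", "want-01"), ("ARG0", "w", "b"), ("instance", "b", "girl")}
print(f"Smatch-style F1: {smatch_f1(pred, gold):.2%}")  # -> 66.67%
```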