• Official journal of the China Computer Federation (CCF)
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学) ›› 2024, Vol. 46 ›› Issue (10): 1888-1900.

• Artificial Intelligence and Data Mining •


A low-rank cross-modal Transformer for multimodal sentiment analysis

SUN Jie, CHE Wen-gang, GAO Sheng-xiang

  1. (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China)
  • Received: 2023-09-25  Revised: 2023-11-06  Accepted: 2024-10-25  Online: 2024-10-25  Published: 2024-10-30
  • Supported by:
    National Natural Science Foundation of China (61972186); Yunnan Provincial Science and Technology Talent and Platform Program (202105AC160018)



Abstract: Multimodal sentiment analysis, which extends text-based affective computing to multimodal settings with visual and speech signals, has become a popular research direction in affective computing. Under the pretrain-finetune paradigm, fine-tuning a pretrained language model is necessary for good performance on multimodal sentiment analysis. However, fine-tuning large-scale pretrained language models remains expensive, and insufficient cross-modal interaction further hinders performance. Therefore, a low-rank cross-modal Transformer (LRCMT) is proposed to address these limitations. Inspired by the low-rank parameter updates that large pretrained language models exhibit when adapting to different natural language processing downstream tasks, LRCMT injects trainable low-rank parameter matrices into each frozen layer, which greatly reduces the number of trainable parameters while still allowing dynamic word representations. Moreover, a cross-modal interaction module is designed in which the visual and speech modalities first interact with each other before interacting with the text modality, enabling fuller cross-modal fusion. Extensive experiments on multimodal sentiment analysis benchmarks demonstrate LRCMT's effectiveness and efficiency: by tuning only about 0.76% of the full parameter count, it achieves performance comparable to or better than full fine-tuning, and it obtains state-of-the-art or competitive results on many metrics. Ablation studies show that both low-rank fine-tuning and sufficient cross-modal interaction contribute to LRCMT's strong performance. Overall, this work reduces the cost of fine-tuning pretrained language models for multimodal tasks and provides insights into efficient and effective cross-modal fusion.
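To make the two mechanisms described in the abstract concrete, the following PyTorch sketch illustrates (a) injecting a trainable low-rank matrix pair into a frozen linear layer, LoRA-style, and (b) a cross-modal block in which the visual and speech streams first attend to each other and only then attend to the text sequence. This is a minimal illustration based solely on the abstract's description, not the authors' implementation; the module names (LoRALinear, CrossModalBlock), dimensions, rank, and single-block setup are all assumptions for demonstration.

```python
# Minimal, self-contained sketch of the ideas in the abstract (illustrative only).
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (LoRA-style)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Only these two small matrices are trainable.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen projection plus the low-rank correction B(Ax).
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)


class CrossModalBlock(nn.Module):
    """Visual and audio streams interact with each other first, then with text."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, visual, audio):
        # Stage 1: visual <-> audio interaction (each modality queries the other).
        v, _ = self.a2v(visual, audio, audio)
        a, _ = self.v2a(audio, visual, visual)
        # Stage 2: the enriched visual/audio streams attend to the text sequence.
        v_t, _ = self.v2t(v, text, text)
        a_t, _ = self.a2t(a, text, text)
        # Pool and fuse; a real model would stack several such blocks.
        return self.norm(text.mean(1) + v_t.mean(1) + a_t.mean(1))


if __name__ == "__main__":
    text = torch.randn(2, 20, 128)    # (batch, text_len, dim)
    visual = torch.randn(2, 50, 128)  # (batch, video_frames, dim)
    audio = torch.randn(2, 40, 128)   # (batch, audio_frames, dim)

    lora_proj = LoRALinear(nn.Linear(128, 128), rank=4)  # frozen layer + low-rank update
    fusion = CrossModalBlock(dim=128, heads=4)
    head = nn.Linear(128, 1)                             # sentiment regression head

    score = head(fusion(lora_proj(text), visual, audio))
    print(score.shape)                                   # torch.Size([2, 1])

    trainable = sum(p.numel() for p in lora_proj.parameters() if p.requires_grad)
    total = sum(p.numel() for p in lora_proj.parameters())
    print(f"trainable fraction of the wrapped layer: {trainable / total:.2%}")
```

In a setup like this, only the low-rank matrices, the fusion module, and the task head receive gradients, which is how a very small trainable-parameter fraction (on the order of 1% of the full model) can still adapt the frozen language model's word representations.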


Key words: multimodal, sentiment analysis, pretrained language model, cross-modal Transformer