• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (10): 1888-1900.

• Artificial Intelligence and Data Mining •

A low-rank cross-modal Transformer for multimodal sentiment analysis

SUN Jie, CHE Wen-gang, GAO Sheng-xiang

1. (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China)
  • Received:2023-09-25 Revised:2023-11-06 Accepted:2024-10-25 Online:2024-10-25 Published:2024-10-30

Abstract: Multimodal sentiment analysis, which extends text-based affective computing to multimodal contexts with visual and speech modalities, is an emerging research area. In the pretrain-finetune paradigm, fine-tuning large pretrained language models is necessary for good performance on multimodal sentiment analysis. However, fine-tuning such models remains prohibitively expensive, and insufficient cross-modal interaction further hinders performance. Therefore, a low-rank cross-modal Transformer (LRCMT) is proposed to address these limitations. Inspired by the low-rank parameter updates exhibited by large pretrained models when adapting to natural language tasks, LRCMT injects trainable low-rank matrices into frozen layers, significantly reducing the number of trainable parameters while still allowing dynamic word representations. Moreover, a cross-modal module is designed in which the visual and speech modalities interact before fusing with the text modality. Extensive experiments on benchmark datasets demonstrate LRCMT's efficiency and effectiveness: it achieves comparable or better performance than full fine-tuning while tuning only ~0.76% of the parameters, and it obtains state-of-the-art or competitive results on multiple metrics. Ablation studies validate that both low-rank fine-tuning and sufficient cross-modal interaction contribute to LRCMT's strong performance. This work reduces fine-tuning cost and provides insights into efficient and effective cross-modal fusion.
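The low-rank injection described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the general idea (a frozen pretrained weight augmented by a trainable low-rank product, as in LoRA-style adaptation); the hidden size, rank, and initialization below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Illustrative sizes (assumed, not from the paper): hidden dim d, low rank r.
d, r = 768, 8
rng = np.random.default_rng(0)

W_frozen = rng.standard_normal((d, d))   # pretrained weight, kept frozen
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor (r x d)
B = np.zeros((d, r))                     # trainable factor, zero-initialized
                                         # so training starts from W_frozen

def forward(x):
    # Effective weight is W_frozen + B @ A; only A and B would receive
    # gradients during fine-tuning, so the update to W is rank <= r.
    return x @ (W_frozen + B @ A).T

trainable = A.size + B.size              # 2 * r * d parameters
total = W_frozen.size + trainable
print(f"trainable fraction: {trainable / total:.4%}")
```

With these assumed sizes, the trainable factors hold 2·r·d = 12,288 parameters against d² = 589,824 frozen ones, which conveys why tuning only a sub-percent fraction of parameters (the paper reports ~0.76%) can still adapt the model.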


Key words: multimodal, sentiment analysis, pretrained language model, cross-modal transformer