
Computer Engineering & Science, 2025, Vol. 47, Issue 11: 2056-2066.

• Artificial Intelligence and Data Mining •


Image-text emotion classification based on visual feature enhancement and bidirectional interaction fusion

WANG Luyao (王露瑶), HU Huijun (胡慧君), LIU Maofu (刘茂福)

  (1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, China;
   2. Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan 430065, China)
  • Received: 2024-02-27; Revised: 2024-07-13; Online: 2025-11-25; Published: 2025-12-08
  • Supported by: the "14th Five-Year Plan" Advantageous and Characteristic Disciplines (Group) Project of Hubei Province (2023D0302)

Abstract: Multimodal sentiment analysis, which aims to predict emotion from multimodal information such as text and images, has attracted increasing attention. Compared with text, the visual modality, as an auxiliary modality, may contain more confounding or redundant information unrelated to emotion, and existing studies have not fully considered the interaction and complementarity among multiple perceptual modalities. To address these issues, an image-text emotion classification model based on visual feature enhancement and bidirectional interactive fusion (VFEBIF) is proposed. Its fine-grained visual feature enhancement module uses the structured knowledge of scene graphs and CLIP-based filtering to extract text keywords related to the visual semantics, thereby enhancing local visual features. In addition, its bidirectional interactive fusion module carries out inter-modal interaction in parallel and fuses the multimodal features to fully exploit the complementary information between text and image, on which emotion classification is performed. Experiments on two public datasets, TumEmo and MVSA-Single, demonstrate that the VFEBIF model outperforms most existing models and effectively improves emotion classification performance.
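The abstract describes the two modules only at a high level. As a rough illustration of the CLIP-based filtering step, the sketch below scores candidate text keywords against an image with an open CLIP checkpoint and keeps those above a similarity threshold; the checkpoint, the threshold value, and the idea that candidates come from scene-graph objects and relations are assumptions made for illustration, not the authors' implementation.

```python
# Hedged sketch: score candidate text keywords against an image with CLIP and
# keep only those whose image-text similarity passes a threshold. The checkpoint,
# candidate source, and threshold are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def filter_keywords(image: Image.Image, candidates: list[str], threshold: float = 0.25):
    """Return the candidate keywords whose CLIP image-text cosine similarity exceeds the threshold."""
    inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
        # Normalize the projected embeddings and compute cosine similarities.
        image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
        text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
        sims = (image_emb @ text_emb.T).squeeze(0)  # shape: (num_candidates,)
    return [kw for kw, s in zip(candidates, sims.tolist()) if s > threshold]

# Example: candidates might be objects/relations extracted from a scene graph of the post.
# keywords = filter_keywords(Image.open("post.jpg"), ["smiling child", "rainy street", "birthday cake"])
```

Similarly, a minimal sketch of one common way to realize bidirectional interactive fusion: two cross-attention directions (text attending to image regions and image attending to text tokens) run in parallel, and their pooled outputs are concatenated for classification. Feature dimensions, head counts, the class count, and the concatenation-based fusion head are assumptions and may differ from the VFEBIF model.

```python
import torch
import torch.nn as nn

class BidirectionalInteractionFusion(nn.Module):
    """Sketch of parallel text->image and image->text cross-attention followed by fusion.

    Dimensions, head counts, and the concatenation-based fusion head are assumptions,
    not the paper's specification.
    """
    def __init__(self, dim: int = 768, num_heads: int = 8, num_classes: int = 7):
        super().__init__()
        self.text_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Text queries attend to image regions; image queries attend to text tokens (in parallel).
        t_attended, _ = self.text_to_image(text_feats, image_feats, image_feats)
        i_attended, _ = self.image_to_text(image_feats, text_feats, text_feats)
        # Pool each attended sequence and fuse the two views by concatenation.
        fused = torch.cat([t_attended.mean(dim=1), i_attended.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Example shapes: a batch of 4 posts with 32 text tokens and 49 image regions, 768-d features;
# num_classes is set per dataset (e.g., TumEmo's emotion labels).
# logits = BidirectionalInteractionFusion()(torch.randn(4, 32, 768), torch.randn(4, 49, 768))
```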

Key words: multimodal sentiment analysis, image-text emotion classification, visual feature enhancement, bidirectional interactive fusion