• Journal of the China Computer Federation
  • Chinese Core Journal of Science and Technology
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (1): 150-159.

• Artificial Intelligence and Data Mining •

An attention-guided dual-granularity cross-modal medical representation learning framework

CHEN Xinran1, LIU Ning1, YAN Zhongmin1, LIU Lei2, CUI Lizhen1

  (1. School of Software, Shandong University, Jinan 250101, China;
    2. Shandong Research Institute of Industrial Technology, Jinan 250100, China)
  • Received:2024-06-27 Revised:2024-08-29 Online:2025-01-25 Published:2025-01-18

Abstract: Deep learning has achieved significant results in medical imaging diagnosis, and models based on deep neural networks can effectively assist physicians in decision-making. However, as model parameter scales grow, large-parameter models in the medical domain increasingly face the challenge of data scarcity, because labeling high-quality medical image data must be done manually by professional physicians. One solution is to guide training with the medical reports paired with medical images, which requires interaction between two modalities. However, cross-modal alignment methods from the general domain fail to capture fine-grained detail and cannot be directly applied to the medical domain. To address this issue, an attention-guided dual-granularity cross-modal medical representation learning framework, ADCRL, is proposed to align medical images and reports at both coarse and fine granularities. ADCRL extracts features from medical images and medical reports at two granularities, uses an attention-guided module to select image regions relevant to medical tasks while discarding noisy regions, and aligns the two modalities at each granularity through contrastive-learning-based proxy tasks. ADCRL trains models under an unsupervised paradigm to understand both the global and the detailed semantics of the two modalities, and achieves excellent downstream performance using only limited annotated data. The main contributions include a fine-grained feature selection method, a dual-granularity cross-modal feature learning framework, and the pretraining and validation of the framework's effectiveness on publicly available medical datasets.
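The contrastive alignment of paired images and reports described in the abstract is commonly implemented as a symmetric InfoNCE objective, where matched image-report pairs in a batch are pulled together and all other pairings serve as negatives. The sketch below, in NumPy, illustrates that general mechanism only; it is a minimal illustration, not ADCRL's actual formulation, and the function name, temperature value, and embedding shapes are assumptions:

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/report embeddings.

    img_emb, txt_emb: (B, D) arrays; row i of each is a matched pair.
    Matched pairs are treated as positives; every other pairing in the
    batch is a negative (illustrative sketch, not the paper's method).
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (B, B): positives on the diagonal

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return -log_prob[idx, idx].mean()  # -log p(positive) per row

    # average the image-to-report and report-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In a dual-granularity setup such as the one the abstract describes, a loss of this shape would typically be applied twice: once to global (coarse-grained) embeddings and once to the attention-selected region/phrase (fine-grained) embeddings, with the two terms summed.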

Key words: deep learning, medical image, self-supervised learning, contrastive learning, pretraining model, data augmentation