Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (1): 150-159.
• Artificial Intelligence and Data Mining •
An attention-guided dual-granularity cross-modal medical representation learning framework

CHEN Xinran1, LIU Ning1, YAN Zhongmin1, LIU Lei2, CUI Lizhen1
Abstract: Deep learning has achieved significant results in medical image diagnosis, and models based on deep neural networks can effectively assist physicians in decision-making. However, as model parameter scales grow, large-parameter models in the medical domain increasingly face the challenge of data scarcity, since labeling high-quality medical image data must be done manually by professional physicians. One solution is to introduce the medical reports paired with medical images to guide training, which involves the interaction of two modalities. However, cross-modal alignment methods from the general domain fail to capture fine-grained detail and so cannot be applied directly to the medical domain. To address this issue, an attention-guided dual-granularity cross-modal medical representation learning framework, ADCRL, is proposed to align medical images and reports at both coarse and fine granularity. ADCRL extracts features from medical images and medical reports at two granularities, uses an attention-guided module to select image regions relevant to the medical task while removing noisy regions, and aligns the two modalities at each granularity through contrastive-learning-based proxy tasks. Trained under a self-supervised paradigm, ADCRL learns both the global and the detailed semantics of the two modalities, and achieves excellent downstream performance using only limited annotated data. The main contributions are a fine-grained feature selection method and a dual-granularity cross-modal feature learning framework, which is pretrained and validated on publicly available medical datasets.
Key words: deep learning, medical image, self-supervised learning, contrastive learning, pretraining model, data augmentation
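The two mechanisms the abstract describes can be sketched in a minimal form. This is an illustration, not the authors' implementation: `attention_select` stands in for the attention-guided module (here scoring image regions against a report-derived query vector and keeping the top-k, a hypothetical selection rule), and `info_nce` is the standard symmetric contrastive (InfoNCE) objective of the kind such proxy tasks are based on. All function names, dimensions, and hyperparameters are assumptions.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_select(region_feats, query, keep=4):
    """Attention-guided selection (sketch): score each image region
    against a report-derived query vector, keep the `keep` highest-
    scoring regions, drop the rest as noise, and pool the kept
    regions with their attention weights."""
    scores = region_feats @ query            # (R,) relevance per region
    idx = np.argsort(scores)[-keep:]         # indices of the kept regions
    weights = softmax(scores[idx])           # attention over kept regions
    pooled = (weights[:, None] * region_feats[idx]).sum(axis=0)
    return pooled, idx

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image-report pairs are positives,
    all other pairs in the batch serve as negatives."""
    img, txt = l2norm(img_emb), l2norm(txt_emb)
    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    diag = np.arange(len(img))
    loss_i2t = -np.log(softmax(logits, axis=1)[diag, diag]).mean()
    loss_t2i = -np.log(softmax(logits, axis=0)[diag, diag]).mean()
    return (loss_i2t + loss_t2i) / 2

# Toy usage: 8 image regions, 16-dim features, a batch of 5 pairs.
rng = np.random.default_rng(0)
regions = rng.normal(size=(8, 16))
report_query = rng.normal(size=16)
fine_img_emb, kept = attention_select(regions, report_query, keep=4)

img_batch = rng.normal(size=(5, 16))
txt_batch = rng.normal(size=(5, 16))
loss = info_nce(img_batch, txt_batch)
```

In a dual-granularity setup of this kind, one InfoNCE term would be computed over global (coarse) image/report embeddings and another over the attention-selected (fine) region/phrase embeddings, with the two losses summed during pretraining.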
CHEN Xinran, LIU Ning, YAN Zhongmin, LIU Lei, CUI Lizhen. An attention-guided dual-granularity cross-modal medical representation learning framework[J]. Computer Engineering & Science, 2025, 47(1): 150-159.
URL: http://joces.nudt.edu.cn/EN/Y2025/V47/I1/150