• Publication of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)



A classification model of TCM kidney disease based on CLIP and multimodal data fusion

ZHANG Dongyan, ZHANG Lanxiang, WU Chenxu, WANG Lifan, LAN Chengying, DING Xin, YUE Qi

(1. College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China;
    2. Heilongjiang Academy of Traditional Chinese Medicine, Harbin 150040, China)


Abstract: To address diagnostic inaccuracies in traditional Chinese medicine (TCM) nephropathy caused by single-modal data and similar clinical symptoms, this study proposes MLC-CLIP, a TCM nephropathy diagnosis model based on CLIP and multimodal data fusion. The model systematically enhances the CLIP framework by introducing ResNet50 to improve local-feature and hierarchical-information extraction, designing a multi-scale feature extraction module to strengthen CLIP's image encoder, and employing an LSTM to deepen semantic understanding of TCM clinical texts and optimize the text encoding structure. Additionally, a weighted feature fusion module and a cross-modal gated attention feature fusion module are developed and combined with a stepwise fusion strategy to optimize multimodal feature integration. Experimental results on the nephropathy dataset from the Heilongjiang Academy of Traditional Chinese Medicine show that the model achieves accuracy, precision, recall, and F1-score of 94.94%, 95.07%, 94.89%, and 94.98% on the multimodal classification task, representing increases of 10.68%, 9.31%, 9.58%, and 9.45% over the original CLIP model. The results confirm that MLC-CLIP effectively integrates multimodal information from tongue images and clinical texts, demonstrates stronger discriminative capability on complex cases with similar symptoms, and thus provides reliable decision support for TCM nephropathy diagnosis.
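The abstract names a cross-modal gated attention feature fusion module but gives no equations, so the following NumPy sketch shows only one common form of such gating: a sigmoid gate computed from the concatenated image and text embeddings that blends the two modalities per dimension. The function name, dimensions, and gating formula here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(img_feat, txt_feat, W, b):
    """Blend image and text embeddings with a learned per-dimension gate.

    The gate g is computed from the concatenated features; each output
    dimension is a convex combination g * img + (1 - g) * txt, so the
    fused vector stays between the two modality features elementwise.
    """
    z = np.concatenate([img_feat, txt_feat], axis=-1)  # shape (2d,)
    g = sigmoid(z @ W + b)                             # gate in (0, 1), shape (d,)
    return g * img_feat + (1.0 - g) * txt_feat

# Toy example with random 8-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
d = 8
img = rng.standard_normal(d)                 # stand-in image embedding
txt = rng.standard_normal(d)                 # stand-in text embedding
W = rng.standard_normal((2 * d, d)) * 0.1    # hypothetical gate weights
b = np.zeros(d)
fused = gated_fusion(img, txt, W, b)
print(fused.shape)  # (8,)
```

In a trained model, W and b would be learned jointly with the encoders; here they are random placeholders that only demonstrate the data flow.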

Key words: CLIP, multimodal classification, image-text feature fusion, attention mechanism, multi-scale feature extraction, fine-grained features, LSTM