• Publication of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)



A classification model of TCM kidney disease based on CLIP and multimodal data fusion

ZHANG Dongyan, ZHANG Lanxiang, WU Chenxu, WANG Lifan, LAN Chengying, DING Xin, YUE Qi

(1. College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China;
    2. Heilongjiang Academy of Traditional Chinese Medicine, Harbin 150040, China)


Abstract: To address diagnostic inaccuracies in traditional Chinese medicine (TCM) nephropathy caused by single-modal data and similar clinical symptoms, this study proposes MLC-CLIP, a TCM nephropathy diagnosis model based on CLIP and multimodal data fusion. The model systematically enhances the CLIP framework by introducing ResNet50 to improve local-feature and hierarchical-information extraction, designing a multi-scale feature extraction module to strengthen CLIP's image encoder, and employing an LSTM to deepen semantic understanding of TCM clinical texts and optimize the text encoding structure. Additionally, a weighted feature fusion module and a cross-modal gated attention feature fusion module are developed and combined with a stepwise fusion strategy to optimize multimodal feature integration. Experimental results on the nephropathy dataset from the Heilongjiang Academy of Traditional Chinese Medicine show that the model achieves accuracy, precision, recall, and F1-score of 94.94%, 95.07%, 94.89%, and 94.98% on the multimodal classification task, representing increases of 10.68%, 9.31%, 9.58%, and 9.45% over the original CLIP model. The results confirm that MLC-CLIP effectively integrates multimodal information from tongue images and clinical texts, demonstrates stronger discriminative capability on complex cases with similar symptoms, and thus provides reliable decision support for TCM nephropathy diagnosis.
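The abstract names a cross-modal gated attention feature fusion module but gives no equations, so the following NumPy sketch shows only one common form of such gating: a sigmoid gate computed from the concatenated image and text embeddings that blends the two modalities per dimension. The function name, dimensions, and gating formula here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(img_feat, txt_feat, W, b):
    """Blend image and text embeddings with a learned per-dimension gate.

    The gate g is computed from the concatenated features; each output
    dimension is a convex combination g * img + (1 - g) * txt, so the
    fused vector stays between the two modality features elementwise.
    """
    z = np.concatenate([img_feat, txt_feat], axis=-1)  # shape (2d,)
    g = sigmoid(z @ W + b)                             # gate in (0, 1), shape (d,)
    return g * img_feat + (1.0 - g) * txt_feat

# Toy example with random 8-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
d = 8
img = rng.standard_normal(d)                 # stand-in image embedding
txt = rng.standard_normal(d)                 # stand-in text embedding
W = rng.standard_normal((2 * d, d)) * 0.1    # hypothetical gate weights
b = np.zeros(d)
fused = gated_fusion(img, txt, W, b)
print(fused.shape)  # (8,)
```

In a trained model, W and b would be learned jointly with the encoders; here they are random placeholders that only demonstrate the data flow.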

Key words: CLIP, multimodal classification, image-text feature fusion, attention mechanism, multi-scale feature extraction, fine-grained features, LSTM