• A journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (3): 456-466.

• Graphics and Image •

  • Supported by: National Natural Science Foundation of China (42375145)

An end-to-end visual multi-task learning model with task prompt fusion

GENG Huantong, FAN Zichen, JIANG Jun, LIU Zhenyu, LI Jiaxing

  (1. School of Computer Science (School of Cyber Science and Engineering),
    Nanjing University of Information Science & Technology, Nanjing 210044, China;
    2. School of Software, Nanjing University of Information Science & Technology, Nanjing 210044, China)

  • Received: 2024-03-22  Revised: 2024-08-05  Online: 2026-03-25  Published: 2026-03-25


Abstract: To address the problems of separated network structures and inter-task interference in existing visual multi-task learning models, an end-to-end multi-task learning model based on triple feature embedding and task prompt fusion is proposed. In the image embedding and encoding stage, three distinct encoding modules capture three kinds of native image features, fully preserving global, local, and contour information; this enriches the structure and semantics of the embedding vectors and lets the model draw on image information across different feature dimensions. In the feature extraction stage, to achieve unified end-to-end task-generic learning, task-specific learning, and cross-task interaction, spatial-channel prompt learning modules and prompt fusion modules extract salient features, trends, and raw information from both the image and the task prompts. This strengthens the expressiveness and guiding capability of the task prompts and enables a more complete extraction of global and local features from the image and the task prompts. Experimental results show that, compared with single-task state-of-the-art (SOTA) models, the mDS and RMSE metrics improve by 3.36 and 2.41 percentage points, respectively; compared with multi-task SOTA models, these two metrics improve by 1.69 and 0.32 percentage points, respectively, and mIoU improves by 0.99 percentage points. The model thus offers a new approach to multi-task learning.
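The triple feature embedding and task prompt fusion described above can be sketched, in heavily simplified form, as follows. This is an illustrative NumPy toy under stated assumptions, not the paper's architecture: the three encoder branches are stand-ins (whole-image statistics, patch means, and gradient magnitudes for the global, local, and contour features), and `prompt_fusion` replaces the spatial-channel prompt learning module with simple per-task softmax gating; all function names and shapes here are hypothetical.

```python
import numpy as np

def global_embed(img):
    # Global branch (stand-in for a global attention encoder):
    # whole-image statistics.
    return np.array([img.mean(), img.std()])

def local_embed(img, patch=4):
    # Local branch (stand-in for a convolutional encoder): per-patch means.
    h, w = img.shape
    img = img[:h - h % patch, :w - w % patch]
    blocks = img.reshape(img.shape[0] // patch, patch,
                         img.shape[1] // patch, patch)
    return blocks.mean(axis=(1, 3)).ravel()

def contour_embed(img):
    # Contour branch (stand-in for an edge-aware encoder):
    # mean finite-difference gradient magnitudes.
    gy, gx = np.gradient(img)
    return np.array([np.abs(gx).mean(), np.abs(gy).mean()])

def triple_embed(img):
    # Triple feature embedding: concatenate the three feature views so the
    # shared representation keeps global, local, and contour information.
    return np.concatenate([global_embed(img), local_embed(img), contour_embed(img)])

def prompt_fusion(shared, prompts):
    # Task prompt fusion (toy version): each task's learnable prompt vector
    # gates the shared feature channels via a softmax, producing one
    # task-conditioned representation per task.
    gates = np.exp(prompts - prompts.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)
    return gates * shared  # shape: (n_tasks, feature_dim)

rng = np.random.default_rng(0)
img = rng.random((8, 8))
shared = triple_embed(img)                   # 2 global + 4 local + 2 contour = 8 dims
prompts = rng.normal(size=(3, shared.size))  # one prompt per task: detection,
                                             # segmentation, depth estimation
fused = prompt_fusion(shared, prompts)
print(shared.shape, fused.shape)             # (8,) (3, 8)
```

In a real model the gating weights (here, the task prompts) would be learned jointly with the encoders, so that each task emphasizes the feature channels most useful to it while all tasks share one backbone end to end.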

Key words: multi-task learning, Transformer architecture, 3D object detection, semantic segmentation, depth estimation