Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (3): 456-466.

• Graphics and Images •

An end-to-end visual multi-task learning model with task prompt fusion

GENG Huantong, FAN Zichen, JIANG Jun, LIU Zhenyu, LI Jiaxing

  (1. School of Computer Science (School of Cyber Science and Engineering),
    Nanjing University of Information Science & Technology, Nanjing 210044, China;
    2. School of Software, Nanjing University of Information Science & Technology, Nanjing 210044, China)

  • Received: 2024-03-22  Revised: 2024-08-05  Online: 2026-03-25  Published: 2026-03-25

Abstract: To address the issues of separated network structures and inter-task interference in existing visual multi-task learning models, an end-to-end visual multi-task learning model based on triple feature embedding and task prompt fusion is proposed. In the image embedding and encoding stage, three distinct encoding modules capture three types of features from the original image, fully preserving global, local, and contour features. This enriches the structural and semantic information of the embedding vectors and gives the model access to image information across different feature dimensions. In the feature extraction stage, to unify general-task learning, task-specific learning, and cross-task interaction in a single end-to-end framework, spatial-channel prompt learning modules and prompt fusion modules extract salient features, trends, and raw information from both the image and the task prompts. This strengthens the expressiveness and guiding capability of the task prompts, allowing more comprehensive extraction of global and local features from both the image and the task prompts. Experimental results show that, compared with single-task state-of-the-art (SOTA) models, the mDS and RMSE metrics improve by 3.36 and 2.41 percentage points, respectively. Compared with multi-task SOTA models, they improve by 1.69 and 0.32 percentage points, respectively, with mIoU improving by 0.99 percentage points. This provides a novel solution for multi-task learning.
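
To make the described pipeline concrete, the sketch below mocks up the two ideas the abstract names: a triple feature embedding (parallel global, local, and contour branches) and task prompt fusion, here realized with cross-attention between learnable prompts and image tokens. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation; every module name, layer choice, and dimension (TripleEmbedding, PromptFusion, dim=64, four prompts per task) is hypothetical.

    # Hypothetical sketch of triple feature embedding + task prompt fusion.
    # All names, layers, and sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class TripleEmbedding(nn.Module):
        """Three parallel encoders for global, local, and contour features."""
        def __init__(self, in_ch=3, dim=64):
            super().__init__()
            # Global branch: large receptive field via a big strided kernel.
            self.global_enc = nn.Conv2d(in_ch, dim, kernel_size=7, stride=4, padding=3)
            # Local branch: small kernel for fine-grained detail.
            self.local_enc = nn.Conv2d(in_ch, dim, kernel_size=3, stride=4, padding=1)
            # Contour branch: depthwise filtering (intended for contour cues),
            # 1x1 projection, then pooling to match the other branches' stride.
            self.contour_enc = nn.Sequential(
                nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
                nn.Conv2d(in_ch, dim, kernel_size=1),
                nn.AvgPool2d(4),
            )
            # Fuse the three feature types into one embedding.
            self.fuse = nn.Conv2d(3 * dim, dim, kernel_size=1)

        def forward(self, x):
            feats = [self.global_enc(x), self.local_enc(x), self.contour_enc(x)]
            return self.fuse(torch.cat(feats, dim=1))

    class PromptFusion(nn.Module):
        """Fuse learnable task prompts with image tokens via cross-attention."""
        def __init__(self, dim=64, num_tasks=3, prompts_per_task=4, heads=4):
            super().__init__()
            self.prompts = nn.Parameter(torch.randn(num_tasks, prompts_per_task, dim))
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, tokens, task_id):
            # tokens: (B, N, dim); the selected task's prompts query the image.
            b = tokens.size(0)
            q = self.prompts[task_id].unsqueeze(0).expand(b, -1, -1)
            fused, _ = self.attn(q, tokens, tokens)
            # Prepend fused prompts so downstream blocks see task guidance.
            return torch.cat([fused, tokens], dim=1)

    if __name__ == "__main__":
        img = torch.randn(2, 3, 224, 224)
        feat = TripleEmbedding()(img)             # (2, 64, 56, 56)
        tokens = feat.flatten(2).transpose(1, 2)  # (2, 3136, 64)
        out = PromptFusion()(tokens, task_id=0)
        print(out.shape)                          # torch.Size([2, 3140, 64])

In this sketch each task (e.g. detection, segmentation, depth estimation) owns its own prompt vectors, so switching task_id switches the guidance injected into the shared token stream; how the paper's spatial-channel prompt learning modules actually realize this is not specified in the abstract.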

Key words: multi-task learning, Transformer architecture, 3D object detection, semantic segmentation, depth estimation