Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (3): 456-466.

• Graphics and Images •

An end-to-end visual multi-task learning model with task prompt fusion

GENG Huantong, FAN Zichen, JIANG Jun, LIU Zhenyu, LI Jiaxing

  (1. School of Computer Science (School of Cyber Science and Engineering),
    Nanjing University of Information Science & Technology, Nanjing 210044, China;
    2. School of Software, Nanjing University of Information Science & Technology, Nanjing 210044, China)

  • Received: 2024-03-22  Revised: 2024-08-05  Online: 2026-03-25  Published: 2026-03-25

Abstract: To address the issues of separated network structures and inter-task interference in existing visual multi-task learning models, an end-to-end visual multi-task learning model based on triple feature embedding and task prompt fusion is proposed. In the image embedding and encoding stage, three distinct encoding modules capture three types of features from the original image, fully preserving global, local, and contour features. This enriches the structural and semantic information of the embedding vectors and gives the model access to image information across different feature dimensions. In the feature extraction stage, to unify general-task learning, task-specific learning, and cross-task interaction in a single end-to-end framework, spatial-channel prompt learning modules and prompt fusion modules extract salient features, trends, and raw information from both the image and the task prompts. This strengthens the expressiveness and guiding capability of the task prompts, allowing more comprehensive extraction of global and local features from both the image and the task prompts. Experimental results show that, compared with single-task state-of-the-art (SOTA) models, the mDS and RMSE metrics improve by 3.36 and 2.41 percentage points, respectively. Compared with multi-task SOTA models, they improve by 1.69 and 0.32 percentage points, respectively, with mIoU improving by 0.99 percentage points. This provides a novel solution for multi-task learning.
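
To make the described pipeline concrete, the sketch below mocks up the two ideas the abstract names: a triple feature embedding (parallel global, local, and contour branches) and task prompt fusion, here realized with cross-attention between learnable prompts and image tokens. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation; every module name, layer choice, and dimension (TripleEmbedding, PromptFusion, dim=64, four prompts per task) is hypothetical.

    # Hypothetical sketch of triple feature embedding + task prompt fusion.
    # All names, layers, and sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class TripleEmbedding(nn.Module):
        """Three parallel encoders for global, local, and contour features."""
        def __init__(self, in_ch=3, dim=64):
            super().__init__()
            # Global branch: large receptive field via a big strided kernel.
            self.global_enc = nn.Conv2d(in_ch, dim, kernel_size=7, stride=4, padding=3)
            # Local branch: small kernel for fine-grained detail.
            self.local_enc = nn.Conv2d(in_ch, dim, kernel_size=3, stride=4, padding=1)
            # Contour branch: depthwise filtering (intended for contour cues),
            # 1x1 projection, then pooling to match the other branches' stride.
            self.contour_enc = nn.Sequential(
                nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
                nn.Conv2d(in_ch, dim, kernel_size=1),
                nn.AvgPool2d(4),
            )
            # Fuse the three feature types into one embedding.
            self.fuse = nn.Conv2d(3 * dim, dim, kernel_size=1)

        def forward(self, x):
            feats = [self.global_enc(x), self.local_enc(x), self.contour_enc(x)]
            return self.fuse(torch.cat(feats, dim=1))

    class PromptFusion(nn.Module):
        """Fuse learnable task prompts with image tokens via cross-attention."""
        def __init__(self, dim=64, num_tasks=3, prompts_per_task=4, heads=4):
            super().__init__()
            self.prompts = nn.Parameter(torch.randn(num_tasks, prompts_per_task, dim))
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, tokens, task_id):
            # tokens: (B, N, dim); the selected task's prompts query the image.
            b = tokens.size(0)
            q = self.prompts[task_id].unsqueeze(0).expand(b, -1, -1)
            fused, _ = self.attn(q, tokens, tokens)
            # Prepend fused prompts so downstream blocks see task guidance.
            return torch.cat([fused, tokens], dim=1)

    if __name__ == "__main__":
        img = torch.randn(2, 3, 224, 224)
        feat = TripleEmbedding()(img)             # (2, 64, 56, 56)
        tokens = feat.flatten(2).transpose(1, 2)  # (2, 3136, 64)
        out = PromptFusion()(tokens, task_id=0)
        print(out.shape)                          # torch.Size([2, 3140, 64])

In this sketch each task (e.g. detection, segmentation, depth estimation) owns its own prompt vectors, so switching task_id switches the guidance injected into the shared token stream; how the paper's spatial-channel prompt learning modules actually realize this is not specified in the abstract.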

Key words: multi-task learning, Transformer architecture, 3D object detection, semantic segmentation, depth estimation