• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (11): 2019-2028.

• Artificial Intelligence and Data Mining •

A multi-view 3D perception network based on spatio-temporal fusion

LI He, CHEN Pintong, YU Rong, TAN Beihai

  (1. School of Automation, Guangdong University of Technology, Guangzhou 510006;
   2. School of Future Technology, South China University of Technology, Guangzhou 510641;
   3. School of Integrated Circuits, Guangdong University of Technology, Guangzhou 510006, China)
  • Received:2024-02-05 Revised:2024-10-13 Online:2025-11-25 Published:2025-12-08

Abstract: As a critical component of autonomous driving, the perception system directly influences a vehicle’s understanding of its surrounding environment and serves as the foundation for safe and reliable autonomous driving. Traditional 2D image detection provides only limited information. While 3D perception offers richer perceptual data, it faces key challenges, including insufficient fusion of spatial information and inadequate utilization of temporal information. This paper proposes a multi-view 3D perception network that integrates spatio-temporal information. The network comprises a multi-view surround 3D perception network and a spatio-temporal fusion network, MVSPNet. The multi-view surround perception network efficiently fuses multi-camera image data through precise spatial perspective transformation, constructing a unified bird’s-eye-view (BEV) representation and achieving spatial alignment and fusion of data from multiple cameras. Compared with the advanced monocular baseline model FCOS3D, it achieves a mean average precision (mAP) of 0.343, a performance improvement of 14.7%. The spatio-temporal fusion network MVSPNet performs temporal fusion of multi-view images by integrating multi-frame data, which further improves performance significantly: fusing 2 frames of temporal data yields an additional mAP improvement of 10.2%. The experimental results demonstrate that the designed network effectively fuses multi-view spatial information with temporal information. This study provides an effective solution for enhancing the 3D perception of autonomous driving systems in dynamic and complex scenarios, and is of significance for advancing the development of safe and reliable autonomous driving technology.
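The abstract describes two fusion steps: projecting multi-camera features into a shared BEV grid, then averaging temporally aligned BEV maps across frames. The paper's actual network is not given here, so the following is only a minimal numpy sketch of those two ideas; the function names, tensor shapes, and the precomputed `bev_index` lookup (standing in for the perspective transformation derived from camera intrinsics/extrinsics) are all hypothetical.

```python
import numpy as np

def project_to_bev(cam_feats, bev_index, bev_shape):
    """Scatter per-camera image features into one shared BEV grid.

    cam_feats: (n_cams, n_pixels, channels) flattened image features.
    bev_index: (n_cams, n_pixels) precomputed BEV cell index per pixel
               (stand-in for the spatial perspective transform); -1 marks
               pixels that fall outside the BEV range.
    Returns a (H, W, channels) BEV map, averaging overlapping views.
    """
    n_cams, _, channels = cam_feats.shape
    n_cells = bev_shape[0] * bev_shape[1]
    bev = np.zeros((n_cells, channels))
    count = np.zeros(n_cells)
    for cam in range(n_cams):
        valid = bev_index[cam] >= 0
        # accumulate features from every camera into the shared grid
        np.add.at(bev, bev_index[cam][valid], cam_feats[cam][valid])
        np.add.at(count, bev_index[cam][valid], 1)
    bev /= np.maximum(count, 1)[:, None]  # average where cameras overlap
    return bev.reshape(bev_shape[0], bev_shape[1], channels)

def fuse_temporal(bev_t, bev_prev, alpha=0.5):
    """Naive 2-frame temporal fusion: weighted average of aligned BEV maps."""
    return alpha * bev_t + (1 - alpha) * bev_prev
```

In the real network the previous frame's BEV map would first be warped by the ego-motion between frames before fusing; the plain average above assumes the two maps are already aligned.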


Key words: autonomous driving perception, multi-view perception, object detection, 3D perception