• A journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

计算机工程与科学 (Computer Engineering & Science), 2025, Vol. 47, Issue (11): 2019-2028.

• Artificial Intelligence and Data Mining •


A multi-view 3D perception network based on spatio-temporal fusion

LI He, CHEN Pintong, YU Rong, TAN Beihai

  (1. School of Automation, Guangdong University of Technology, Guangzhou 510006, China;
   2. School of Future Technology, South China University of Technology, Guangzhou 510641, China;
   3. School of Integrated Circuits, Guangdong University of Technology, Guangzhou 510006, China)
  • Received: 2024-02-05  Revised: 2024-10-13  Online: 2025-11-25  Published: 2025-12-08
  • Supported by: the National Natural Science Foundation of China (U22A2054) and the National Key R&D Program of China (2020YFB1807802)


Abstract: As a critical component of autonomous driving, the perception system directly influences a vehicle's comprehension of its surrounding environment and is the foundation for safe and reliable autonomous driving. Compared with the limited information provided by traditional 2D image-based detection, 3D perception offers richer perceptual data, but it faces two key challenges: insufficient fusion of spatial information and inadequate utilization of temporal information. This paper proposes a multi-view 3D perception network that integrates spatio-temporal information, comprising a multi-view surround 3D perception network and a spatio-temporal fusion network, MVSPNet. The multi-view surround perception network efficiently fuses multi-camera image data through precise spatial perspective transformation, constructing a unified bird's-eye view (BEV) spatial representation and thereby achieving spatial alignment and fusion of data from multiple cameras. Compared with the advanced monocular baseline FCOS3D, the proposed network achieves a mean average precision (mAP) of 0.343, a relative improvement of 14.7%. The spatio-temporal fusion network MVSPNet enables temporal fusion of multi-view images by integrating multi-frame data, which further improves performance significantly: fusing two frames of temporal data yields an additional 10.2% mAP improvement. The experimental results demonstrate the designed network's effectiveness in fusing multi-view spatial information with temporal information. This study provides an effective solution for improving the 3D perception of autonomous driving systems in dynamic, complex scenarios and is of significant value for advancing safe and reliable autonomous driving technology.


Key words: autonomous driving perception, multi-view perception, object detection, 3D perception
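
The abstract describes two mechanisms at a high level: lifting multi-camera image features into a shared BEV grid via spatial perspective transformation, and fusing BEV features across consecutive frames. The paper's implementation details are not given on this page, so the following is only a minimal PyTorch sketch of those two generic ideas, not the authors' MVSPNet. Every module name, tensor shape, and hyperparameter here (CameraToBEV, TemporalBEVFusion, the 128x128 grid, externally supplied depth-lifted ego-frame points) is an illustrative assumption.

```python
# Minimal sketch of (1) scattering multi-camera features into a shared
# bird's-eye-view (BEV) grid and (2) fusing BEV maps across two frames.
# All names and shapes are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn

class CameraToBEV(nn.Module):
    """Accumulate per-camera image features into a shared BEV grid.

    Pixel features are assumed to have been lifted to 3D points in the
    ego frame beforehand (e.g. via depth estimates + camera extrinsics).
    """
    def __init__(self, bev_size=128, bev_range=50.0):
        super().__init__()
        self.bev_size = bev_size    # grid resolution (cells per side)
        self.bev_range = bev_range  # metres covered in +/- x and y

    def forward(self, feats, points_ego):
        # feats:      (N_cam, C, H, W) image features
        # points_ego: (N_cam, H, W, 3) 3D points in the ego frame
        n, c, h, w = feats.shape
        bev = feats.new_zeros(c, self.bev_size, self.bev_size)
        # Convert metric x/y coordinates to integer BEV cell indices.
        xy = points_ego[..., :2]
        idx = ((xy / self.bev_range + 1) * 0.5 * (self.bev_size - 1)).long()
        idx = idx.clamp(0, self.bev_size - 1)
        flat_idx = (idx[..., 1] * self.bev_size + idx[..., 0]).reshape(n, -1)
        flat_feat = feats.reshape(n, c, -1)
        for cam in range(n):  # scatter-add each camera's features into the grid
            bev.view(c, -1).index_add_(1, flat_idx[cam], flat_feat[cam])
        return bev

class TemporalBEVFusion(nn.Module):
    """Fuse the current BEV map with an ego-motion-aligned previous one."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=3, padding=1)

    def forward(self, bev_t, bev_prev):
        # Both maps are assumed already warped into the current ego frame.
        stacked = torch.cat([bev_t, bev_prev], dim=0).unsqueeze(0)
        return self.fuse(stacked).squeeze(0)

# Toy usage: 6 surround cameras, 2 consecutive frames.
cams, C, H, W = 6, 64, 32, 56
proj = CameraToBEV()
fuse = TemporalBEVFusion(feat_dim=C)
pts = torch.rand(cams, H, W, 3) * 100 - 50   # fake ego-frame points in [-50, 50)
bev_t = proj(torch.randn(cams, C, H, W), pts)
bev_prev = proj(torch.randn(cams, C, H, W), pts)
fused = fuse(bev_t, bev_prev)
print(fused.shape)  # torch.Size([64, 128, 128])
```

In a real system the ego-frame points would come from per-pixel depth prediction combined with camera intrinsics and extrinsics, and the previous frame's BEV map would be warped by the measured ego-motion before fusion; both are treated as given in this sketch.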