• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (09): 1616-1624.

• Graphics and Images •



  • Funding:
    National Natural Science Foundation of China (62171145)

Mobile monocular depth estimation based on multi-scale feature fusion

CHEN Lei¹, LIANG Zheng-you¹,², SUN Yu¹, CAI Jun-min¹

  1. School of Computer and Electronics Information, Guangxi University, Nanning 530004, China
  2. Guangxi Key Laboratory of Multimedia Communications and Network Technology, Nanning 530004, China
  • Received: 2023-05-10; Revised: 2023-11-21; Accepted: 2024-09-25; Online: 2024-09-25; Published: 2024-09-23


Abstract: Current depth estimation models based on deep learning have large parameter counts, which makes them difficult to deploy on mobile devices. To address this issue, a lightweight depth estimation method with multi-scale feature fusion that can be deployed on mobile devices is proposed. First, MobileNetV2 is used as the backbone to extract features at four scales. Then, skip connection paths are constructed from the encoder to the decoder to fuse the features of the four scales, making full use of both the positional information in the lower layers and the semantic information in the higher layers. Finally, the fused features are passed through convolutional layers to produce a high-precision depth map. The model was trained and tested on NYU Depth Dataset V2. Experimental results show that, with only 1.6×10⁶ parameters, the model reaches a δ₁ score of 0.812, and that inferring a single image takes only 0.094 s on the Kirin 980 CPU of a mobile device, demonstrating its practical value.

Key words: deep learning, depth estimation, multi-scale feature, lightweight network, mobile terminal model
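
The abstract reports the evaluation index δ₁ without defining it. The sketch below assumes the threshold-accuracy definition commonly used for NYU Depth Dataset V2 benchmarks: the fraction of valid pixels for which max(pred/gt, gt/pred) is below 1.25. The function name `delta1`, the flat-list input format, and the sample values are hypothetical and are not taken from the paper.

```python
def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels whose depth ratio max(p/g, g/p) is below threshold.

    Assumed definition (not stated in the abstract): the usual
    threshold-accuracy metric with threshold 1.25.
    """
    hits = 0
    valid = 0
    for p, g in zip(pred, gt):
        # Skip pixels without a usable depth value (non-positive depth).
        if p <= 0 or g <= 0:
            continue
        valid += 1
        if max(p / g, g / p) < threshold:
            hits += 1
    if valid == 0:
        raise ValueError("no valid pixels to evaluate")
    return hits / valid


# Hypothetical flattened depth maps in metres (illustrative values only).
predicted = [1.0, 2.0, 3.0, 4.0]
ground_truth = [1.1, 2.0, 4.0, 4.5]
score = delta1(predicted, ground_truth)
print(score)
```

Under this assumed definition, a δ₁ of 0.812 would mean that about 81% of evaluated pixels have a predicted depth within a factor of 1.25 of the ground truth.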