• A publication of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (02): 347-353.

• Graphics and Image •

• Supported by: National Natural Science Foundation of China (61371156)

Object detection based on multi-scale feature fusion and residual attention mechanism

LI Ben-gao,WU Cong-zhong,XU Liang-feng,ZHAN Shu   

  1. (School of Computer and Information,Hefei University of Technology,Hefei 231009,China)
  • Received:2020-02-21 Revised:2020-04-27 Accepted:2021-02-25 Online:2021-02-25 Published:2021-02-23



Abstract: As a multi-task learning process, object detection requires better features than classification does. Detectors that predict objects of different scales from multi-scale features have greatly surpassed detectors based on single-scale features. In addition, the feature pyramid structure is used to build high-level semantic feature maps at all scales, further improving detector performance. However, such feature maps do not fully exploit the complementary role of contextual information in enriching semantics. Based on the SSD baseline network, a feature fusion method with a residual attention mechanism is used to make full use of contextual information. Feature fusion enhances the representational power of high-resolution features, which is particularly helpful for detecting small-scale objects, while the residual attention mechanism strengthens the key features required for prediction. Evaluated on the PASCAL VOC benchmark dataset, the model achieves an mAP of 78.8% with a 300×300 input and 80.7% with a 512×512 input.
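The fusion-then-reweight idea described in the abstract can be sketched in a few lines. This is a minimal NumPy illustration under assumed simplifications, not the paper's actual architecture: the shallow high-resolution map and an upsampled deep map are fused element-wise, and a residual attention term (here a channel-pooled sigmoid mask) is added on top of the identity path, so attended features amplify rather than replace the fused ones. The function name and the mask construction are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # numerically plain logistic function, fine for this toy range
    return 1.0 / (1.0 + np.exp(-z))

def residual_attention_fuse(shallow, deep_upsampled):
    """Sketch (assumed form, not the paper's exact design): fuse a
    shallow high-resolution feature map with an upsampled deep map,
    then apply attention in residual form, out = fused * (1 + attn)."""
    fused = shallow + deep_upsampled                    # element-wise multi-scale fusion
    attn = sigmoid(fused.mean(axis=0, keepdims=True))   # channel-pooled mask in (0, 1)
    return fused * (1.0 + attn)                         # identity path + attention-weighted path

# toy feature maps laid out as (channels, height, width)
shallow = np.ones((4, 8, 8))
deep_up = np.full((4, 8, 8), 0.5)
out = residual_attention_fuse(shallow, deep_up)
print(out.shape)  # (4, 8, 8)
```

Because the attention mask lies in (0, 1), the residual form guarantees the output never falls below the fused features, which mirrors the stated goal of strengthening, rather than gating away, the key features used for prediction.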


Key words: object detection, feature fusion, attention mechanism, multi-scale feature, contextual information