• A journal of the China Computer Federation (CCF)
  • Chinese science and technology core journal
  • Chinese core journal

计算机工程与科学 (Computer Engineering & Science) ›› 2022, Vol. 44 ›› Issue (01): 84-91.

• Graphics and Image •

Image caption generation based on residual dense hierarchical information

WANG Xi, ZHANG Kai, LI Jun-hui, KONG Fang

  1. (School of Computer Science and Technology, Soochow University, Suzhou 215006, China)

  • Received: 2020-08-26  Revised: 2020-11-09  Accepted: 2022-01-25  Online: 2022-01-25  Published: 2022-01-13
  • Supported by: National Natural Science Foundation of China (61876120)

Abstract: The current mainstream approach to image caption generation relies on deep neural networks, especially models built on the self-attention mechanism. However, the layers of a conventional deep network are stacked linearly, so the information captured by low-level layers is not reflected in high-level layers and therefore goes underused. This paper proposes a residual dense network approach that exploits hierarchical semantic information to generate high-quality image captions. First, to make full use of the network's hierarchical information and to extract the local features of each layer in a deep network, LayerRDense (Layer Residual Dense) is proposed, which establishes residual dense connections between layers. Second, SubRDense (Sublayer Residual Dense) is proposed, which applies a residual dense network within the sublayers of each decoder layer to better fuse image features with caption information. Experimental results on the MSCOCO 2014 dataset show that both LayerRDense and SubRDense further improve the performance of image caption generation.

Key words: image caption, self-attention mechanism, residual dense network
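The layer-wise residual dense connection scheme the abstract describes can be sketched in a few lines. This is a minimal illustrative toy, not the paper's implementation: the linear "layers", fusion matrices, and dimensions below are invented stand-ins for the Transformer layers and 1×1-convolution-style fusion that residual dense networks typically use. The key idea it shows is that each layer receives a fusion of all earlier layer outputs, and the block output adds a global residual from its input.

```python
import numpy as np

def layer(x, W):
    # One hypothetical "encoder layer": linear map + ReLU, standing in for
    # a self-attention block in the actual model.
    return np.maximum(0.0, x @ W)

def residual_dense_stack(x0, weights, fusers):
    """LayerRDense-style sketch: each layer sees a learned fusion of ALL
    earlier layer outputs (dense connectivity), and the stack output adds
    a residual connection from the block input."""
    states = [x0]
    for W, F in zip(weights, fusers):
        # Concatenate every earlier state, then project back to width d
        # (the 1x1-convolution-like local feature fusion).
        fused = np.concatenate(states, axis=-1) @ F
        states.append(layer(fused, W))
    return states[-1] + x0  # global residual connection

rng = np.random.default_rng(0)
d, n_layers = 8, 3
x = rng.standard_normal((2, d))
weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_layers)]
# Fusion matrix for layer i maps the concatenation of i+1 states back to d.
fusers = [rng.standard_normal((d * (i + 1), d)) * 0.1 for i in range(n_layers)]

y = residual_dense_stack(x, weights, fusers)
print(y.shape)  # (2, 8)
```

In a plain linearly stacked network, `states` would hold only the previous layer's output; the dense variant keeps the whole list, which is how low-level features stay visible to high-level layers.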