• Journal of the China Computer Federation
• Chinese Science and Technology Core Journal
• Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (09): 1602-1610.

• Graphics and Images

An unsupervised video summarization algorithm based on deep and shallow feature fusion

ZENG Fan-feng, WANG Chun-zhen, LI Chen

  1. (School of Information, North China University of Technology, Beijing 100144, China)
  • Received: 2022-06-09 Revised: 2022-10-18 Accepted: 2023-09-25 Online: 2023-09-25 Published: 2023-09-12
  • Supported by:
    Young Top-notch Talent Cultivation Program of Beijing Municipal Universities (CIT&TCD201904009)

Abstract: Existing unsupervised video summarization algorithms judge the importance of video frames inaccurately. To address this problem, an unsupervised video summarization algorithm based on deep and shallow feature fusion is proposed. The deep features of each video frame are extracted by a convolutional neural network (CNN), while the shallow features are first extracted by the Speeded-Up Robust Features (SURF) operator and then encoded with a bag-of-words (BOW) model. The deep and shallow features are then fused to enrich the information in the feature descriptor, and the fused descriptor serves as the input to the network model. A bidirectional long short-term memory (BiLSTM) network models the temporal information and outputs frame importance scores, and the model is optimized with reinforcement learning. To generate static video summaries, a keyframe selection method based on local maxima is designed, which follows the temporal structure of the original video while avoiding redundancy. In comparisons with several unsupervised video summarization algorithms on the SumMe and TVSum datasets, the experimental results show that the proposed algorithm judges video content more accurately and generates higher-quality summaries.

Key words: video summarization, feature fusion, bidirectional long short-term memory (BiLSTM) network, reinforcement learning, local maximum
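
The abstract names a keyframe selection method based on local maxima but does not give its exact rule, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes per-frame importance scores are already available as a 1-D sequence (for example, the output of a scoring network). The function name `select_keyframes` and the `min_gap` parameter are hypothetical and introduced here for illustration.

```python
import numpy as np


def select_keyframes(scores, min_gap=1):
    """Pick frame indices whose score is a strict local maximum.

    Assumption (not from the paper): when two peaks are closer than
    `min_gap` frames, only the higher-scoring one is kept, as a simple
    stand-in for redundancy removal.
    """
    scores = np.asarray(scores, dtype=float)
    if scores.size < 3:
        return []  # no interior frame can be a local maximum

    # A frame is a peak if it scores higher than both neighbours.
    interior = scores[1:-1]
    is_peak = (interior > scores[:-2]) & (interior > scores[2:])
    peaks = np.flatnonzero(is_peak) + 1  # shift back to frame indices

    kept = []
    for idx in peaks:  # peaks are in ascending (temporal) order
        if kept and idx - kept[-1] < min_gap:
            # Too close to the previous keyframe: keep the stronger peak.
            if scores[idx] > scores[kept[-1]]:
                kept[-1] = int(idx)
        else:
            kept.append(int(idx))
    return kept


example = [0.1, 0.5, 0.2, 0.3, 0.9, 0.4, 0.6, 0.2]
print(select_keyframes(example))             # [1, 4, 6]
print(select_keyframes(example, min_gap=3))  # [1, 4]
```

Because the peaks are scanned in frame order, the selected keyframes preserve the temporal order of the original video, and the gap check drops nearby, lower-scoring peaks. Flat plateaus produce no strict maximum in this sketch; a real implementation would need a rule for them.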