计算机工程与科学 (Computer Engineering & Science) ›› 2025, Vol. 47 ›› Issue (02): 298-307.

• Graphics & Images •

Focusing paradigm prompt learning of segment anything for unsupervised video object segmentation

SHEN Yonghui1,2,3, BU Dongxu1,2,3, ZHANG Shengyu1,2,3, SONG Huihui1,2,3

  (1. School of Automation, Nanjing University of Information Science & Technology, Nanjing 210044, China;
    2. Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing 210044, China;
    3. Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing 210044, China)

Abstract: Unsupervised video object segmentation aims to automatically locate and segment the primary objects in video frames during the testing phase. Most current models rely on appearance cues extracted from RGB images and motion cues extracted from optical-flow maps to segment objects. However, object occlusion, rapid motion, or stillness can leave gaps in the optical-flow information, and the limited cues available from the appearance branch alone are then insufficient for good segmentation. To address this problem, this paper proposes a focused-learning network, FPLNet, which introduces an additional dual-branch structure to capture the position and contour information of the main objects, compensating for the missing optical-flow information. First, the model uses the backbone network of the segment anything model (SAM) to extract appearance and motion information, which improves its generalization. Then, two additional segmentation branches, coarse-grained and fine-grained, together serve as the prompt part of the focused-learning network. In the decoding stage, RGB appearance information, optical-flow motion information, coarse-grained features, and fine-grained features are fused progressively, mimicking how the human visual system focuses on target features. Extensive experiments on three standard datasets show that the proposed model outperforms existing models.
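
The progressive fusion described in the abstract (appearance, then motion, then coarse-grained and fine-grained prompt features) can be illustrated with a short PyTorch sketch. This is a minimal rendering of the fusion order only; the class names (FPLNetSketch, FusionBlock), channel widths, and the single convolution standing in for SAM's image encoder are all assumptions made here for illustration, not the paper's actual implementation.

    # Minimal PyTorch sketch of the fusion order described in the abstract.
    # All module names, channel sizes, and layer choices are illustrative
    # assumptions; they do not reproduce the paper's actual FPLNet.
    import torch
    import torch.nn as nn

    class FusionBlock(nn.Module):
        """Fuse a motion/prompt feature into the running decoder feature."""
        def __init__(self, channels):
            super().__init__()
            self.mix = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            )

        def forward(self, feat, extra):
            return self.mix(torch.cat([feat, extra], dim=1))

    class FPLNetSketch(nn.Module):
        def __init__(self, channels=64):
            super().__init__()
            # Stand-in for the SAM image encoder, shared by both inputs.
            self.encoder = nn.Conv2d(3, channels, kernel_size=3, padding=1)
            # Coarse branch: rough object position; fine branch: contour detail.
            self.coarse_branch = nn.Conv2d(channels, channels, 3, padding=1)
            self.fine_branch = nn.Conv2d(channels, channels, 3, padding=1)
            # Progressive fusion: appearance -> motion -> coarse -> fine.
            self.fuse_motion = FusionBlock(channels)
            self.fuse_coarse = FusionBlock(channels)
            self.fuse_fine = FusionBlock(channels)
            self.head = nn.Conv2d(channels, 1, kernel_size=1)  # mask logits

        def forward(self, rgb, flow):
            app = self.encoder(rgb)            # appearance features (RGB frame)
            mot = self.encoder(flow)           # motion features (optical flow)
            coarse = self.coarse_branch(app)   # position prompt
            fine = self.fine_branch(app)       # contour prompt
            x = self.fuse_motion(app, mot)
            x = self.fuse_coarse(x, coarse)
            x = self.fuse_fine(x, fine)
            return self.head(x)

    if __name__ == "__main__":
        net = FPLNetSketch()
        rgb = torch.randn(1, 3, 128, 128)   # video frame
        flow = torch.randn(1, 3, 128, 128)  # optical flow as a 3-channel map
        print(net(rgb, flow).shape)         # torch.Size([1, 1, 128, 128])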
  • Received: 2023-11-01  Revised: 2024-01-03  Accepted: 2025-02-25  Online: 2025-02-25  Published: 2025-02-24
  • Supported by: National Natural Science Foundation of China (61872189)

Key words: unsupervised video object segmentation; focused learning; segment anything model