Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation

¹Sun Yat-sen University   ²Shandong University

Video Examples


Abstract


Referring video object segmentation (RVOS) aims to identify, track and segment objects in a video based on language descriptions. To advance the task towards more practical scenarios, we introduce Long-RVOS, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos with an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance and reappearance, and shot changes. The objects are manually annotated with three different types of descriptions: static, dynamic and hybrid. Moreover, unlike previous benchmarks that rely solely on per-frame spatial evaluation, we introduce two new metrics to assess temporal and spatiotemporal consistency. We further propose ReferMo, a promising baseline method that integrates motion information to expand the temporal receptive field, and employs a local-to-global architecture to capture both short-term dynamics and long-term dependencies. We hope that Long-RVOS and our baseline can drive future RVOS research towards tackling more realistic and long-form videos.


  • ☆ Long-term videos: 60.3 seconds & 361.7 frames on average.
  • ☆ Diverse objects: 163 categories.
  • ☆ Explicit description types: Static, Dynamic and Hybrid.
  • ☆ Comprehensive metrics: J&F (spatial), tIoU (temporal), vIoU (spatiotemporal); a sketch of these metrics follows below.
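
Below is a minimal sketch of how the temporal and spatiotemporal metrics could be computed, assuming tIoU measures the overlap between the sets of frames where the prediction and ground truth consider the target visible, and vIoU averages per-frame mask IoU over the temporal union, as is common in spatiotemporal grounding. The exact definitions in the benchmark may differ in detail, and the function names are illustrative.

import numpy as np

def frame_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between two binary masks for a single frame."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / union if union > 0 else 0.0

def tiou(pred_masks: dict, gt_masks: dict) -> float:
    """Temporal IoU: overlap of the frame sets in which the prediction
    and the ground truth say the target is visible."""
    pred_frames = {t for t, m in pred_masks.items() if m.any()}
    gt_frames = {t for t, m in gt_masks.items() if m.any()}
    union = pred_frames | gt_frames
    if not union:
        return 1.0  # convention: both empty counts as a perfect match
    return len(pred_frames & gt_frames) / len(union)

def viou(pred_masks: dict, gt_masks: dict) -> float:
    """Spatiotemporal IoU: per-frame mask IoU summed over the frames where
    both agree the target is visible, normalized by the temporal union."""
    pred_frames = {t for t, m in pred_masks.items() if m.any()}
    gt_frames = {t for t, m in gt_masks.items() if m.any()}
    union = pred_frames | gt_frames
    if not union:
        return 1.0
    inter = pred_frames & gt_frames
    return sum(frame_iou(pred_masks[t], gt_masks[t]) for t in inter) / len(union)

Here pred_masks and gt_masks map a frame index to a binary mask array; frames where the target is absent can simply hold an all-zero mask.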

ReferMo: A Baseline Approach



A video is decomposed into clips (a keyframe plus motion frames). ReferMo perceives the static attributes and short-term motions within each clip, then aggregates inter-clip information to capture the global target. Notably, ReferMo is supervised by keyframe masks only, and SAM2 is used only at inference to track the target in subsequent frames.
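The following is a minimal PyTorch sketch of the local-to-global idea described above, not the official ReferMo implementation: feature dimensions, the attention layout, and the mask head are placeholder assumptions, and the SAM2 propagation step used at inference is omitted.

import torch
import torch.nn as nn

class ReferMoSketch(nn.Module):
    """Illustrative local-to-global skeleton. Each clip is a keyframe plus
    a few motion frames; per-clip features are fused with the text before
    inter-clip aggregation."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.visual_proj = nn.Linear(1024, dim)   # placeholder backbone feature size
        self.text_proj = nn.Linear(768, dim)      # placeholder text feature size
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.mask_head = nn.Linear(dim, dim)      # stand-in for a real mask decoder

    def forward(self, clip_feats: torch.Tensor, text_feats: torch.Tensor):
        # clip_feats: (num_clips, frames_per_clip, 1024) pooled frame features
        # text_feats: (num_tokens, 768) language features
        v = self.visual_proj(clip_feats)                     # (C, F, D)
        t = self.text_proj(text_feats).unsqueeze(0)          # (1, T, D)
        t = t.expand(v.size(0), -1, -1)                      # (C, T, D)

        # Local stage: each clip attends to the text to capture static
        # attributes and short-term motion.
        local, _ = self.local_attn(v, t, t)                  # (C, F, D)
        clip_tokens = local.mean(dim=1, keepdim=True)        # (C, 1, D)

        # Global stage: inter-clip aggregation to identify the target
        # across the whole long video.
        global_tokens = self.global_layer(clip_tokens.transpose(0, 1))  # (1, C, D)

        # Keyframe-level outputs; training would supervise keyframe masks only,
        # and SAM2 would propagate masks to the remaining frames at inference.
        return self.mask_head(global_tokens)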

BibTeX

@article{liang2025longrvos,
  title   = {Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation},
  author  = {Liang, Tianming and Jiang, Haichao and Yang, Yuting and Tan, Chaolei and Li, Shuai and Zheng, Wei-Shi and Hu, Jian-Fang},
  journal = {arXiv preprint arXiv:2505.12702},
  year    = {2025}
}