Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation

¹Sun Yat-sen University   ²Shandong University

Video Examples


Abstract


Referring video object segmentation (RVOS) aims to identify, track and segment objects in a video based on language descriptions. To advance the task towards more practical scenarios, we introduce Long-RVOS, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos with an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance and reappearance, and shot changes. The objects are manually annotated with three different types of descriptions: static, dynamic and hybrid. Moreover, unlike previous benchmarks that rely solely on per-frame spatial evaluation, we introduce two new metrics to assess temporal and spatiotemporal consistency. We further propose ReferMo, a promising baseline method that integrates motion information to expand the temporal receptive field, and employs a local-to-global architecture to capture both short-term dynamics and long-term dependencies. We hope that Long-RVOS and our baseline can drive future RVOS research towards tackling more realistic and long-form videos.


  • ☆ Long-term videos: 60.3 seconds & 361.7 frames on average.
  • ☆ Diverse objects: 163 categories.
  • ☆ Explicit description types: Static, Dynamic and Hybrid.
  • ☆ Comprehensive metrics: J&F (spatial), tIoU (temporal), vIoU (spatiotemporal); a sketch of these metrics follows below.
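
Below is a minimal sketch of how the temporal and spatiotemporal metrics could be computed, assuming tIoU measures the overlap between the sets of frames where the prediction and ground truth consider the target visible, and vIoU averages per-frame mask IoU over the temporal union, as is common in spatiotemporal grounding. The exact definitions in the benchmark may differ in detail, and the function names are illustrative.

import numpy as np

def frame_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between two binary masks for a single frame."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / union if union > 0 else 0.0

def tiou(pred_masks: dict, gt_masks: dict) -> float:
    """Temporal IoU: overlap of the frame sets in which the prediction
    and the ground truth say the target is visible."""
    pred_frames = {t for t, m in pred_masks.items() if m.any()}
    gt_frames = {t for t, m in gt_masks.items() if m.any()}
    union = pred_frames | gt_frames
    if not union:
        return 1.0  # convention: both empty counts as a perfect match
    return len(pred_frames & gt_frames) / len(union)

def viou(pred_masks: dict, gt_masks: dict) -> float:
    """Spatiotemporal IoU: per-frame mask IoU summed over the frames where
    both agree the target is visible, normalized by the temporal union."""
    pred_frames = {t for t, m in pred_masks.items() if m.any()}
    gt_frames = {t for t, m in gt_masks.items() if m.any()}
    union = pred_frames | gt_frames
    if not union:
        return 1.0
    inter = pred_frames & gt_frames
    return sum(frame_iou(pred_masks[t], gt_masks[t]) for t in inter) / len(union)

Here pred_masks and gt_masks map a frame index to a binary mask array; frames where the target is absent can simply hold an all-zero mask.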

ReferMo: A Baseline Approach



A video is decomposed into clips (a keyframe plus motion frames). ReferMo perceives the static attributes and short-term motions within each clip, then aggregates inter-clip information to capture the global target. Notably, ReferMo is supervised by keyframe masks only, and SAM2 is used only at inference to track the target in subsequent frames.
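The following is a minimal PyTorch sketch of the local-to-global idea described above, not the official ReferMo implementation: feature dimensions, the attention layout, and the mask head are placeholder assumptions, and the SAM2 propagation step used at inference is omitted.

import torch
import torch.nn as nn

class ReferMoSketch(nn.Module):
    """Illustrative local-to-global skeleton. Each clip is a keyframe plus
    a few motion frames; per-clip features are fused with the text before
    inter-clip aggregation."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.visual_proj = nn.Linear(1024, dim)   # placeholder backbone feature size
        self.text_proj = nn.Linear(768, dim)      # placeholder text feature size
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.mask_head = nn.Linear(dim, dim)      # stand-in for a real mask decoder

    def forward(self, clip_feats: torch.Tensor, text_feats: torch.Tensor):
        # clip_feats: (num_clips, frames_per_clip, 1024) pooled frame features
        # text_feats: (num_tokens, 768) language features
        v = self.visual_proj(clip_feats)                     # (C, F, D)
        t = self.text_proj(text_feats).unsqueeze(0)          # (1, T, D)
        t = t.expand(v.size(0), -1, -1)                      # (C, T, D)

        # Local stage: each clip attends to the text to capture static
        # attributes and short-term motion.
        local, _ = self.local_attn(v, t, t)                  # (C, F, D)
        clip_tokens = local.mean(dim=1, keepdim=True)        # (C, 1, D)

        # Global stage: inter-clip aggregation to identify the target
        # across the whole long video.
        global_tokens = self.global_layer(clip_tokens.transpose(0, 1))  # (1, C, D)

        # Keyframe-level outputs; training would supervise keyframe masks only,
        # and SAM2 would propagate masks to the remaining frames at inference.
        return self.mask_head(global_tokens)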

BibTeX

@article{liang2025longrvos,
  title   = {Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation},
  author  = {Liang, Tianming and Jiang, Haichao and Yang, Yuting and Tan, Chaolei and Li, Shuai and Zheng, Wei-Shi and Hu, Jian-Fang},
  journal = {arXiv preprint arXiv:2505.12702},
  year    = {2025}
}