Example language descriptions of the three annotation types in Long-RVOS:
(Dynamic) The man is fishing.
(Static) A little boy in a blue shirt on the right.
(Hybrid) A man in black plays with a black-and-red basketball.
(Static) A man wearing a blue shirt, next to a young man wearing a grey-blue coat and glasses.
(Hybrid) A blonde baby plays in a yellow pool on the grass.
(Static) The car with license plate number 34BLC65.
(Dynamic) The first boy successfully crawls up and jumps on the bed.
(Dynamic) The cat crawls into the bag.
Referring video object segmentation (RVOS) aims to identify, track, and segment objects in a video based on language descriptions. To advance the task towards more practical scenarios, we introduce Long-RVOS, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos with an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance and reappearance, and shot changes. The objects are manually annotated with three types of descriptions: static, dynamic, and hybrid. Moreover, unlike previous benchmarks that rely solely on per-frame spatial evaluation, we introduce two new metrics to assess temporal and spatiotemporal consistency. We further propose ReferMo, a promising baseline method that integrates motion information to expand the temporal receptive field, and employs a local-to-global architecture to capture both short-term dynamics and long-term dependencies. We hope that Long-RVOS and our baseline can drive future RVOS research towards tackling more realistic and long-form videos.
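The exact definitions of the two new metrics are given in the paper; the sketch below is only a minimal illustration of the general idea, assuming a temporal score based on agreement about the target's presence across frames and a spatiotemporal score based on volume IoU over the full mask tube. All function names are hypothetical.

import numpy as np

def frame_iou(pred, gt):
    # Per-frame region IoU (the standard spatial measure).
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0  # both empty -> perfect

def spatial_score(preds, gts):
    # Mean per-frame IoU, i.e. the usual per-frame spatial evaluation.
    return float(np.mean([frame_iou(p, g) for p, g in zip(preds, gts)]))

def temporal_score(preds, gts):
    # Hypothetical temporal consistency: IoU between the sets of frames
    # where the target is predicted present vs. actually present.
    pred_on = np.array([p.any() for p in preds])
    gt_on = np.array([g.any() for g in gts])
    inter = np.logical_and(pred_on, gt_on).sum()
    union = np.logical_or(pred_on, gt_on).sum()
    return inter / union if union > 0 else 1.0

def spatiotemporal_score(preds, gts):
    # Hypothetical spatiotemporal consistency: volume IoU over the whole
    # (T, H, W) mask tube rather than an average of per-frame scores.
    preds, gts = np.stack(preds), np.stack(gts)
    inter = np.logical_and(preds, gts).sum()
    union = np.logical_or(preds, gts).sum()
    return inter / union if union > 0 else 1.0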
A video is decomposed into clips (a keyframe plus motion frames). ReferMo perceives the static attributes and short-term motions within each clip, then aggregates inter-clip information to capture the global target. Notably, ReferMo is supervised with keyframe masks only; SAM2 is used only at inference to track the target in subsequent frames.
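A minimal sketch of this local-to-global design, not the actual ReferMo implementation: each clip contributes a keyframe feature and a pooled motion feature, a local fusion step combines them, and a transformer encoder then aggregates information across clips. All module and function names here are hypothetical, and motion frames are assumed to be pooled into one feature per clip.

import torch

def decompose_into_clips(frames, clip_len=8):
    # Hypothetical clip decomposition: the first frame of each clip is
    # treated as the keyframe, the remaining frames as motion frames.
    clips = [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]
    return [(clip[0], clip[1:]) for clip in clips]

class LocalToGlobal(torch.nn.Module):
    # Local stage: fuse each clip's keyframe feature with its pooled
    # motion feature. Global stage: a transformer encoder aggregates
    # the resulting clip-level features across the whole video.
    def __init__(self, dim=256, num_layers=2):
        super().__init__()
        self.local_fuse = torch.nn.Linear(2 * dim, dim)  # keyframe + motion
        layer = torch.nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.global_agg = torch.nn.TransformerEncoder(layer, num_layers)

    def forward(self, key_feats, motion_feats):
        # key_feats, motion_feats: (N_clips, dim)
        local = self.local_fuse(torch.cat([key_feats, motion_feats], dim=-1))
        return self.global_agg(local.unsqueeze(0)).squeeze(0)  # inter-clip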
@article{liang2025longrvos,
title = {Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation},
author = {Liang, Tianming and Jiang, Haichao and Yang, Yuting and Tan, Chaolei and Li, Shuai and Zheng, Wei-Shi and Hu, Jian-Fang},
journal = {arXiv preprint arXiv:2505.12702},
year = {2025}
}