1 Sun Yat-sen University
2 The University of Hong Kong
3 The Hong Kong University of Science and Technology
4 Southern University of Science and Technology
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. Despite significant progress in recent years, current RVOS models still struggle with complex object descriptions, particularly those involving intricate attributes and spatial relationships, due to their limited video-language understanding. To address this limitation, we present ReferDINO, an end-to-end RVOS model that inherits strong vision-language understanding from pretrained visual grounding foundation models and is further endowed with effective temporal understanding and object segmentation capabilities. In ReferDINO, we contribute three technical innovations for effectively adapting these foundation models to RVOS: (1) an object-consistent temporal enhancer that capitalizes on the pretrained object-text representations to enhance temporal understanding and object consistency; (2) a grounding-guided deformable mask decoder that integrates text and grounding conditions to generate accurate object masks; (3) a confidence-aware query pruning strategy that significantly improves object decoding efficiency without compromising performance. Extensive experiments on five public RVOS benchmarks demonstrate that ReferDINO significantly outperforms state-of-the-art methods.
Modules colored in blue are borrowed from GroundingDINO, while those in red are newly introduced in this work.
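To give a concrete feel for the confidence-aware query pruning idea mentioned above, here is a minimal PyTorch sketch: score each object query against the text, keep only the top-k queries, and pass the reduced set to the mask decoder. The tensor names, the sigmoid scoring, and the number of kept queries are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def prune_queries(queries: torch.Tensor,
                  class_logits: torch.Tensor,
                  num_keep: int = 50):
    """Hypothetical confidence-aware query pruning sketch.

    queries:      (B, N, C) object queries from the grounding decoder
    class_logits: (B, N) per-query logits for matching the text description
    Returns the top `num_keep` queries per sample, with their indices and scores.
    """
    scores = class_logits.sigmoid()                        # (B, N) confidence per query
    topk_scores, topk_idx = scores.topk(num_keep, dim=1)   # keep the k most confident
    idx = topk_idx.unsqueeze(-1).expand(-1, -1, queries.size(-1))
    pruned = queries.gather(1, idx)                        # (B, K, C) pruned query set
    return pruned, topk_idx, topk_scores
```

Decoding masks only for the retained queries reduces the cost of the mask decoder roughly in proportion to K/N, which is how such pruning can improve efficiency without changing the remaining queries themselves.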
@article{liang2025referdino,
title={ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations},
author={Liang, Tianming and Lin, Kun-Yu and Tan, Chaolei and Zhang, Jianguo and Zheng, Wei-Shi and Hu, Jian-Fang},
journal={arXiv preprint arXiv:2501.14607},
year={2025}
}