¹Sun Yat-sen University   ²Southern University of Science and Technology
Referring video object segmentation (RVOS) aims to segment a target object throughout a video based on a text description. The task is challenging, as it requires deep vision-language understanding, pixel-level dense prediction, and spatiotemporal reasoning. Despite notable progress in recent years, existing methods still fall short when all of these aspects are considered together. In this work, we propose ReferDINO, a strong RVOS model that inherits region-level vision-language alignment from foundational visual grounding models and is further endowed with pixel-level dense perception and cross-modal spatiotemporal reasoning. Experimental results on five benchmarks demonstrate that ReferDINO significantly outperforms previous methods while running at real-time inference speed.
Figure: Modules colored in blue are borrowed from GroundingDINO, while those in red are newly introduced in this work.
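To make the high-level design concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: a grounding backbone supplies text-aligned region queries, a temporal module lets those queries reason across frames, and a mask head turns them into dense per-frame masks. All class names, module choices, and tensor shapes here are illustrative assumptions, not the official implementation; see the paper and code release for the actual architecture.

```python
import torch
import torch.nn as nn


class DummyGroundingBackbone(nn.Module):
    """Hypothetical stand-in for a GroundingDINO-style detector.

    Returns (a) text-aligned object queries per frame and (b) dense
    pixel features; shapes are assumptions for illustration only.
    """

    def __init__(self, dim: int = 256, num_queries: int = 20):
        super().__init__()
        self.dim, self.num_queries = dim, num_queries

    def forward(self, frames: torch.Tensor, text: str):
        t, _, h, w = frames.shape
        queries = torch.randn(t, self.num_queries, self.dim)    # (T, Q, C)
        pixel_feats = torch.randn(t, self.dim, h // 4, w // 4)  # (T, C, H/4, W/4)
        return queries, pixel_feats


class ReferDINOSketch(nn.Module):
    """Conceptual sketch of the three capabilities named in the abstract."""

    def __init__(self, backbone: nn.Module, dim: int = 256):
        super().__init__()
        # Region-level vision-language alignment, inherited from the grounding model.
        self.backbone = backbone
        # Cross-modal spatiotemporal reasoning: each query attends across frames.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=1)
        # Pixel-level dense perception: queries become dynamic mask kernels.
        self.mask_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, frames: torch.Tensor, text: str) -> torch.Tensor:
        queries, pixel_feats = self.backbone(frames, text)      # (T, Q, C), (T, C, h, w)
        # Reshape so attention runs over the time axis, independently per query.
        queries = self.temporal(queries.transpose(0, 1)).transpose(0, 1)
        kernels = self.mask_head(queries)                       # (T, Q, C)
        # Dot-product each query kernel with the pixel features -> mask logits.
        return torch.einsum("tqc,tchw->tqhw", kernels, pixel_feats)


model = ReferDINOSketch(DummyGroundingBackbone())
masks = model(torch.randn(4, 3, 64, 64), "the red car turning left")
print(masks.shape)  # torch.Size([4, 20, 16, 16])
```

This sketch omits everything that makes the real model work (training losses, query-to-object matching, pretrained GroundingDINO weights), but it shows where each of the three capabilities would plug in.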
@inproceedings{liang2025referdino,
title={ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations},
author={Liang, Tianming and Lin, Kun-Yu and Tan, Chaolei and Zhang, Jianguo and Zheng, Wei-Shi and Hu, Jian-Fang},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2025}
}