
ReViP: Mitigating False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance

Zhuohao Li1,2, Yinghao Li1,2, Jian-Jian Jiang1, Lang Zhou1,2, Tianyu Zhang3,2, Jiadong Yin1, Mu Lin1, Yi-Lin Wei1, Wei-Shi Zheng1,4,5,6,†
1Sun Yat-sen University, 2Shenzhen Loop Area Institute, 3Beijing Institute of Technology, 4Peng Cheng Laboratory, 5Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, 6Guangdong Province Key Laboratory of Information Security Technology
†Corresponding author
Contact: lizhh268@mail2.sysu.edu.cn

ReViP identifies false completion as a critical failure mode in VLA models, diagnoses its cause, benchmarks it with the first False-Completion Benchmark Suite, and mitigates it by rebalancing vision and proprioception, in both simulation and real-world settings.

Abstract

Vision-Language-Action (VLA) models have advanced robotic manipulation by combining vision, language, and proprioception to predict actions. However, previous methods fuse proprioceptive signals directly with vision-language features, resulting in state-dominant bias and false completion despite visible execution failures. We systematically analyze this failure mode, attributing it to modality imbalance, where policies overly rely on internal state progression and underuse visual evidence.

To address this, we introduce the first False-Completion Benchmark Suite, featuring eight tasks with three controlled perturbations (Object Drop, Distractor Swap, Relayout) to comprehensively evaluate false completion.

Moreover, we propose ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. The key insight is to introduce auxiliary progress-aware visual cues to adaptively modulate the coupling between semantic perception and proprioceptive dynamics. Specifically, progress-aware visual cues are extracted by an external Task-Stage Observer, which performs task-relevant reasoning on real-time observations to drive task-stage feature-wise linear modulation, enhancing environmental awareness and mitigating state-driven errors.

Extensive experiments show that ReViP effectively mitigates false completion and improves success rates over strong VLA baselines, achieving a 26% gain over the π0 model on our suite, with gains extending to LIBERO, RoboTwin 2.0, and real-world evaluations.

Overview of ReViP

ReViP

Motivation: "False Completion" in VLA models


ReViP highlights a failure mode that existing VLA evaluation often misses: a robot can look "done" internally while the world clearly shows that the task has not been completed. (The three visualizations below were generated by the π0 model.)

✕ No Back to Object
  Butter -> Basket

✕ No Back to Object
  Orange Juice -> Basket

✕ No Back to Object
  Cream Cheese -> Basket

False-Completion Benchmark Suite

To our knowledge, this is the first benchmark suite built specifically for false completion in VLA models, covering 8 tasks and 3 controlled perturbation families.

False-Completion Benchmark

Object Drop

Tests whether a model can detect unexpected execution-time failure, recover the object, and regrasp instead of blindly continuing the original plan.

Distractor Swap

Stresses instance-level visual grounding by swapping target and distractor poses while keeping language fixed.

Relayout

Breaks demonstration-specific spatial priors and requires policies to adapt to new target-goal layouts from current visual observations.
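As a rough illustration of the Distractor Swap perturbation described above, the sketch below exchanges the poses of a target and a distractor while leaving the instruction untouched. The `Scene` class, the `distractor_swap` function, the object names, and the pose values are all hypothetical and chosen only for this example; the benchmark's actual implementation may differ.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple

Pose = Tuple[float, float, float]  # hypothetical (x, y, z) position


@dataclass(frozen=True)
class Scene:
    """Hypothetical scene description: an instruction plus object poses."""
    instruction: str
    poses: Dict[str, Pose]


def distractor_swap(scene: Scene, target: str, distractor: str) -> Scene:
    """Return a new scene in which target and distractor trade poses.

    The language instruction is kept fixed, so a policy has to find the
    target from the current image rather than from a memorized location.
    """
    poses = dict(scene.poses)  # copy so the original scene is unchanged
    poses[target], poses[distractor] = scene.poses[distractor], scene.poses[target]
    return replace(scene, poses=poses)


# Illustrative values only.
original = Scene(
    instruction="put the butter in the basket",
    poses={"butter": (0.10, 0.20, 0.0), "cream_cheese": (0.40, -0.10, 0.0)},
)
swapped = distractor_swap(original, target="butter", distractor="cream_cheese")
```

A policy that reaches for the memorized butter location in `swapped` would pick up the cream cheese instead, which is the instance-level grounding error this perturbation is meant to expose.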

Method

ReViP introduces a Task-Stage Observer, which performs task-relevant reasoning to extract progress-aware visual cues, and a Task-Stage Enhancer, which injects them back into the VLA backbone, adaptively rebalancing visual grounding and proprioceptive dynamics.
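The abstract describes the injection step as task-stage feature-wise linear modulation. A minimal sketch of generic feature-wise linear modulation (FiLM) is given below, assuming a cue vector is mapped by two linear layers to a per-channel scale `gamma` and shift `beta`. The function names, dimensions, and weights are illustrative assumptions, not the authors' implementation.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute x @ weight + bias, with weight shaped (len(x), len(bias))."""
    return [
        sum(x_i * weight[i][j] for i, x_i in enumerate(x)) + bias[j]
        for j in range(len(bias))
    ]


def film(features: Matrix, cue: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> Matrix:
    """Apply FiLM: every feature token becomes gamma * token + beta.

    gamma and beta are predicted from the cue (here standing in for a
    progress-aware visual cue), so the cue decides per channel how
    strongly each feature is amplified, suppressed, or shifted.
    """
    gamma = linear(cue, w_gamma, b_gamma)
    beta = linear(cue, w_beta, b_beta)
    return [[g * f + b for f, g, b in zip(token, gamma, beta)]
            for token in features]


# Toy example: one 3-channel token and a 2-dimensional cue (assumed sizes).
features = [[2.0, 2.0, 2.0]]
cue = [1.0, 0.0]
w_gamma = [[0.5, 0.0, -1.0], [0.0, 0.0, 0.0]]
b_gamma = [1.0, 1.0, 1.0]  # bias of 1 keeps gamma near identity
w_beta = [[0.0, 0.5, 0.0], [0.0, 0.0, 0.0]]
b_beta = [0.0, 0.0, 0.0]

modulated = film(features, cue, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)  # [[3.0, 2.5, 0.0]]
```

In this toy case the cue amplifies the first channel, shifts the second, and gates the third to zero, which shows how a single conditioning vector can rebalance the contribution of individual feature channels.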

ReViP method overview

Simulation Benchmark Visualization


Object Drop
  Butter -> Basket

Distractor Swap
  Butter -> Basket

Object Drop
  Salad Dressing -> Basket

Real-World Task Visualization


Object Drop

Similar Distractor

Small-Object Distractor

Long-Horizon Task

Experiments

ReViP results

BibTeX


@misc{li2026revip,
  title={ReViP: Mitigating False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance},
  author={Zhuohao Li and Yinghao Li and Jian-Jian Jiang and Lang Zhou and Tianyu Zhang and Jiadong Yin and Mu Lin and Yi-Lin Wei and Wei-Shi Zheng},
  year={2026},
  eprint={2601.16667},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2601.16667},
}