OmniDexGrasp: Generalizable Dexterous Grasping via Foundation Model and Force Feedback

Yi-Lin Wei, Zhexi Luo, Yuhao Lin, Mu Lin, Zhizhao Liang, Shuoyu Chen, Wei-Shi Zheng*
Sun Yat-sen University
Equal contribution, *Corresponding author

Overview


OmniDexGrasp achieves generalizable dexterous grasping with omni capabilities across user prompting, dexterous embodiments, scenes, and grasping tasks by leveraging
(1) foundation models, (2) human-image-to-robot-action transfer, and (3) force-aware adaptive grasping.

OmniDexGrasp teaser image

Visualization

Interactive Visualization

Generated Human Grasp

Real World Grasping

Abstract


In this work, we introduce OmniDexGrasp, a unified framework that achieves generalizable dexterous grasping solely guided by grasp demonstrations generated from foundation generative models. Without relying on robot data or additional training, OmniDexGrasp realizes omni-ability in functional grasping—covering six representative tasks, including semantic grasping, region/point grasping, grasping in cluttered scenes, one-shot demonstration grasping, human–robot handover, and fragile object grasping—while supporting multi-modal inputs such as language, visual prompts, and demonstration images.

Unlike traditional methods that train a dedicated network to predict grasp poses, OmniDexGrasp leverages both a foundation generative model (e.g., GPT-Image) and a foundation visual model to synthesize human grasp images and convert them into executable dexterous robot actions. The framework integrates a human-image-to-robot-action transfer strategy that reconstructs and retargets generated human grasps to robot joint configurations, together with a force-sensing adaptive grasping strategy that ensures stable and reliable execution. Moreover, our framework is naturally extensible to manipulation tasks.
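To make the data flow concrete, the following is a minimal Python sketch of this training-free pipeline. The helper names (generate_grasp_image, reconstruct_hand_object, retarget_to_robot) are hypothetical placeholders standing in for the foundation generative and visual models, not a released implementation.

import numpy as np


def generate_grasp_image(scene_image, instruction):
    """Synthesize a human grasp image from the initial scene image and the
    grasp instruction (placeholder for a foundation image-generation model)."""
    raise NotImplementedError("call a foundation image-generation model here")


def reconstruct_hand_object(grasp_image):
    """Recover the 3D hand-object interaction and the object's 6D pose from
    the generated grasp image (placeholder for foundation visual models)."""
    raise NotImplementedError("call hand/object reconstruction models here")


def retarget_to_robot(hand_mesh, object_pose):
    """Retarget the human grasp to the robot's dexterous hand and align it
    with the real-world object pose (placeholder for retargeting/alignment)."""
    raise NotImplementedError("solve retargeting and pose alignment here")


def plan_dexterous_grasp(scene_image: np.ndarray, instruction: str):
    """Image generation -> reconstruction -> retargeting; returns a wrist
    pose and target finger joint angles for the force-aware executor."""
    grasp_image = generate_grasp_image(scene_image, instruction)
    hand_mesh, object_pose = reconstruct_hand_object(grasp_image)
    wrist_pose, joint_angles = retarget_to_robot(hand_mesh, object_pose)
    return wrist_pose, joint_angles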

Extensive experiments in both simulation and real-world settings demonstrate that foundation models can provide precise and semantically aligned guidance for dexterous grasping, achieving an average 87.9% success rate across six diverse tasks. As foundation models continue to advance, we envision future work extending OmniDexGrasp toward non-prehensile manipulation, further promoting the integration of foundation models into embodied intelligence.

Framework

Framework of OmniDexGrasp

(a) Using a foundation generative model, a human grasp image is generated based on the given grasp instruction and the initial scene image. (b) Relying solely on foundation visual models, the human-image-to-robot-action transfer module reconstructs the 3D hand–object interaction from the generated grasp image, retargets the human grasp to the robot’s dexterous hand, and aligns the grasp with the real-world object 6D pose to obtain an executable dexterous grasp action. (c) A force-sensing adaptive grasping strategy executes the grasp by dynamically adjusting finger motions according to force feedback, ensuring stable and reliable grasp execution.
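As a rough illustration of step (c), the sketch below implements a simple force-sensing closing loop in Python: each finger advances toward its retargeted target angle and is frozen once its fingertip force crosses a threshold. The threshold value, step count, and one-joint-per-finger simplification are assumptions for illustration, not the paper's exact control law.

import numpy as np


def force_adaptive_close(read_fingertip_forces, command_joint_angles,
                         start_angles, target_angles,
                         force_threshold=1.5, steps=50):
    """Incrementally close the hand, freezing any finger whose fingertip
    force exceeds force_threshold (illustrative units: newtons).

    read_fingertip_forces(): returns one force reading per finger.
    command_joint_angles(q): sends the joint command q to the hand
    (simplified here to one proxy joint per finger).
    """
    q = np.asarray(start_angles, dtype=float).copy()
    target = np.asarray(target_angles, dtype=float)
    step = (target - q) / steps                  # per-iteration closing increment
    in_contact = np.zeros(q.shape, dtype=bool)

    for _ in range(steps):
        forces = np.asarray(read_fingertip_forces(), dtype=float)
        in_contact |= forces > force_threshold   # latch fingers in firm contact
        q = np.where(in_contact, q, q + step)    # advance only free fingers
        command_joint_angles(q)
        if in_contact.all():                     # all fingers loaded: grasp is set
            break
    return q

Latching contact per finger lets the remaining fingers keep closing, so the hand conforms to the object rather than stopping at the first contact; this is one reason a force-aware policy helps with fragile objects and imperfect pose estimates.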

BibTeX

@article{wei2025omnidexgrasp,
  author    = {Yi-Lin Wei and Zhexi Luo and Yuhao Lin and Mu Lin and Zhizhao Liang and Shuoyu Chen and Wei-Shi Zheng},
  title     = {OmniDexGrasp: Generalizable Dexterous Grasping via Foundation Model and Force Feedback},
  journal   = {arXiv},
  year      = {2025},
}

The source code for this website is adapted from the template provided by nerfies.github.io.