OmniDexGrasp: Generalizable Dexterous Grasping via Foundation Model and Force Feedback

Yi-Lin Wei, Zhexi Luo, Yuhao Lin, Mu Lin, Zhizhao Liang, Shuoyu Chen, Wei-Shi Zheng*
Sun Yat-sen University
Equal contribution, *Corresponding author

Overview


OmniDexGrasp achieves generalizable dexterous grasping with omni capabilities across user prompting, dexterous embodiments, scenes, and grasping tasks by leveraging
(1) foundation models, (2) human-image-to-robot-action transfer, and (3) force-aware adaptive grasping.

OmniDexGrasp teaser image

Visualization

Interactive Visualization

Generated Human Grasp

Real World Grasping

Abstract


In this work, we introduce OmniDexGrasp, a unified framework that achieves generalizable dexterous grasping solely guided by grasp demonstrations generated from foundation generative models. Without relying on robot data or additional training, OmniDexGrasp realizes omni-ability in functional grasping—covering six representative tasks, including semantic grasping, region/point grasping, grasping in cluttered scenes, one-shot demonstration grasping, human–robot handover, and fragile object grasping—while supporting multi-modal inputs such as language, visual prompts, and demonstration images.

Unlike traditional methods that train a dedicated network to predict grasp poses, OmniDexGrasp leverages both a foundation generative model (e.g., GPT-Image) and a foundation visual model to synthesize human grasp images and convert them into executable dexterous robot actions. The framework integrates a human-image-to-robot-action transfer strategy that reconstructs and retargets generated human grasps to robot joint configurations, together with a force-sensing adaptive grasping strategy that ensures stable and reliable execution. Moreover, our framework is naturally extensible to manipulation tasks.
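To make the data flow concrete, the following is a minimal Python sketch of this training-free pipeline. The helper names (generate_grasp_image, reconstruct_hand_object, retarget_to_robot) are hypothetical placeholders standing in for the foundation generative and visual models, not a released implementation.

import numpy as np


def generate_grasp_image(scene_image, instruction):
    """Synthesize a human grasp image from the initial scene image and the
    grasp instruction (placeholder for a foundation image-generation model)."""
    raise NotImplementedError("call a foundation image-generation model here")


def reconstruct_hand_object(grasp_image):
    """Recover the 3D hand-object interaction and the object's 6D pose from
    the generated grasp image (placeholder for foundation visual models)."""
    raise NotImplementedError("call hand/object reconstruction models here")


def retarget_to_robot(hand_mesh, object_pose):
    """Retarget the human grasp to the robot's dexterous hand and align it
    with the real-world object pose (placeholder for retargeting/alignment)."""
    raise NotImplementedError("solve retargeting and pose alignment here")


def plan_dexterous_grasp(scene_image: np.ndarray, instruction: str):
    """Image generation -> reconstruction -> retargeting; returns a wrist
    pose and target finger joint angles for the force-aware executor."""
    grasp_image = generate_grasp_image(scene_image, instruction)
    hand_mesh, object_pose = reconstruct_hand_object(grasp_image)
    wrist_pose, joint_angles = retarget_to_robot(hand_mesh, object_pose)
    return wrist_pose, joint_angles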

Extensive experiments in both simulation and real-world settings demonstrate that foundation models can provide precise and semantically aligned guidance for dexterous grasping, achieving an average 87.9% success rate across six diverse tasks. As foundation models continue to advance, we envision future work extending OmniDexGrasp toward non-prehensile manipulation, further promoting the integration of foundation models into embodied intelligence.

Framework

Framework of OmniDexGrasp

(a) Using a foundation generative model, a human grasp image is generated based on the given grasp instruction and the initial scene image. (b) Relying solely on foundation visual models, the human-image-to-robot-action transfer module reconstructs the 3D hand–object interaction from the generated grasp image, retargets the human grasp to the robot’s dexterous hand, and aligns the grasp with the real-world object 6D pose to obtain an executable dexterous grasp action. (c) A force-sensing adaptive grasping strategy executes the grasp by dynamically adjusting finger motions according to force feedback, ensuring stable and reliable grasp execution.
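As a rough illustration of step (c), the sketch below implements a simple force-sensing closing loop in Python: each finger advances toward its retargeted target angle and is frozen once its fingertip force crosses a threshold. The threshold value, step count, and one-joint-per-finger simplification are assumptions for illustration, not the paper's exact control law.

import numpy as np


def force_adaptive_close(read_fingertip_forces, command_joint_angles,
                         start_angles, target_angles,
                         force_threshold=1.5, steps=50):
    """Incrementally close the hand, freezing any finger whose fingertip
    force exceeds force_threshold (illustrative units: newtons).

    read_fingertip_forces(): returns one force reading per finger.
    command_joint_angles(q): sends the joint command q to the hand
    (simplified here to one proxy joint per finger).
    """
    q = np.asarray(start_angles, dtype=float).copy()
    target = np.asarray(target_angles, dtype=float)
    step = (target - q) / steps                  # per-iteration closing increment
    in_contact = np.zeros(q.shape, dtype=bool)

    for _ in range(steps):
        forces = np.asarray(read_fingertip_forces(), dtype=float)
        in_contact |= forces > force_threshold   # latch fingers in firm contact
        q = np.where(in_contact, q, q + step)    # advance only free fingers
        command_joint_angles(q)
        if in_contact.all():                     # all fingers loaded: grasp is set
            break
    return q

Latching contact per finger lets the remaining fingers keep closing, so the hand conforms to the object rather than stopping at the first contact; this is one reason a force-aware policy helps with fragile objects and imperfect pose estimates.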

BibTeX

@article{wei2025omnidexgrasp,
  author    = {Yi-Lin Wei and Zhexi Luo and Yuhao Lin and Mu Lin and Zhizhao Liang and Shuoyu Chen and Wei-Shi Zheng},
  title     = {OmniDexGrasp: Generalizable Dexterous Grasping via Foundation Model and Force Feedback},
  journal   = {arXiv},
  year      = {2025},
}

The source code for this website is adapted from the template provided by nerfies.github.io.