AffordDexGrasp: Open-set Language-guided Dexterous Grasp
with Generalizable-Instructive Affordance


Yi-Lin Wei*1,   Mu Lin*1,   Yuhao Lin1,   Jian-Jian Jiang1,   Xiao-Ming Wu1,   Liang-An Zeng2,   Wei-Shi Zheng1†

1 Sun Yat-sen University, China    

† Corresponding author



Abstract


Language-guided robot dexterous grasp generation enables robots to grasp and manipulate objects based on human commands. However, previous data-driven methods struggle to understand intention and to execute grasps on unseen categories in the open set. In this work, we explore a new task, Open-set Language-guided Dexterous Grasp, and find that the main challenge is the huge gap between high-level human language semantics and low-level robot actions. To solve this problem, we propose the Affordance Dexterous Grasp (AffordDexGrasp) framework, with the insight of bridging this gap with a new generalizable-instructive affordance representation. This affordance generalizes to unseen categories by leveraging the object's local structure and category-agnostic semantic attributes, thereby effectively guiding dexterous grasp generation. Built upon this affordance, our framework introduces Affordance Flow Matching (AFM) for affordance generation with language as input, and Grasp Flow Matching (GFM) for generating dexterous grasps with affordance as input. To evaluate our framework, we build an open-set, table-top, language-guided dexterous grasp dataset. Extensive experiments in simulation and the real world show that our framework surpasses all previous methods in open-set generalization.



AffordDexGrasp Framework


The pipeline of the Affordance Dexterous Grasp framework. The inference pipeline includes three stages: 1) intention pre-understanding assisted by an MLLM; 2) Affordance Flow Matching, which generates the affordance based on the MLLM output; 3) Grasp Flow Matching and optimization, which output grasp poses based on the affordance and the MLLM outputs. At training time, AFM and GFM are trained independently, one after the other. The Transformer and Perceiver are attention-based interaction modules for velocity vector field prediction.
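A minimal sketch of how this three-stage inference pipeline could be wired together. All module names, conditioning interfaces, and the grasp dimensionality are hypothetical placeholders, not the released implementation; the flow-matching stages are shown as plain Euler integration of a learned conditional velocity field.

```python
import torch


@torch.no_grad()
def flow_matching_sample(velocity_net, cond, shape, num_steps=50, device="cuda"):
    """Generic flow-matching sampler: Euler-integrate a learned conditional
    velocity field from a Gaussian prior at t=0 toward the data at t=1."""
    x = torch.randn(shape, device=device)               # sample from the prior
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        v = velocity_net(x, t, cond)                     # predicted velocity at (x, t)
        x = x + dt * v                                   # Euler integration step
    return x


def afford_dex_grasp_inference(mllm, afm_net, gfm_net, rgb, point_cloud, command):
    """Hypothetical three-stage inference pipeline (all interfaces are placeholders).
    1) The MLLM pre-understands the user's intention from the RGB image and command.
    2) AFM generates a per-point affordance map on the object point cloud,
       conditioned on the MLLM output.
    3) GFM generates a dexterous grasp conditioned on the affordance and the
       MLLM output; a post-optimization step can further refine the pose."""
    intention = mllm(rgb, command)                                       # stage 1
    affordance = flow_matching_sample(
        afm_net, cond=(point_cloud, intention),
        shape=(1, point_cloud.shape[1], 1))                              # stage 2
    grasp = flow_matching_sample(
        gfm_net, cond=(point_cloud, affordance, intention),
        shape=(1, 25))  # e.g. 3-D translation + 6-D rotation + 16 Leap Hand joints (hypothetical layout)
    return grasp
```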


Real World Experiment



Experiment Setup

The real-world experiments are conducted to verify the simulation-to-reality transfer ability of our framework. We use a Leap Hand, a Kinova Gen3 6-DoF arm, and the Kinova arm's stock wrist-mounted RGB-D camera. In the experiments, we synthesize the scene point cloud by capturing several partial depth maps around the object. The scene point cloud, an RGB image, and the user's language command are then fed into our framework to obtain the dexterous grasp pose. During execution, we first move the arm to a pre-grasp position, then synchronously move the joints of the robotic arm and the dexterous hand to reach the target pose.
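A rough sketch of this capture-and-execute procedure, assuming hypothetical camera, arm, hand, and framework interfaces (these are not the actual drivers or APIs used in the experiments):

```python
import numpy as np


def collect_scene_point_cloud(camera, arm, viewpoints):
    """Fuse several partial depth maps taken from viewpoints around the object
    into one scene point cloud (camera/arm interfaces are placeholders)."""
    clouds = []
    for pose in viewpoints:
        arm.move_to(pose)                                  # move the wrist camera to the viewpoint
        depth = camera.capture_depth()
        clouds.append(camera.depth_to_points(depth, arm.camera_extrinsics()))
    return np.concatenate(clouds, axis=0)


def execute_language_guided_grasp(framework, camera, arm, hand, command, viewpoints):
    """End-to-end real-world loop: build the scene point cloud, query the
    framework with point cloud + RGB + language, then execute the grasp."""
    scene_pc = collect_scene_point_cloud(camera, arm, viewpoints)
    rgb = camera.capture_rgb()
    grasp = framework.predict(scene_pc, rgb, command)      # wrist pose + hand joint angles

    arm.move_to(grasp.pre_grasp_pose)                      # 1) move to a pre-grasp position
    # 2) synchronously drive the arm joints and the dexterous-hand joints
    #    toward the target grasp pose
    for arm_q, hand_q in zip(grasp.arm_waypoints, grasp.hand_waypoints):
        arm.set_joint_targets(arm_q)
        hand.set_joint_targets(hand_q)
```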

Experiment Visualization


Contact


If you have any questions, please feel free to contact us:

  • Yi-Lin Wei: weiylin5@mail2.sysu.edu.cn