Dexterous Grasp Transformer

CVPR 2024


Guo-Hao Xu * 1    Yi-Lin Wei * 1     Dian Zheng1     Xiao-Ming Wu1    Wei-Shi Zheng † 1, 2

1 School of Computer Science and Engineering, Sun Yat-sen University, China    
2 Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China   

* equal contribution    † corresponding author


Motivation



Conditional generative models tend to produce nearly identical grasps for the same input at inference time because of the strong conditioning; diffusion-based models are an exception and can generate diverse grasps, but of low quality.

In contrast, vanilla discriminative models can only predict a single grasp pose for each input object.

We formulate dexterous grasp generation as a set prediction task and design a transformer-based grasping model inspired by the impressive success of Detection Transformers.

However, we identify that this set prediction paradigm encounters several optimization challenges in the field of dexterous grasping.



DGTR



The input of DGTR is the complete point cloud $\mathcal{O}$ of an object. First, the PointNet++ encoder downsamples the point cloud and extracts a set of object features. Next, the transformer decoder takes $N$ learnable query embeddings together with the object features as input and predicts $N$ diverse grasp poses in parallel. In the dynamic matching training (DMT) stage, the model is trained with the matching results produced by the Hungarian algorithm, without the object penetration loss. In the static matching training stages, we use the static matching recorded in the DMT stage to train the model with the object penetration loss. At test time, we adopt an adversarial-balanced loss to directly finetune the predicted hand pose parameters.
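
As a rough illustration of the architecture described above, the sketch below wires learnable query embeddings into a standard transformer decoder over object tokens. This is a minimal PyTorch sketch under stated assumptions: the per-point MLP is only a stand-in for the PointNet++ encoder, and the pose head's output layout (3-D translation, 6-D rotation, 22 joint angles) is illustrative rather than the actual hand parameterization.

import torch
import torch.nn as nn

class DGTRSketch(nn.Module):
    """Minimal DGTR-style set predictor: object tokens -> N grasp poses in parallel."""
    def __init__(self, num_queries=16, d_model=256, pose_dim=3 + 6 + 22):
        super().__init__()
        # Stand-in for the PointNet++ encoder: a shared per-point MLP.
        # The real backbone downsamples the cloud and extracts hierarchical features.
        self.backbone = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, d_model))
        self.queries = nn.Embedding(num_queries, d_model)         # N learnable query embeddings
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.pose_head = nn.Linear(d_model, pose_dim)              # per-query hand pose parameters

    def forward(self, points):                    # points: (B, P, 3) object point cloud
        obj_tokens = self.backbone(points)        # (B, P, d_model) object features
        q = self.queries.weight.unsqueeze(0).expand(points.shape[0], -1, -1)
        hs = self.decoder(tgt=q, memory=obj_tokens)  # queries cross-attend to object features
        return self.pose_head(hs)                 # (B, N, pose_dim): N diverse grasps in parallel

poses = DGTRSketch()(torch.randn(2, 1024, 3))     # -> torch.Size([2, 16, 31])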


DSMT


DMT: The matching between the predictions and the targets is dynamically generated by the Hungarian algorithm. The object penetration loss is excluded.

SMW: The static matching results recorded in the DMT stage are used. The object penetration loss is still excluded.

SMPT: The static matching results are still used, and the object penetration loss and the hand-object distance loss are incorporated.
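
To make the matching schedule concrete, here is a hedged sketch of how the dynamic and static matching could be organized, assuming SciPy's linear_sum_assignment as the Hungarian solver; the cost matrix, the get_matching helper, and the point at which assignments are frozen are illustrative assumptions, not the released training code.

import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost):
    """cost: (N_pred, N_gt) matching cost matrix; returns (pred_idx, gt_idx) pairs."""
    rows, cols = linear_sum_assignment(cost)       # Hungarian algorithm: minimum-cost assignment
    return list(zip(rows.tolist(), cols.tolist()))

static_matchings = {}                              # object_id -> matching recorded during DMT

def get_matching(object_id, cost, stage):
    if stage == "DMT":                             # dynamic: recompute the assignment every iteration
        match = hungarian_match(cost)
        static_matchings[object_id] = match        # keep the latest result for the static stages
        return match
    # SMW / SMPT: reuse the matching recorded in the DMT stage, so the penetration and
    # hand-object distance losses are optimized against fixed prediction-target pairs.
    return static_matchings[object_id]

cost = np.random.rand(16, 8)                       # e.g. 16 predicted grasps vs. 8 ground-truth grasps
print(get_matching("mug_01", cost, stage="DMT"))   # dynamic matching
print(get_matching("mug_01", cost, stage="SMPT"))  # reuses the recorded static matching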


AB-TTA


Translation Moderation: Downscale the gradient of the global translation by a factor $\beta_{t}$.

TTA-distance loss: $\mathcal{L}_{\text{tta-dist}} = \sum_{i}\mathbb{I}\big((d(p^{c}_{i}) < \tau) \lor (d(p^{r}_{i}) < \tau)\big) \cdot d(p^{r}_{i})$
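
These two components can be written down in a few lines. The following is a hedged PyTorch sketch, not the released AB-TTA code: it assumes the distances d_c / d_r of the two hand point sets to the object surface are precomputed, that the first three entries of the pose vector are the global translation, and a plain gradient step in place of the full adversarial-balanced update.

import torch

def tta_distance_loss(d_c, d_r, tau):
    # d_c, d_r: (N,) distances d(p^c_i), d(p^r_i) from hand points to the object surface.
    mask = (d_c < tau) | (d_r < tau)               # indicator over points close enough to the object
    return (mask.float() * d_r).sum()

def ab_tta_step(pose, loss, beta_t, lr=1e-3):
    """One test-time update of the hand pose parameters with translation moderation."""
    # `loss` would combine the TTA-distance loss with a penetration term in the
    # adversarial-balanced objective; here it is any scalar that depends on `pose`.
    (grad,) = torch.autograd.grad(loss, pose)
    grad = grad.clone()
    grad[..., :3] = grad[..., :3] * beta_t         # translation moderation: downscale its gradient
    return (pose - lr * grad).detach().requires_grad_(True)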


Qualitative results



Citation



@inproceedings{xu2024dexterous,
  title={Dexterous Grasp Transformer},
  author={Xu, Guo-Hao and Wei, Yi-Lin and Zheng, Dian and Wu, Xiao-Ming and Zheng, Wei-Shi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}


Contact


If you have any questions, please feel free to contact us:

  • Guo-Hao Xu: xugh23@mail2.sysu.edu.cn
  • Wei-Shi Zheng: wszheng@ieee.org