Dexterous Grasp Transformer

CVPR 2024


Guo-Hao Xu * 1    Yi-Lin Wei * 1     Dian Zheng1     Xiao-Ming Wu1    Wei-Shi Zheng † 1, 2

1 School of Computer Science and Engineering, Sun Yat-sen University, China    
2 Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China   

* equal contribution    † corresponding author


Motivation



Conditional generative models tend to produce nearly identical grasps for the same input at inference time because of the strong conditioning; diffusion-based models are an exception and can generate diverse grasps, but of low quality.

In contrast, vanilla discriminative models can only predict a single grasp pose for each input object.

We formulate dexterous grasp generation as a set prediction task and design a transformer-based grasping model inspired by the impressive success of Detection Transformers.

However, we identify that this set prediction paradigm encounters several optimization challenges in the field of dexterous grasping.



DGTR



The input of DGTR is the complete point cloud $\mathcal{O}$ of an object. First, the PointNet++ encoder downsamples the point cloud and extracts a set of object features. Next, the transformer decoder takes $N$ learnable query embeddings together with the object features as input and predicts $N$ diverse grasp poses in parallel. In the dynamic matching training (DMT) stage, the model is trained with the matching results produced by the Hungarian algorithm, without the object penetration loss. In the static matching training stages, we use the static matching recorded in the DMT stage to train the model with the object penetration loss. At test time, we adopt an adversarial-balanced loss to directly finetune the predicted hand pose parameters.
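
As a rough illustration of the architecture described above, the sketch below wires learnable query embeddings into a standard transformer decoder over object tokens. This is a minimal PyTorch sketch under stated assumptions: the per-point MLP is only a stand-in for the PointNet++ encoder, and the pose head's output layout (3-D translation, 6-D rotation, 22 joint angles) is illustrative rather than the actual hand parameterization.

import torch
import torch.nn as nn

class DGTRSketch(nn.Module):
    """Minimal DGTR-style set predictor: object tokens -> N grasp poses in parallel."""
    def __init__(self, num_queries=16, d_model=256, pose_dim=3 + 6 + 22):
        super().__init__()
        # Stand-in for the PointNet++ encoder: a shared per-point MLP.
        # The real backbone downsamples the cloud and extracts hierarchical features.
        self.backbone = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, d_model))
        self.queries = nn.Embedding(num_queries, d_model)         # N learnable query embeddings
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.pose_head = nn.Linear(d_model, pose_dim)              # per-query hand pose parameters

    def forward(self, points):                    # points: (B, P, 3) object point cloud
        obj_tokens = self.backbone(points)        # (B, P, d_model) object features
        q = self.queries.weight.unsqueeze(0).expand(points.shape[0], -1, -1)
        hs = self.decoder(tgt=q, memory=obj_tokens)  # queries cross-attend to object features
        return self.pose_head(hs)                 # (B, N, pose_dim): N diverse grasps in parallel

poses = DGTRSketch()(torch.randn(2, 1024, 3))     # -> torch.Size([2, 16, 31])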


DSMT


DMT: The matching between the predictions and the targets is dynamically generated by the Hungarian algorithm. The object penetration loss is excluded.

SMW: The static matching results recorded in the DMT stage are used. The object penetration loss is still excluded.

SMPT: The static matching results are still used, and the object penetration loss and the hand-object distance loss are incorporated.
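
To make the matching schedule concrete, here is a hedged sketch of how the dynamic and static matching could be organized, assuming SciPy's linear_sum_assignment as the Hungarian solver; the cost matrix, the get_matching helper, and the point at which assignments are frozen are illustrative assumptions, not the released training code.

import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost):
    """cost: (N_pred, N_gt) matching cost matrix; returns (pred_idx, gt_idx) pairs."""
    rows, cols = linear_sum_assignment(cost)       # Hungarian algorithm: minimum-cost assignment
    return list(zip(rows.tolist(), cols.tolist()))

static_matchings = {}                              # object_id -> matching recorded during DMT

def get_matching(object_id, cost, stage):
    if stage == "DMT":                             # dynamic: recompute the assignment every iteration
        match = hungarian_match(cost)
        static_matchings[object_id] = match        # keep the latest result for the static stages
        return match
    # SMW / SMPT: reuse the matching recorded in the DMT stage, so the penetration and
    # hand-object distance losses are optimized against fixed prediction-target pairs.
    return static_matchings[object_id]

cost = np.random.rand(16, 8)                       # e.g. 16 predicted grasps vs. 8 ground-truth grasps
print(get_matching("mug_01", cost, stage="DMT"))   # dynamic matching
print(get_matching("mug_01", cost, stage="SMPT"))  # reuses the recorded static matching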


AB-TTA


Translation Moderation: Downscale the gradient of the global translation by a factor $\beta_{t}$.

TTA-distance loss: $\mathcal{L}_{\text{tta-dist}} = \sum_{i}\mathbb{I}\big((d(p^{c}_{i}) < \tau) \lor (d(p^{r}_{i}) < \tau)\big) \cdot d(p^{r}_{i})$
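
These two components can be written down in a few lines. The following is a hedged PyTorch sketch, not the released AB-TTA code: it assumes the distances d_c / d_r of the two hand point sets to the object surface are precomputed, that the first three entries of the pose vector are the global translation, and a plain gradient step in place of the full adversarial-balanced update.

import torch

def tta_distance_loss(d_c, d_r, tau):
    # d_c, d_r: (N,) distances d(p^c_i), d(p^r_i) from hand points to the object surface.
    mask = (d_c < tau) | (d_r < tau)               # indicator over points close enough to the object
    return (mask.float() * d_r).sum()

def ab_tta_step(pose, loss, beta_t, lr=1e-3):
    """One test-time update of the hand pose parameters with translation moderation."""
    # `loss` would combine the TTA-distance loss with a penetration term in the
    # adversarial-balanced objective; here it is any scalar that depends on `pose`.
    (grad,) = torch.autograd.grad(loss, pose)
    grad = grad.clone()
    grad[..., :3] = grad[..., :3] * beta_t         # translation moderation: downscale its gradient
    return (pose - lr * grad).detach().requires_grad_(True)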


Qualitative results



Citation



@inproceedings{xu2024dexterous,
  title={Dexterous Grasp Transformer},
  author={Xu, Guo-Hao and Wei, Yi-Lin and Zheng, Dian and Wu, Xiao-Ming and Zheng, Wei-Shi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}


Contact


If you have any questions, please feel free to contact us:

  • Guo-Hao Xu: xugh23@mail2.sysu.edu.cn
  • Wei-Shi Zheng: wszheng@ieee.org