1 School of Computer Science and Engineering, Sun Yat-sen University, China
2 Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China
* equal contribution
† corresponding author
Conditional generative models tend to produce nearly identical outputs for the same input at inference time, because the strong condition dominates the generation; diffusion-based models are an exception, as they can generate diverse grasps, but of low quality.
Meanwhile, vanilla discriminative models can only predict a single grasp pose for a given input object.
We formulate dexterous grasp generation as a set prediction task and design a transformer-based grasping model inspired by the impressive success of Detection Transformers.
However, we identify that this set prediction paradigm encounters several optimization challenges in the field of dexterous grasping.
The input of DGTR is the complete point cloud $\mathcal{O}$ of an object. First, the PointNet++ encoder downsamples the point cloud and extracts a set of object features. Next, the transformer decoder takes $N$ learnable query embeddings as well as the object features as input and predicts $N$ diverse grasp poses in parallel. In the dynamic matching training stage, the model is trained with the matching produced by the Hungarian algorithm and without the object penetration loss. In the static matching training stages, the model is trained with the static matching recorded in the DMT stage, with the object penetration loss incorporated in the final stage. At test time, we adopt an adversarial-balanced loss to directly finetune the hand pose parameters.
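To make the pipeline concrete, below is a minimal PyTorch sketch of the query-based decoding step. The module name, layer sizes, and the pose dimensionality are assumptions for illustration (the PointNet++ backbone is assumed to produce per-point features); it is not the released implementation.

```python
import torch
import torch.nn as nn

class DGTRDecoderSketch(nn.Module):
    """Sketch: N learnable queries attend to object point features and are
    decoded into N grasp poses in parallel (hypothetical sizes)."""

    def __init__(self, num_queries=16, d_model=256, pose_dim=28):
        super().__init__()
        # N learnable query embeddings, one per predicted grasp
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # Regress hand pose parameters (placeholder: translation + rotation + joint angles)
        self.pose_head = nn.Linear(d_model, pose_dim)

    def forward(self, object_features):
        # object_features: (B, M, d_model), e.g. from a PointNet++ encoder
        B = object_features.size(0)
        tgt = self.queries.weight.unsqueeze(0).expand(B, -1, -1)  # (B, N, d_model)
        decoded = self.decoder(tgt, object_features)              # (B, N, d_model)
        return self.pose_head(decoded)                            # (B, N, pose_dim)

# Usage: feats = pointnet2_encoder(point_cloud); poses = DGTRDecoderSketch()(feats)
```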
DMT (Dynamic Matching Training): the matching between the predictions and the targets is dynamically generated by the Hungarian algorithm. The object penetration loss is excluded.
SMW (Static Matching Warm-up): the static matching results recorded in the DMT stage are reused. The object penetration loss is still excluded.
SMPT (Static Matching Penetration Training): the static matching results are still used, and the object penetration loss and the hand-object distance loss are incorporated. A sketch of this matching scheme is given after this list.
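The sketch below illustrates one way the dynamic-then-static matching could be realized, using scipy's `linear_sum_assignment` as the Hungarian algorithm on a simple L1 pose cost and caching the DMT-stage assignment for reuse in SMW/SMPT. The cost term, shapes, and function names are assumptions, not the released code.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_poses, target_poses):
    """One-to-one matching between N predicted and K target grasps.
    Cost here is a plain L1 pose distance (assumption; the actual matching
    cost may combine several pose terms)."""
    cost = torch.cdist(pred_poses, target_poses, p=1)          # (N, K)
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    return list(zip(row.tolist(), col.tolist()))

static_assignments = {}  # object_id -> matching recorded during the DMT stage

def get_matching(object_id, pred_poses, target_poses, stage):
    if stage == "DMT":
        match = hungarian_match(pred_poses, target_poses)
        static_assignments[object_id] = match    # record for SMW / SMPT
        return match
    # SMW / SMPT: reuse the static matching recorded during DMT
    return static_assignments[object_id]
```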
Translation Moderation: downscale the gradient of the global translation by a factor $\beta_{t}$ (see the sketch following these items).
TTA-distance loss: $\mathcal{L}_{\text{tta-dist}} = \sum_{i}\mathbb{I}\big((d(p^{c}_{i}) < \tau) \lor (d(p^{r}_{i}) < \tau)\big) \cdot d(p^{r}_{i})$
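As a rough illustration of the test-time adaptation step, the sketch below computes the indicator-weighted distance loss above and applies one gradient step with the translation gradient downscaled by $\beta_{t}$. The distance inputs, the threshold value, the assumption that the first three pose entries are the global translation, and the `compute_losses` callable are all hypothetical placeholders.

```python
import torch

def tta_distance_loss(d_pc, d_pr, tau=0.01):
    """L_tta-dist = sum_i 1[(d(p^c_i) < tau) or (d(p^r_i) < tau)] * d(p^r_i).
    d_pc, d_pr: per-point distances to the object surface, shape (P,); tau is a placeholder."""
    mask = ((d_pc < tau) | (d_pr < tau)).float()
    return (mask * d_pr).sum()

def tta_step(hand_pose, compute_losses, beta_t=0.1, lr=1e-3):
    """One adaptation step on the predicted hand pose parameters.
    `compute_losses` is a hypothetical callable returning the adversarial-balanced
    loss (e.g. penetration + distance terms) for the current pose."""
    hand_pose = hand_pose.detach().requires_grad_(True)
    loss = compute_losses(hand_pose)
    loss.backward()
    grad = hand_pose.grad.clone()
    # Translation moderation: downscale the global-translation gradient
    # (assumes the first three pose entries are the global translation).
    grad[..., :3] *= beta_t
    with torch.no_grad():
        hand_pose -= lr * grad
    return hand_pose.detach()
```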
@inproceedings{xu2024dexterous,
title={Dexterous Grasp Transformer},
author={Xu, Guo-Hao and Wei, Yi-Lin and Zheng, Dian and Wu, Xiao-Ming and Zheng, Wei-Shi},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}
If you have any questions, please feel free to contact us: