DIG3D: Marrying Gaussian Splatting with Deformable Transformer for Single Image 3D Reconstruction

1Hong Kong University of Science and Technology,
2International Digital Economy Academy (IDEA),
3The Chinese University of Hong Kong (Shenzhen)
arXiv:2404.16323


Open-category 3D object reconstruction from a single-view image


Abstract

In this paper, we propose DIG3D (Gaussian splatting with deformable Transformer for single image 3D reconstruction), a novel approach for 3D object reconstruction and novel view synthesis from a single-view RGB image. In contrast to directly regressing 3D Gaussian parameters from per-pixel image features, DIG3D employs an encoder-decoder framework in which the decoder generates 3D Gaussians guided by depth-aware image features from the encoder. This design avoids the shortcut of merely reproducing the input image, improving both 3D object geometry and rendering accuracy. In particular, a deformable Transformer is employed to enable efficient and effective decoding through 3D reference points and multi-layer refinement adaptations. Leveraging the high rendering speed of 3D Gaussian splatting, DIG3D provides an accurate and efficient solution for 3D reconstruction from single-view images. On the ShapeNet SRN dataset (category level) and the Google Scanned Objects dataset (open-category level), DIG3D outperforms previous methods by over 3%, achieving PSNRs of 24.96 (averaged over the ShapeNet SRN chairs and cars datasets) and 21.70, respectively. Furthermore, our method reconstructs at 7.2 FPS, at least 10 times faster than other Transformer-based models, which run at only around 0.5 FPS.
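To make the reference-point adaptation concrete, below is a minimal sketch (PyTorch, assuming a pinhole camera; the function name and arguments are illustrative, not taken from the paper's code) of how 3D Gaussian centers can be projected onto the image feature plane to obtain normalized 2D reference points for deformable attention.

import torch

def project_reference_points(centers, K, R, t, feat_hw):
    """Project 3D Gaussian centers onto the image feature plane.

    centers: (N, 3) Gaussian centers in world coordinates.
    K: (3, 3) camera intrinsics; R: (3, 3) rotation; t: (3,) translation.
    feat_hw: (H, W) of the feature map, used to normalize to [-1, 1].
    Returns (N, 2) reference points in normalized coordinates.
    """
    cam = centers @ R.T + t                        # world -> camera frame
    pix = cam @ K.T                                # camera -> homogeneous pixels
    uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)  # perspective divide
    h, w = feat_hw
    # normalize pixel coordinates to [-1, 1] for feature sampling
    ref = torch.stack([2 * uv[:, 0] / w - 1, 2 * uv[:, 1] / h - 1], dim=-1)
    return ref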


Method


(a) Overview of DIG3D. -->: steps not used during inference. (b) Detailed structure of the feature fusion in the encoder. (c) Detailed structure of one decoder layer. Queries are updated at each layer and serve as input to the next layer, while the reference points are updated from the new Gaussian centers and projected onto the image feature plane. DFA: deformable cross-attention layer; FFN: feed-forward network; ⊕: update of the 3D Gaussians.
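To illustrate the per-layer refinement, here is a minimal PyTorch sketch of one decoder layer. It makes simplifying assumptions: DFA is approximated by bilinear feature sampling at the projected reference points (a stand-in for full deformable attention), and the 14-dimensional Gaussian parameterization (center, scale, rotation quaternion, opacity, RGB) and all module names are illustrative rather than the paper's actual code.

import torch.nn as nn
import torch.nn.functional as F

class GaussianRefineLayer(nn.Module):
    """Sketch of one decoder layer: feature lookup at reference points,
    FFN, and a head predicting a residual Gaussian update."""
    def __init__(self, dim=256, gauss_dim=14):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.head = nn.Linear(dim, gauss_dim)  # delta: center/scale/rot/opacity/RGB
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, queries, feat_map, ref_uv):
        # sample (1, C, H, W) encoder features at (N, 2) normalized
        # reference points -- the role DFA plays in the paper, simplified
        sampled = F.grid_sample(feat_map, ref_uv.view(1, -1, 1, 2),
                                align_corners=False)
        sampled = sampled.squeeze(-1).squeeze(0).T   # -> (N, C)
        queries = self.norm1(queries + sampled)
        queries = self.norm2(queries + self.ffn(queries))
        delta = self.head(queries)  # residual Gaussian update (the ⊕ step)
        # the caller adds `delta` to the current Gaussians, then re-projects
        # the updated centers to refresh `ref_uv` for the next layer
        return queries, delta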


Results

Category-specific dataset: ShapeNet SRN Chairs


Category-specific dataset: ShapeNet SRN Cars


Open-category dataset: GSO


Comparisons


Speed


Inference time comparison on ShapeNet SRN. 3D: 3D reconstruction; R: rendering. Inference: from a single image to 250 novel views. All times in seconds.


Analysis



Centres of the 3D Gaussians, visualized as a point cloud.
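For readers who want to inspect the predicted geometry themselves, the helper below (illustrative, not part of DIG3D) writes Gaussian centers to an ASCII PLY file that any point-cloud viewer, e.g. MeshLab, can open.

import numpy as np

def save_centers_as_ply(centers, path="gaussian_centers.ply"):
    """Write Gaussian centers (an (N, 3) ndarray) to an ASCII PLY point cloud."""
    header = "\n".join([
        "ply", "format ascii 1.0",
        f"element vertex {len(centers)}",
        "property float x", "property float y", "property float z",
        "end_header",
    ])
    with open(path, "w") as f:
        f.write(header + "\n")
        np.savetxt(f, centers, fmt="%.6f")  # one "x y z" line per center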



BibTeX

@article{wu2024dig3d,
  title={DIG3D: Marrying Gaussian Splatting with Deformable Transformer for Single Image 3D Reconstruction},
  author={Wu, Jiamin and Liu, Kenkun and Gao, Han and Jiang, Xiaoke and Zhang, Lei},
  journal={arXiv preprint arXiv:2404.16323},
  year={2024}
}