In this paper, we propose DIG3D (Gaussian splatting with deformable Transformer for single-image 3D reconstruction), a novel approach for 3D object reconstruction and novel view synthesis from a single-view RGB image. In contrast to directly regressing 3D Gaussian parameters from per-pixel image features, DIG3D employs an encoder-decoder framework in which the decoder generates 3D Gaussians guided by depth-aware image features from the encoder. This design avoids learning shortcuts that merely reproduce the input view, improving both 3D object geometry and rendering accuracy. In particular, a deformable Transformer is employed to enable efficient and effective decoding through 3D reference points and multi-layer refinement. Leveraging the high speed of 3D Gaussian splatting, DIG3D provides an accurate and efficient solution for 3D reconstruction from single-view images. On the ShapeNet SRN dataset (category level) and the Google Scanned Objects dataset (open-category level), DIG3D outperforms previous methods by over 3%, achieving a PSNR of 24.96 (averaged over the ShapeNet SRN chairs and cars datasets) and 21.70, respectively. Furthermore, our method reaches a 3D reconstruction speed of 7.2 FPS, at least 10 times faster than other Transformer-based models, which run at only around 0.5 FPS.
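To make the pipeline concrete, below is a minimal PyTorch-style sketch of a DIG3D-like forward pass. The class name DIG3DSketch, the stand-in CNN encoder, the plain Transformer decoder (used here in place of the deformable decoder described above), and all shapes are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a DIG3D-style forward pass (illustrative only).
import torch
import torch.nn as nn

class DIG3DSketch(nn.Module):
    def __init__(self, num_gaussians=512, feat_dim=256):
        super().__init__()
        # Encoder: produces depth-aware image features (stand-in: a small CNN).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1),
        )
        # Learnable object queries, one per 3D Gaussian.
        self.queries = nn.Parameter(torch.randn(num_gaussians, feat_dim))
        # Decoder: refines queries against image features (stand-in: plain
        # Transformer decoder; the paper uses deformable cross-attention).
        layer = nn.TransformerDecoderLayer(feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        # Head mapping each query to 3D Gaussian parameters:
        # center (3) + scale (3) + rotation quaternion (4) + opacity (1) + color (3).
        self.gaussian_head = nn.Linear(feat_dim, 3 + 3 + 4 + 1 + 3)

    def forward(self, image):                      # image: (B, 3, H, W)
        feats = self.encoder(image)                # (B, C, h, w)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, h*w, C)
        q = self.queries.unsqueeze(0).expand(image.size(0), -1, -1)
        q = self.decoder(q, tokens)                # (B, N, C)
        return self.gaussian_head(q)               # per-Gaussian parameters

params = DIG3DSketch()(torch.randn(1, 3, 128, 128))
print(params.shape)  # (1, 512, 14); rendering via Gaussian splatting is omitted
```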
(a) Overview of DIG3D. -->: steps not used during inference. (b) Detailed structure of the feature fusion in the encoder. (c) Detailed structure of one decoder layer. Queries are updated at each layer and serve as input to the next layer, while the reference points are updated from the new Gaussian centers and projected onto the image feature plane. DFA: deformable cross-attention layer; FFN: feed-forward network; ⊕: update of the 3D Gaussians.
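The per-layer update in panel (c) can be summarized with the hedged sketch below. It replaces true multi-scale deformable attention with single-scale grid_sample sampling around the projected reference points; DecoderLayerSketch, the offset/weight heads, and the assumption that the intrinsics K map to normalized [0, 1] image coordinates are illustrative and not taken from the paper.

```python
# Sketch of one decoder layer in the spirit of panel (c): deformable
# cross-attention (DFA) around projected reference points, an FFN, and a
# Gaussian update that moves the reference points for the next layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderLayerSketch(nn.Module):
    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.offsets = nn.Linear(dim, num_points * 2)    # 2D sampling offsets
        self.weights = nn.Linear(dim, num_points)        # per-point attention weights
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.delta_center = nn.Linear(dim, 3)            # refine Gaussian centers

    def forward(self, queries, feat_map, centers, K):
        # queries: (B, N, C), feat_map: (B, C, h, w), centers: (B, N, 3), K: (3, 3)
        q = queries + self.self_attn(queries, queries, queries)[0]

        # Project 3D reference points (Gaussian centers) onto the image plane.
        uv = centers @ K.T
        ref = uv[..., :2] / uv[..., 2:].clamp(min=1e-6)  # assumed in [0, 1]

        # Deformable sampling: gather features at reference points + offsets.
        B, N, C = q.shape
        off = self.offsets(q).view(B, N, -1, 2) * 0.05   # small learned offsets
        loc = (ref.unsqueeze(2) + off) * 2 - 1           # grid_sample expects [-1, 1]
        sampled = F.grid_sample(feat_map, loc, align_corners=False)  # (B, C, N, P)
        w = self.weights(q).softmax(-1)                  # (B, N, P)
        q = q + torch.einsum('bcnp,bnp->bnc', sampled, w)

        q = q + self.ffn(q)
        centers = centers + self.delta_center(q)         # ⊕: update the Gaussians
        return q, centers                                # inputs to the next layer

# Toy usage with random features, 16 queries, and identity intrinsics.
layer = DecoderLayerSketch()
q, c = layer(torch.randn(1, 16, 256), torch.randn(1, 256, 32, 32),
             torch.randn(1, 16, 3) + torch.tensor([0., 0., 2.]), torch.eye(3))
```

Re-projecting the updated Gaussian centers as reference points at every layer is what ties query refinement to the evolving 3D geometry, as described in the caption.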
@article{wu2024dig3d,
title={DIG3D: Marrying Gaussian Splatting with Deformable Transformer for Single Image 3D Reconstruction},
author={Wu, Jiamin and Liu, Kenkun and Gao, Han and Jiang, Xiaoke and Zhang, Lei},
journal={arXiv preprint arXiv:2404.16323},
year={2024}
}