UniG: Modelling Unitary 3D Gaussians for View-Consistent 3D Reconstruction

1Hong Kong University of Science and Technology,
2International Digital Economy Academy (IDEA),
3The Chinese University of Hong Kong (Shenzhen),
4Tsinghua University
arXiv:2410.13195

*Equal Contribution

Corresponding Author

3D object reconstruction from sparse-view images


Abstract

In this work, we present UniG, a view-consistent 3D reconstruction and novel view synthesis model that generates a high-fidelity representation of 3D Gaussians from sparse images. Existing 3D Gaussian-based methods usually regress per-pixel Gaussians for each view independently and merge the per-view Gaussians through point concatenation. Such a view-independent reconstruction approach often leads to view inconsistency, where the predicted positions of the same 3D point from different views disagree. To address this problem, we develop a DETR (DEtection TRansformer)-like framework that treats 3D Gaussians as decoder queries and updates their parameters layer by layer by performing multi-view deformable attention (MVDFA) over multiple input images. In this way, multiple views naturally contribute to modelling a unitary representation of 3D Gaussians, making the reconstruction more view-consistent. Moreover, since the number of 3D Gaussians used as decoder queries is independent of the number of input views, our model supports an arbitrary number of input images without memory explosion. Extensive experiments validate the advantages of our approach, showing superior performance over existing methods both quantitatively (improving PSNR by 4.2 dB when trained on Objaverse and tested on the GSO benchmark) and qualitatively.
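To make the query-based update concrete, below is a minimal PyTorch sketch of one decoder layer in which a single shared set of Gaussian queries attends over features from all input views. It substitutes standard multi-head attention for the paper's MVDFA and spatial-efficient self-attention (SESA), and all class and argument names are illustrative assumptions, not the released implementation.

import torch.nn as nn

class GaussianDecoderLayer(nn.Module):
    """One decoder layer: a unitary set of 3D Gaussian queries attends over all views.
    Plain attention stands in for the paper's MVDFA/SESA modules."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, view_feats):
        # queries:    (B, N, dim)      -- one shared set of Gaussian queries
        # view_feats: (B, V * HW, dim) -- features from all V input views, flattened
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.norm2(q + self.cross_attn(q, view_feats, view_feats)[0])
        return self.norm3(q + self.ffn(q))

Because the query set is shared across views and its size N is fixed, adding input views only lengthens the key/value sequence, which is why memory does not grow with the number of Gaussians per view.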


Method


Overall Framework: In the coarse stage, 3D Gaussians are produced for each pixel of randomly sampled views from the input data. In the refinement stage, the 3D Gaussians from the coarse stage serve as the initialization for the refinement network. Multi-view features extracted by the feature extractor serve as the keys and values of the decoder. Queries are updated by each decoder layer using the image features and the positions of the 3D Gaussian centers. The final 3D Gaussian representation is regressed from the queries. MVDFA: multi-view deformable attention. SESA: spatial-efficient self-attention. FFN: feed-forward network.
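The last step of the pipeline, regressing Gaussian parameters from the refined queries, could look like the sketch below. The specific heads, activations, and the bounded residual update to the coarse-stage centers are assumptions for illustration, not the paper's exact parameterization.

import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Regress 3D Gaussian parameters from refined decoder queries (illustrative)."""

    def __init__(self, dim=256, color_dim=3):
        super().__init__()
        self.to_offset = nn.Linear(dim, 3)      # residual update to Gaussian centers
        self.to_scale = nn.Linear(dim, 3)
        self.to_rot = nn.Linear(dim, 4)         # rotation as a quaternion
        self.to_opacity = nn.Linear(dim, 1)
        self.to_color = nn.Linear(dim, color_dim)

    def forward(self, queries, centers):
        # queries: (B, N, dim); centers: (B, N, 3) initialized by the coarse stage
        new_centers = centers + 0.1 * torch.tanh(self.to_offset(queries))  # bounded residual
        scales = torch.exp(self.to_scale(queries).clamp(max=4))            # strictly positive
        rots = nn.functional.normalize(self.to_rot(queries), dim=-1)       # unit quaternion
        opacity = torch.sigmoid(self.to_opacity(queries))                  # in (0, 1)
        color = self.to_color(queries)
        return new_centers, scales, rots, opacity, color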


Comparisons


Inference Time


Inference time comparison. 3D: 3D Gaussian reconstruction time; render: rendering time; inference: time for one inference, i.e., one forward pass plus 32 renderings. All times in seconds.
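For reference, a timing protocol matching this breakdown could be measured as in the hedged sketch below; `model`, `render_fn`, and `cameras` are hypothetical stand-ins, not APIs from the released code.

import time
import torch

@torch.no_grad()
def time_inference(model, images, render_fn, cameras, n_renders=32, device="cuda"):
    """Time one forward pass (3D reconstruction) plus n_renders renderings, in seconds."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    gaussians = model(images.to(device))      # 3D Gaussian reconstruction (one forward)
    torch.cuda.synchronize()
    t_recon = time.perf_counter() - t0

    t0 = time.perf_counter()
    for cam in cameras[:n_renders]:           # render n_renders novel views
        render_fn(gaussians, cam)
    torch.cuda.synchronize()
    t_render = (time.perf_counter() - t0) / n_renders  # average per-view render time

    return {"3D": t_recon, "render": t_render,
            "inference": t_recon + t_render * n_renders}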


BibTeX

@article{wu2024unig,
  title={UniG: Modelling Unitary 3D Gaussians for View-consistent 3D Reconstruction},
  author={Wu, Jiamin and Liu, Kenkun and Shi, Yukai and Jiang, Xiaoke and Yao, Yuan and Zhang, Lei},
  journal={arXiv preprint arXiv:2410.13195},
  year={2024}
}