UniG: Modelling Unitary 3D Gaussians for View-Consistent 3D Reconstruction

1Hong Kong University of Science and Technology,
2International Digital Economy Academy (IDEA),
3The Chinese University of Hong Kong (Shenzhen),
4Tsinghua University
arXiv:2410.13195

*Equal Contribution

Corresponding Author

3D object reconstruction from sparse-view images


Abstract

In this work, we present UniG, a view-consistent 3D reconstruction and novel view synthesis model that generates a high-fidelity representation of 3D Gaussians from sparse images. Existing 3D Gaussian-based methods usually regress per-pixel Gaussians for each view independently and merge the per-view Gaussians through point concatenation. Such a view-independent reconstruction approach often leads to view inconsistency, where the predicted positions of the same 3D point from different views disagree. To address this problem, we develop a DETR (DEtection TRansformer)-like framework that treats 3D Gaussians as decoder queries and updates their parameters layer by layer by performing multi-view deformable attention (MVDFA) over multiple input images. In this way, multiple views naturally contribute to modelling a unitary representation of 3D Gaussians, making the reconstruction more view-consistent. Moreover, since the number of 3D Gaussians used as decoder queries is independent of the number of input views, our model supports an arbitrary number of input images without memory explosion. Extensive experiments validate the advantages of our approach, showing superior performance over existing methods both quantitatively (improving PSNR by 4.2 dB when trained on Objaverse and tested on the GSO benchmark) and qualitatively.
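To make the query-based update concrete, below is a minimal PyTorch sketch of one decoder layer in which a single shared set of Gaussian queries attends over features from all input views. It substitutes standard multi-head attention for the paper's MVDFA and spatial-efficient self-attention (SESA), and all class and argument names are illustrative assumptions, not the released implementation.

import torch.nn as nn

class GaussianDecoderLayer(nn.Module):
    """One decoder layer: a unitary set of 3D Gaussian queries attends over all views.
    Plain attention stands in for the paper's MVDFA/SESA modules."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, view_feats):
        # queries:    (B, N, dim)      -- one shared set of Gaussian queries
        # view_feats: (B, V * HW, dim) -- features from all V input views, flattened
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.norm2(q + self.cross_attn(q, view_feats, view_feats)[0])
        return self.norm3(q + self.ffn(q))

Because the query set is shared across views and its size N is fixed, adding input views only lengthens the key/value sequence, which is why memory does not grow with the number of Gaussians per view.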


Method


Overall Framework: In the coarse stage, 3D Gaussians are produced for each pixel of randomly sampled views from the input data. In the refinement stage, the 3D Gaussians from the coarse stage serve as the initialization for the refinement network. Multi-view features extracted by the feature extractor serve as the keys and values of the decoder. Queries are updated by each decoder layer using the image features and the positions of the 3D Gaussian centers. The final 3D Gaussian representation is regressed from the queries. MVDFA: multi-view deformable attention. SESA: spatial-efficient self-attention. FFN: feed-forward network.
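The last step of the pipeline, regressing Gaussian parameters from the refined queries, could look like the sketch below. The specific heads, activations, and the bounded residual update to the coarse-stage centers are assumptions for illustration, not the paper's exact parameterization.

import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Regress 3D Gaussian parameters from refined decoder queries (illustrative)."""

    def __init__(self, dim=256, color_dim=3):
        super().__init__()
        self.to_offset = nn.Linear(dim, 3)      # residual update to Gaussian centers
        self.to_scale = nn.Linear(dim, 3)
        self.to_rot = nn.Linear(dim, 4)         # rotation as a quaternion
        self.to_opacity = nn.Linear(dim, 1)
        self.to_color = nn.Linear(dim, color_dim)

    def forward(self, queries, centers):
        # queries: (B, N, dim); centers: (B, N, 3) initialized by the coarse stage
        new_centers = centers + 0.1 * torch.tanh(self.to_offset(queries))  # bounded residual
        scales = torch.exp(self.to_scale(queries).clamp(max=4))            # strictly positive
        rots = nn.functional.normalize(self.to_rot(queries), dim=-1)       # unit quaternion
        opacity = torch.sigmoid(self.to_opacity(queries))                  # in (0, 1)
        color = self.to_color(queries)
        return new_centers, scales, rots, opacity, color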


Comparisons


Inference Time


Inference time comparison. 3D: 3D Gaussian reconstruction time; render: rendering time; inference: time for one inference, i.e., one forward pass plus 32 renderings. All times in seconds.
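For reference, a timing protocol matching this breakdown could be measured as in the hedged sketch below; `model`, `render_fn`, and `cameras` are hypothetical stand-ins, not APIs from the released code.

import time
import torch

@torch.no_grad()
def time_inference(model, images, render_fn, cameras, n_renders=32, device="cuda"):
    """Time one forward pass (3D reconstruction) plus n_renders renderings, in seconds."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    gaussians = model(images.to(device))      # 3D Gaussian reconstruction (one forward)
    torch.cuda.synchronize()
    t_recon = time.perf_counter() - t0

    t0 = time.perf_counter()
    for cam in cameras[:n_renders]:           # render n_renders novel views
        render_fn(gaussians, cam)
    torch.cuda.synchronize()
    t_render = (time.perf_counter() - t0) / n_renders  # average per-view render time

    return {"3D": t_recon, "render": t_render,
            "inference": t_recon + t_render * n_renders}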


BibTeX

@article{wu2024unig,
  title={UniG: Modelling Unitary 3D Gaussians for View-consistent 3D Reconstruction},
  author={Wu, Jiamin and Liu, Kenkun and Shi, Yukai and Jiang, Xiaoke and Yao, Yuan and Zhang, Lei},
  journal={arXiv preprint arXiv:2410.13195},
  year={2024}
}