3D Gaussian visibility-guided view selection for novel view synthesis

School of Science | Master's thesis

Language

en

Pages

60

Abstract

This thesis studies inference-time view selection for multi-view diffusion-based novel view synthesis in the long-context setting. Modern multi-view diffusion models are typically trained with a fixed conditioning budget, yet real scenes at test time can provide tens to hundreds of candidate reference views, far beyond what the model can ingest. This mismatch between abundant observations and bounded conditioning makes selecting informative references a key factor for generation quality. We consider a two-stage pipeline in which a feed-forward 3D Gaussian Splatting (3DGS) reconstruction backbone estimates camera poses and a lightweight geometric proxy from unordered input views, followed by a multi-view diffusion model that synthesizes target views conditioned on a small selected subset. Our main contribution is a training-free, plug-and-play selection policy that leverages explicit 3D visibility from the reconstructed Gaussians: for each target pose, it estimates which candidate views best explain the target-visible surfaces while accounting for occlusions, and prioritizes complementary coverage under a fixed budget. Concretely, the method reuses the 3DGS rasterizer to render a target-view index map that assigns each visible pixel to the source view associated with the front-most visible surface, thereby providing explicit occlusion-aware visibility cues. It then aggregates these pixel-level votes into per-view coverage scores and ranks candidate references by their geometric utility under the target pose. The selector requires no retraining or fine-tuning of the reconstruction or diffusion backbones and can be applied across different backbone combinations. Experiments on RealEstate10K and a curated dataset of challenging scenes show that visibility-aware selection achieves consistent improvements over representative baselines, with particularly strong gains under sparse conditioning and in occluded scenes. Ablation studies confirm that both spatial diversity and pose-based backfill contribute meaningfully to final performance. These results highlight the practical benefits of exploiting explicit 3D Gaussian visibility for inference-time view selection in multi-view generative models.
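The vote-aggregation step described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name, the `-1` convention for pixels with no visible surface, and the plain top-k ranking are assumptions; the actual method additionally accounts for spatial diversity and pose-based backfill.

```python
import numpy as np

def rank_candidate_views(index_map, num_candidates, budget):
    """Rank candidate reference views by occlusion-aware coverage.

    index_map : (H, W) int array rendered from the target pose; each pixel
                stores the id of the candidate view associated with the
                front-most visible Gaussian there, or -1 where no surface
                is visible (assumed convention).
    Returns the ids of the top-`budget` views by pixel-vote count.
    """
    # Aggregate pixel-level votes into per-view coverage scores.
    votes = np.bincount(index_map[index_map >= 0], minlength=num_candidates)
    # Rank candidates by coverage, highest first, and keep the budget.
    order = np.argsort(votes)[::-1]
    return [int(v) for v in order[:budget] if votes[v] > 0]
```

In this simplified form each pixel votes for exactly one source view, so ranking by vote count directly rewards views that explain the largest portions of the target-visible surface.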

Supervisor

Kannala, Juho

Thesis advisor

Zhang, Yejun
