Authors
Mohammed Suhail, Carlos Esteves, Leonid Sigal, Ameesh Makadia
University of British Columbia; Vector Institute for AI; Canada CIFAR AI Chair; Google
Summary
This paper introduces a two-stage transformer-based light field model that accurately renders view-dependent effects from only a sparse set of input views.
Abstract
Classical light field rendering for novel view synthesis can accurately reproduce view-dependent effects such as reflection, refraction, and translucency, but requires a dense view sampling of the scene. Methods based on geometric reconstruction need only sparse views, but cannot accurately model non-Lambertian effects. We introduce a model that combines the strengths and mitigates the limitations of these two directions. By operating on a four-dimensional representation of the light field, our model learns to represent view-dependent effects accurately. By enforcing geometric constraints during training and inference, the scene geometry is implicitly learned from a sparse set of views. Concretely, we introduce a two-stage transformer-based model that first aggregates features along epipolar lines, then aggregates features along reference views to produce the color of a target ray. Our model outperforms the state-of-the-art on multiple forward-facing and 360° datasets, with larger margins on scenes with severe view-dependent variations.
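As a rough illustration of the epipolar constraint mentioned above, the sketch below samples depths along a target ray and projects the resulting 3D points into a reference view with a pinhole camera model; the projections trace the epipolar line where per-view features would be gathered. The function name, the uniform depth sampling, and the NumPy interface are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sample_epipolar_points(ray_o, ray_d, K_ref, w2c_ref, near, far, num_samples=32):
    """Project points sampled along a target ray into one reference view.

    ray_o, ray_d : (3,) target-ray origin and direction in world coordinates.
    K_ref        : (3, 3) reference-camera intrinsics.
    w2c_ref      : (4, 4) world-to-camera extrinsics of the reference view.
    near, far    : depth range along the target ray to sample.
    Returns pixel coordinates of the samples (they lie on the epipolar line)
    and the corresponding depths. Points behind the camera are not handled here.
    """
    # Sample depths uniformly between the near and far bounds.
    depths = np.linspace(near, far, num_samples)
    pts_world = ray_o[None, :] + depths[:, None] * ray_d[None, :]            # (S, 3)

    # Transform the sampled points into the reference camera frame.
    pts_h = np.concatenate([pts_world, np.ones((num_samples, 1))], axis=1)   # (S, 4)
    pts_cam = (w2c_ref @ pts_h.T).T[:, :3]                                   # (S, 3)

    # Perspective projection onto the reference image plane.
    pix_h = (K_ref @ pts_cam.T).T                                            # (S, 3)
    pix = pix_h[:, :2] / pix_h[:, 2:3]                                       # (S, 2)
    return pix, depths
```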
Contribution
- Our main contribution is a novel light-field-based neural view synthesis model, capable of photorealistic modeling of non-Lambertian effects (e.g., specularities and translucency). To address the core challenge of the sparsity of input views, we leverage an inductive bias in the form of a multi-view geometric constraint, namely epipolar geometry, and a transformer-based ray fusion. The resulting model produces higher-fidelity renderings for forward-facing as well as 360° captures compared to the state of the art, achieving up to a 5 dB improvement on the most challenging scenes. Further, as a byproduct of our design, we can easily obtain dense correspondences and depth without further modifications, as well as a transparent visualization of the rendering process itself. Through ablations we illustrate the importance of our individual design choices.
Related Works
Light field rendering; Neural scene representation; Image-based rendering
Comparisons
LLFF, NeRF, IBRNet, NeX, Mip-NeRF
Overview
Given a target ray to render, the method identifies reference views and sample points along the epipolar lines corresponding to the target ray. The features of these epipolar points, along with the light field coordinates of the target ray, are the inputs to the epipolar aggregation stage, which independently aggregates features along the epipolar line of each reference view to produce per-view features. These reference view features, together with the target ray, are passed to the view aggregation stage, which combines them to predict the color of the target ray.
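A minimal PyTorch-style sketch of this two-stage aggregation is shown below; the module names, mean pooling, feature dimensions, and the use of standard TransformerEncoder layers are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TwoStageLightFieldRenderer(nn.Module):
    """Sketch of the two-stage aggregation: along epipolar lines, then across views."""

    def __init__(self, feat_dim=128, ray_dim=6, n_heads=4, n_layers=2):
        super().__init__()
        # Embed per-point epipolar features together with the target-ray coordinates.
        self.point_embed = nn.Linear(feat_dim + ray_dim, feat_dim)
        # Stage 1: aggregate along each epipolar line.
        self.epipolar_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True), n_layers)
        # Stage 2: aggregate across reference views.
        self.view_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True), n_layers)
        self.to_rgb = nn.Linear(feat_dim, 3)

    def forward(self, epi_feats, target_ray):
        # epi_feats:  (V, S, feat_dim)  features of S epipolar samples in V reference views
        # target_ray: (ray_dim,)        light field coordinates of the target ray
        V, S, _ = epi_feats.shape
        ray = target_ray.expand(V, S, -1)
        tokens = self.point_embed(torch.cat([epi_feats, ray], dim=-1))    # (V, S, F)

        # Stage 1: process each reference view's epipolar samples independently,
        # then pool to a single feature per reference view.
        per_view = self.epipolar_transformer(tokens).mean(dim=1)          # (V, F)

        # Stage 2: combine the reference-view features and predict the ray color.
        fused = self.view_transformer(per_view.unsqueeze(0))              # (1, V, F)
        return torch.sigmoid(self.to_rgb(fused.mean(dim=1)))              # (1, 3) RGB
```

In this sketch the target-ray coordinates are concatenated to every epipolar token so that both stages are conditioned on the ray being rendered, mirroring the description above.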