Authors
Xiuming Zhang, Sean Fanello, Yun-Ta Tsai, Tiancheng Sun, Tianfan Xue, Rohit Pandey, Sergio Orts-Escolano, Philip Davidson, Christoph Rhemann, Paul Debevec, Jonathan T. Barron, Ravi Ramamoorthi, William T. Freeman
Massachusetts Institute of Technology; Google; University of California, San Diego
Portals
Summary
Neural Light Transport (NLT) learns to interpolate a subject's 6D light transport function, parameterized by the surface UV coordinate (2 DOFs), the incident light direction (2 DOFs), and the viewing direction (2 DOFs). The subject is imaged from multiple viewpoints while lit by different directional lights, and a geometry proxy is captured using active sensors. Querying the learned function at novel light and/or viewing directions enables simultaneous relighting and view synthesis of the subject. The relit renderings that NLT produces can also be combined according to HDRI maps to perform image-based relighting.
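As a mental model of this interface, the snippet below sketches what a single query of the learned function looks like; `nlt_query` and its signature are illustrative placeholders, not the paper's API.

```python
import numpy as np

def nlt_query(uv, light_dir, view_dir):
    """Stand-in for the trained per-subject model: maps a 6D query
    (2-DOF texture-atlas coordinate, 2-DOF light direction, 2-DOF view
    direction) to RGB. A real model would predict radiance; this
    placeholder just returns zeros."""
    return np.zeros(3)

# Simultaneous relighting and view synthesis: change both the light and
# the virtual camera, one query per texel of the texture atlas.
uv = np.array([0.25, 0.70])                      # point in the texture atlas
light_dir = np.array([0.0, 0.5, np.sqrt(0.75)])  # unit vector toward the light
view_dir = np.array([0.6, 0.0, 0.8])             # unit vector toward the camera
rgb = nlt_query(uv, light_dir, view_dir)
```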
Abstract
The light transport (LT) of a scene describes how it appears under different lighting and viewing directions, and complete knowledge of a scene's LT enables the synthesis of novel views under arbitrary lighting. In this paper, we focus on image-based LT acquisition, primarily for human bodies within a light stage setup. We propose a semi-parametric approach to learn a neural representation of LT that is embedded in the space of a texture atlas of known geometric properties, and model all non-diffuse and global LT as residuals added to a physically accurate diffuse base rendering. In particular, we show how to fuse previously seen observations of illuminants and views to synthesize a new image of the same scene under a desired lighting condition from a chosen viewpoint. This strategy allows the network to learn complex material effects (such as subsurface scattering) and global illumination, while guaranteeing the physical correctness of the diffuse LT (such as hard shadows). With this learned LT, one can relight the scene photorealistically with a directional light or an HDRI map, synthesize novel views with view-dependent effects, or do both simultaneously, all in a unified framework using a set of sparse, previously seen observations. Qualitative and quantitative experiments demonstrate that our neural LT (NLT) outperforms state-of-the-art solutions for relighting and view synthesis, without the separate treatment of the two problems that prior work requires.
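The HDRI relighting mentioned above is the standard image-based relighting step: renderings under individual directional lights are combined linearly, with each weight taken from the HDRI radiance (and solid angle) associated with that light's direction. The sketch below assumes precomputed per-light weights; the array and function names are illustrative.

```python
import numpy as np

def relight_with_hdri(directional_renders, hdri_weights):
    """Image-based relighting: linearly combine per-light renderings.

    directional_renders : (L, H, W, 3) renderings, one per directional light
    hdri_weights        : (L, 3) RGB radiance (times solid angle) that the
                          HDRI map assigns to each of the L light directions
    returns             : (H, W, 3) image relit by the full environment map
    """
    return np.einsum('lhwc,lc->hwc', directional_renders, hdri_weights)

# Toy example: 4 lights, a tiny 2x2 image.
renders = np.random.rand(4, 2, 2, 3)
weights = np.random.rand(4, 3)
relit = relight_with_hdri(renders, weights)
```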
Contribution
- An end-to-end, semi-parametric method for learning to interpolate the 6D light transport function per-subject from real data using convolutional neural networks (Section 3.3)
- A unified framework for simultaneous relighting and view synthesis by embedding networks into a parameterized texture atlas and leveraging as input a set of One-Light-at-a-Time (OLAT) images (Section 3.5)
- A set of augmented texture-space inputs and a residual learning scheme on top of a physically accurate diffuse base, which together allow the network to easily learn non-diffuse, higher-order light transport effects including specular highlights, subsurface scattering, and global illumination
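The residual scheme in the last bullet amounts to predicting only what the diffuse base cannot explain; a minimal sketch of the composition step, with assumed array names, is given below.

```python
import numpy as np

def compose_texture(diffuse_base, predicted_residual):
    """Add the network's (signed) residual, covering specularities,
    subsurface scattering, global illumination, etc., to the physically
    accurate diffuse base, both defined in texture space."""
    return np.clip(diffuse_base + predicted_residual, 0.0, 1.0)

# Toy 4x4 texture: the diffuse base already explains most of the image,
# so the residual that remains to be learned is small and signed.
diffuse_base = np.random.rand(4, 4, 3)
predicted_residual = 0.1 * (np.random.rand(4, 4, 3) - 0.5)
texture_rendering = compose_texture(diffuse_base, predicted_residual)
```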
Related Works
Single observation; Multiple views; Multiple illuminants; Multiple views and illuminants
Comparisons
Diffuse Base, Barycentric Blending, Deep Shading, Xu et al., Relightables
Overview
Our network consists of two paths. The “observation paths” take as input a variable number of nearby observations (as texture-space residual maps) sampled around the target light and viewing directions, and encode them into multiscale features that are pooled to remove the dependence on their order and number. These pooled features are then concatenated with the feature activations of the “query path,” which takes as input the desired light and viewing directions (in the form of cosine maps) as well as the physically accurate diffuse base (also in texture space). This path predicts a residual map that is added to the diffuse base to produce the texture-space rendering. With the (differentiable) UV wrapping predefined by the geometry proxy, we then resample the texture-space rendering back into camera space, where the prediction is compared against the ground-truth image. Because the entire network is embedded in the texture space of a subject, the same model can be trained to perform relighting, view synthesis, or both simultaneously, depending on the input and supervision.
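To make this dataflow concrete, here is a minimal, single-scale PyTorch sketch of the two-path design. Layer widths, the max-pooling used to fuse observations, and all module names are assumptions for illustration; the actual model uses multiscale features and a deeper architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoPathSketch(nn.Module):
    """Simplified two-path network: pooled observation features are fused
    with a query path that predicts a residual over the diffuse base."""

    def __init__(self, obs_ch=3, query_ch=2 + 3, feat=32):
        super().__init__()
        # Observation path: a shared encoder applied to each nearby observation.
        self.obs_enc = nn.Sequential(
            nn.Conv2d(obs_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        # Query path: cosine maps of the target light/view plus the diffuse base.
        self.query_enc = nn.Sequential(
            nn.Conv2d(query_ch, feat, 3, padding=1), nn.ReLU())
        # Head that maps the fused features to a texture-space residual.
        self.head = nn.Conv2d(2 * feat, 3, 3, padding=1)

    def forward(self, observations, cosine_maps, diffuse_base, uv_grid):
        # observations: (B, K, C, H, W) texture-space residual maps sampled
        # around the target light and viewing directions.
        b, k, c, h, w = observations.shape
        obs_feat = self.obs_enc(observations.reshape(b * k, c, h, w))
        # Pool over the K observations so the result is invariant to their
        # order and number.
        obs_feat = obs_feat.reshape(b, k, -1, h, w).max(dim=1).values
        query_feat = self.query_enc(torch.cat([cosine_maps, diffuse_base], dim=1))
        residual = self.head(torch.cat([obs_feat, query_feat], dim=1))
        texture = diffuse_base + residual  # texture-space rendering
        # Differentiable resampling from texture space back to camera space,
        # using the UV wrapping predefined by the geometry proxy.
        return F.grid_sample(texture, uv_grid, align_corners=False)

# Toy forward pass: 2 nearby observations, a 16x16 texture atlas, 16x16 image.
model = TwoPathSketch()
obs = torch.rand(1, 2, 3, 16, 16)        # two sampled observations
cos = torch.rand(1, 2, 16, 16)           # cosine maps for target light and view
base = torch.rand(1, 3, 16, 16)          # physically accurate diffuse base
grid = torch.rand(1, 16, 16, 2) * 2 - 1  # UV sampling grid in [-1, 1]
pred = model(obs, cos, base, grid)       # (1, 3, 16, 16) camera-space prediction
```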