In this work we address the challenging problem of multiview 3D surface reconstruction. We introduce a neural network architecture that simultaneously learns the unknown geometry, camera parameters, and a neural renderer that approximates the light reflected from the surface towards the camera. The geometry is represented as a zero level-set of a neural network, while the neural renderer, derived from the rendering equation, is capable of (implicitly) modeling a wide set of lighting conditions and materials. We trained our network on real-world 2D images of objects with different material properties, lighting conditions, and noisy camera initializations from the DTU MVS dataset. We found our model to produce state-of-the-art 3D surface reconstructions with high fidelity, resolution, and detail.
Related Works
Implicit surface differentiable ray casting; Multi-view surface reconstruction; Neural representation for view synthesis
Recent advances in implicit neural representations and differentiable rendering make it possible to simultaneously recover the geometry and materials of an object from multi-view RGB images captured under unknown static illumination. Despite the promising results achieved, indirect illumination is rarely modeled in previous methods, as it requires expensive recursive path tracing which makes the inverse rendering computationally intractable. In this paper, we propose a novel approach to efficiently recovering spatially-varying indirect illumination. The key insight is that indirect illumination can be conveniently derived from the neural radiance field learned from input images instead of being estimated jointly with direct illumination and materials. By properly modeling the indirect illumination and visibility of direct illumination, interreflection- and shadow-free albedo can be recovered. The experiments on both synthetic and real data demonstrate the superior performance of our approach compared to previous work and its capability to synthesize realistic renderings under novel viewpoints and illumination. Our code and data are available at https://zju3dv.github.io/invrender/.
Related Works
Inverse rendering; Implicit neural representation; Inverse rendering with implicit neural representation; The rendering equation
Neural volume rendering has recently become increasingly popular due to its success in synthesizing novel views of a scene from a sparse set of input images. So far, the geometry learned by neural volume rendering techniques has been modeled using a generic density function. Furthermore, the geometry itself has been extracted using an arbitrary level set of the density function, leading to a noisy, often low-fidelity reconstruction. The goal of this paper is to improve geometry representation and reconstruction in neural volume rendering. We achieve that by modeling the volume density as a function of the geometry. This is in contrast to previous work modeling the geometry as a function of the volume density. In more detail, we define the volume density function as Laplace's cumulative distribution function (CDF) applied to a signed distance function (SDF) representation. This simple density representation has three benefits: (i) it provides a useful inductive bias to the geometry learned in the neural volume rendering process; (ii) it facilitates a bound on the opacity approximation error, leading to an accurate sampling of the viewing ray, which is important for a precise coupling of geometry and radiance; and (iii) it allows efficient unsupervised disentanglement of shape and appearance in volume rendering. Applying this new density representation to challenging scene multiview datasets produced high-quality geometry reconstructions, outperforming relevant baselines. Furthermore, switching shape and appearance between scenes is possible due to the disentanglement of the two.
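The density definition described above (Laplace's CDF applied to an SDF) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the parameter names `alpha` (density scale) and `beta` (Laplace scale), the sign convention (SDF negative inside the object, so the CDF is applied to the negated distance), and the analytic `sphere_sdf` standing in for a learned network are all assumptions made for the example.

```python
import math

def laplace_cdf(s: float, beta: float) -> float:
    """CDF of a zero-mean Laplace distribution with scale beta > 0."""
    if s <= 0.0:
        return 0.5 * math.exp(s / beta)
    return 1.0 - 0.5 * math.exp(-s / beta)

def density_from_sdf(sdf: float, alpha: float = 1.0, beta: float = 0.1) -> float:
    """Volume density as a scaled Laplace CDF of the negated signed distance.

    Assumed convention: sdf < 0 inside the object, sdf > 0 outside.
    alpha and beta are hypothetical names chosen for this sketch.
    """
    return alpha * laplace_cdf(-sdf, beta)

def sphere_sdf(point, radius: float = 1.0) -> float:
    """Analytic SDF of an origin-centered sphere (stand-in for a learned SDF)."""
    return math.sqrt(sum(c * c for c in point)) - radius

if __name__ == "__main__":
    # Sample the density along a ray that crosses the sphere surface at x = 1.
    for x in (0.0, 0.9, 1.0, 1.1, 2.0):
        d = sphere_sdf((x, 0.0, 0.0))
        print(f"x={x:.1f}  sdf={d:+.2f}  density={density_from_sdf(d):.4f}")
```

With this form the density approaches `alpha` deep inside the object, equals `alpha / 2` exactly on the surface, and decays to zero outside; a smaller `beta` makes the transition sharper.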
Related Works
Neural Scene Representation & Rendering; Multi-view 3D Reconstruction
We learn a latent space for easy capture, consistent interpolation, and efficient reproduction of visual material appearance. When users provide a photo of a stationary natural material captured under flashlight illumination, it is first converted into a latent material code. Then, in the second step, conditioned on the material code, our method produces an infinite and diverse spatial field of BRDF model parameters (diffuse albedo, normals, roughness, specular albedo) that subsequently allows rendering in complex scenes and illuminations, matching the appearance of the input photograph. Technically, we jointly embed all flash images into a latent space using a convolutional encoder and, conditioned on these latent codes, convert random spatial fields into fields of BRDF parameters using a convolutional neural network (CNN). We condition these BRDF parameters to match the visual characteristics (statistics and spectra of visual features) of the input under matching light. A user study compares our approach favorably to previous work, even those with access to BRDF supervision.
We present PhySG, an end-to-end inverse rendering pipeline that includes a fully differentiable renderer and can reconstruct geometry, materials, and illumination from scratch from a set of RGB input images. Our framework represents specular BRDFs and environmental illumination using mixtures of spherical Gaussians, and represents geometry as a signed distance function parameterized as a Multi-Layer Perceptron. The use of spherical Gaussians allows us to efficiently solve for approximate light transport, and our method works on scenes with challenging non-Lambertian reflectance captured under natural, static illumination. We demonstrate, with both synthetic and real data, that our reconstructions not only enable rendering of novel viewpoints, but also physics-based appearance editing of materials and illumination.
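To make the spherical Gaussian (SG) representation mentioned above concrete, the sketch below evaluates a single lobe, a mixture of lobes, and the closed-form integral of a lobe over the sphere, which is one reason SGs make approximate light transport cheap. It uses the common parameterization `amplitude * exp(sharpness * (dot(direction, axis) - 1))`; the function names and the `(axis, sharpness, amplitude)` tuple layout are assumptions for illustration and are not taken from the PhySG code.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def sg_eval(direction, axis, sharpness: float, amplitude: float) -> float:
    """Evaluate one spherical Gaussian lobe at a direction on the sphere."""
    d, a = normalize(direction), normalize(axis)
    cos_angle = sum(x * y for x, y in zip(d, a))
    return amplitude * math.exp(sharpness * (cos_angle - 1.0))

def sg_mixture_eval(direction, lobes) -> float:
    """Sum of lobes; each lobe is a hypothetical (axis, sharpness, amplitude) tuple."""
    return sum(sg_eval(direction, *lobe) for lobe in lobes)

def sg_integral(sharpness: float, amplitude: float) -> float:
    """Closed-form integral of one lobe over the whole unit sphere."""
    return 2.0 * math.pi * amplitude / sharpness * (1.0 - math.exp(-2.0 * sharpness))

if __name__ == "__main__":
    # A toy two-lobe environment: a sharp light overhead and a broad fill light.
    lobes = [((0.0, 0.0, 1.0), 30.0, 5.0), ((1.0, 0.0, 0.0), 2.0, 0.5)]
    print(sg_mixture_eval((0.0, 0.0, 1.0), lobes))
    print(sg_integral(30.0, 5.0))
```

Each lobe peaks at `amplitude` along its axis and falls off faster as `sharpness` grows, so a handful of lobes can cover both narrow highlights and broad ambient light.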
Related Works
Neural Rendering; Material and Environment Estimation; Joint Shape and Appearance Refinement; The Rendering Equation
We describe a technique for real-time rendering of dynamic, spatially-varying BRDFs in static scenes with all-frequency shadows from environmental and point lights. The 6D SVBRDF is represented with a general microfacet model and spherical lobes fit to its 4D spatially-varying normal distribution function (SVNDF). A sum of spherical Gaussians (SGs) provides an accurate approximation with a small number of lobes. Parametric BRDFs are fit on-the-fly using simple analytic expressions; measured BRDFs are fit as a preprocess using nonlinear optimization. Our BRDF representation is compact, allows detailed textures, is closed under products and rotations, and supports reflectance of arbitrarily high specularity. At run-time, SGs representing the NDF are warped to align the half-angle vector to the lighting direction and multiplied by the microfacet shadowing and Fresnel factors. This yields the relevant 2D view slice on-the-fly at each pixel, still represented in the SG basis. We account for macro-scale shadowing using a new, nonlinear visibility representation based on spherical signed distance functions (SSDFs). SSDFs allow per-pixel interpolation of high-frequency visibility without ghosting and can be multiplied by the BRDF and lighting efficiently on the GPU.
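The statement above that the SG representation is closed under products can be checked with a short worked example: the product of two lobes is again a single lobe whose parameters follow in closed form. The sketch assumes the common parameterization `amplitude * exp(sharpness * (dot(direction, axis) - 1))`; the function names and the `(axis, sharpness, amplitude)` tuple layout are hypothetical and not taken from the paper.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def sg_eval(direction, axis, sharpness: float, amplitude: float) -> float:
    """Evaluate one spherical Gaussian lobe at a direction on the sphere."""
    d, a = normalize(direction), normalize(axis)
    cos_angle = sum(x * y for x, y in zip(d, a))
    return amplitude * math.exp(sharpness * (cos_angle - 1.0))

def sg_product(lobe_a, lobe_b):
    """Return the single lobe equal to the pointwise product of two lobes.

    Lobes are (axis, sharpness, amplitude) tuples (layout assumed for this sketch).
    """
    (axis_a, lam_a, mu_a), (axis_b, lam_b, mu_b) = lobe_a, lobe_b
    ua, ub = normalize(axis_a), normalize(axis_b)
    # The exponents add, so the new axis is the sharpness-weighted sum of axes.
    u = tuple(lam_a * x + lam_b * y for x, y in zip(ua, ub))
    lam = math.sqrt(sum(c * c for c in u))
    if lam == 0.0:
        # Opposite axes with equal sharpness give a constant, not a lobe.
        raise ValueError("degenerate product: the result is constant over the sphere")
    axis = tuple(c / lam for c in u)
    # Rescale the amplitude so the (dot - 1) form of the exponent is preserved.
    mu = mu_a * mu_b * math.exp(lam - lam_a - lam_b)
    return axis, lam, mu

if __name__ == "__main__":
    a = ((0.0, 0.0, 1.0), 4.0, 1.5)
    b = ((1.0, 0.0, 0.0), 2.0, 0.5)
    c = sg_product(a, b)
    d = (1.0, 1.0, 1.0)
    print(sg_eval(d, *c), sg_eval(d, *a) * sg_eval(d, *b))
```

Because the result stays in the same basis, products of lighting, BRDF, and visibility terms can be chained without leaving the SG representation.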
We present a technique for approximating isotropic BRDFs and precomputed self-occlusion that enables accurate and efficient prefiltered environment map rendering. Our approach uses a nonlinear approximation of the BRDF as a weighted sum of isotropic Gaussian functions. Our representation requires a minimal amount of storage, can accurately represent BRDFs of arbitrary sharpness, and is above all, efficient to render. We precompute visibility due to self-occlusion and store a low-frequency approximation suitable for glossy reflections. We demonstrate our method by fitting our representation to measured BRDF data, yielding high visual quality at real-time frame rates.
A radiance environment map pre-integrates a constant surface reflectance with the lighting environment. It has been used to generate photo-realistic rendering at interactive speed. However, one of its limitations is that each radiance environment map can only render objects whose surface reflectance matches the one it integrates. We present a ratio-image based technique to use a radiance environment map to render diffuse objects with different surface reflectance properties. This method has the advantage that it does not require the separation of illumination from reflectance, and it is simple to implement and runs at interactive speed. In order to use this technique for human face relighting, we have developed a technique that uses spherical harmonics to approximate the radiance environment map for any given image of a face. Thus we are able to relight face images when the lighting environment rotates. Another benefit of the radiance environment map is that we can interactively modify lighting by changing the coefficients of the spherical harmonics basis. Finally, we can modify the lighting condition of one person's face so that it matches the new lighting condition of a different person's face image, assuming the two faces have similar skin albedos.
We introduce a two-stream model for dynamic texture synthesis. Our model is based on pre-trained convolutional networks (ConvNets) that target two independent tasks: (i) object recognition, and (ii) optical flow prediction. Given an input dynamic texture, statistics of filter responses from the object recognition ConvNet encapsulate the per-frame appearance of the input texture, while statistics of filter responses from the optical flow ConvNet model its dynamics. To generate a novel texture, a randomly initialized input sequence is optimized to match the feature statistics from each stream of an example texture. Inspired by recent work on image style transfer and enabled by the two-stream model, we also apply the synthesis approach to combine the texture appearance from one texture with the dynamics of another to generate entirely novel dynamic textures. We show that our approach generates novel, high quality samples that match both the framewise appearance and temporal evolution of input texture. Finally, we quantitatively evaluate our texture synthesis approach with a thorough user study.