Authors
CHEN CAO, TOMAS SIMON, JIN KYU KIM, GABE SCHWARTZ, MICHAEL ZOLLHOEFER, SHUNSUKE SAITO, STEPHEN LOMBARDI, SHIH-EN WEI, DANIELLE BELKO, SHOOU-I YU, YASER SHEIKH, and JASON SARAGIH
Reality Labs
Summary
We present a novel approach for creating volumetric avatars from only a short phone capture. The resulting avatars produce high-fidelity renderings from novel viewpoints in real time and can generate novel animations using a common latent space of expressions.
Abstract
Creating photorealistic avatars of existing people currently requires extensive person-specific data capture, which is usually only accessible to the VFX industry and not the general public. Our work aims to address this drawback by relying only on a short mobile phone capture to obtain a drivable 3D head avatar that matches a person’s likeness faithfully. In contrast to existing approaches, our architecture avoids the complex task of directly modeling the entire manifold of human appearance, aiming instead to generate an avatar model that can be specialized to novel identities using only small amounts of data. The model dispenses with low-dimensional latent spaces that are commonly employed for hallucinating novel identities, and instead uses a conditional representation that can extract person-specific information at multiple scales from a high-resolution registered neutral phone scan. We achieve high-quality results through the use of a novel universal avatar prior that has been trained on high-resolution multi-view video captures of facial performances of hundreds of human subjects. By fine-tuning the model using inverse rendering, we achieve increased realism and personalize its range of motion. The output of our approach is not only a high-fidelity 3D head avatar that matches the person’s facial shape and appearance, but one that can also be driven using a jointly discovered shared global expression space with disentangled controls for gaze direction. Via a series of experiments, we demonstrate that our avatars are faithful representations of the subject’s likeness. Compared to other state-of-the-art methods for lightweight avatar creation, our approach exhibits superior visual quality and animatability.
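To make the conditioning idea concrete, the snippet below is a minimal PyTorch-style sketch of extracting person-specific features at multiple scales from a registered neutral scan, in place of a single low-dimensional identity code. All module names, channel counts, and map resolutions here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    """Extracts multi-scale identity features from a registered neutral scan.

    The input stacks a neutral texture map with a 3-channel geometry
    (position) map, giving 6 channels; each stage halves the resolution
    and emits one feature map, so identity information is available to
    the avatar decoder at several scales.
    """
    def __init__(self, channels=(6, 32, 64, 128)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
                nn.LeakyReLU(0.2),
            )
            for cin, cout in zip(channels[:-1], channels[1:])
        )

    def forward(self, neutral_maps):
        feats = []
        x = neutral_maps
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # one feature map per scale
        return feats

# Identity features are computed once per person from the phone scan and
# reused for every expression and viewpoint at driving time.
encoder = IdentityEncoder()
neutral_maps = torch.randn(1, 6, 256, 256)  # texture + geometry image
id_feats = encoder(neutral_maps)
print([tuple(f.shape) for f in id_feats])   # three maps, 128^2 down to 32^2
```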
Contribution
- A system for producing a lifelike avatar of a person, with unprecedented appearance, structure, and motion quality compared to existing approaches
- A novel hypernetwork architecture that can produce high-quality, expressive avatars of a person from their neutral texture and geometry while preserving person-specific details (see the sketch after this list). The resulting avatar has a consistent expression latent space with disentangled controls for viewpoint, expression, and gaze direction. The model is robust against real-world variations in the conditioning signal, including variations due to lighting, sensor noise, and limited resolution
- An inverse-rendering strategy that specializes the avatar’s expression space to the user given additional frontal cellphone captures, while ensuring viewpoint generalizability and preserving the latent space’s semantics
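As a concrete reading of the hypernetwork contribution above, the following hedged PyTorch sketch shows identity features predicting per-scale bias maps that are injected into a decoder block, so person-specific detail enters at every resolution rather than through one global code. The class name, layer choices, and shapes are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConditionedDecoderBlock(nn.Module):
    """One decoder upsampling block whose bias map is predicted from an
    identity feature map of the matching scale (hypernetwork-style
    conditioning), instead of being a learned per-identity parameter."""
    def __init__(self, cin, cout, cid):
        super().__init__()
        self.up = nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1, bias=False)
        self.bias_from_id = nn.Conv2d(cid, cout, kernel_size=1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x, id_feat):
        # The predicted bias map carries person-specific detail into the
        # shared, cross-identity decoder weights.
        return self.act(self.up(x) + self.bias_from_id(id_feat))

block = ConditionedDecoderBlock(cin=128, cout=64, cid=64)
x = torch.randn(1, 128, 16, 16)        # expression-driven decoder features
id_feat = torch.randn(1, 64, 32, 32)   # identity feature at matching scale
print(block(x, id_feat).shape)         # torch.Size([1, 64, 32, 32])
```

Because the expression code and the identity conditioning enter through separate pathways, the expression latent space can remain shared across identities, which is what makes the disentangled driving controls possible.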
Related Work
Classical 3D/4D Face Reconstruction; Parametric Face Models; 2D Neural Rendering of Human Heads; 3D Neural Rendering of Human Heads; Light-weight Avatar Generation
Comparisons
Stylized avatar creation; paGAN; RGB video-based avatar creation
Overview
Method overview. (a) We employ a large corpus of multi-view facial performances to train a cross-identity hypernetwork that can generate volumetric avatar representations. (b) The representation can be specialized to unseen individuals by conditioning on a lightweight capture of that person’s neutral expression. (c) We can optionally refine the model via inverse rendering on unstructured captures of an individual’s appearance.
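At its core, step (c) is gradient-based inverse rendering. Below is a minimal, runnable sketch of that refinement loop: the toy avatar module, the tensor shapes, and the photometric L1 loss are placeholders standing in for the paper's volumetric model and its actual objective, and the small learning rate is assumed here as one way to help preserve the semantics of the shared expression space.

```python
import torch
import torch.nn as nn

def finetune(avatar, frames, cameras, expr_codes, steps=100, lr=1e-4):
    """Fine-tune avatar parameters against a person's phone frames by
    minimizing a photometric loss through differentiable rendering."""
    opt = torch.optim.Adam(avatar.parameters(), lr=lr)
    for step in range(steps):
        i = step % len(frames)
        pred = avatar(expr_codes[i], cameras[i])  # differentiable render
        loss = (pred - frames[i]).abs().mean()    # photometric L1 loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return avatar

# Stand-in avatar: a linear map from (expression code, camera parameters)
# to an image, purely so the loop above is executable end to end.
class ToyAvatar(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16 + 6, 3 * 32 * 32)

    def forward(self, expr, cam):
        return self.net(torch.cat([expr, cam], dim=-1)).view(3, 32, 32)

frames = [torch.rand(3, 32, 32) for _ in range(4)]   # captured frames
cameras = [torch.randn(6) for _ in range(4)]         # pose placeholders
expr_codes = [torch.randn(16) for _ in range(4)]     # tracked expressions
finetune(ToyAvatar(), frames, cameras, expr_codes)
```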