Authors
Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, Xin Tong
Tsinghua University; Microsoft Research Asia; Beijing Institute of Technology
Summary
This paper presents a hybrid-level weakly-supervised training scheme for CNN-based 3D face reconstruction. The method is fast, accurate, and robust to pose and occlusions, and it produces high-fidelity face textures while preserving the identity information of the input images.
Abstract
Recently, deep learning based 3D face reconstruction methods have shown promising results in both quality and efficiency. However, training deep neural networks typically requires a large volume of data, whereas face images with ground-truth 3D face shapes are scarce. In this paper, we propose a novel deep 3D face reconstruction approach that 1) leverages a robust, hybrid loss function for weakly-supervised learning which takes into account both low-level and perception-level information for supervision, and 2) performs multi-image face reconstruction by exploiting complementary information from different images for shape aggregation. Our method is fast, accurate, and robust to occlusion and large pose. We provide comprehensive experiments on three datasets, systematically comparing our method with fifteen recent methods and demonstrating its state-of-the-art performance.
Contribution
- We propose a CNN-based single-image face reconstruction method that exploits hybrid-level image information for weakly-supervised learning. Our loss consists of a robustified image-level loss and a perception-level loss. We demonstrate the benefit of combining them, and show the state-of-the-art accuracy of our method on multiple datasets, significantly outperforming previous methods trained in a fully supervised fashion. Moreover, we show that with a low-dimensional 3DMM subspace, we are still able to outperform prior art with “unrestricted” 3D representations by an appreciable margin.
- We propose a novel shape confidence learning scheme for multi-image face reconstruction aggregation. Our confidence prediction subnet is also trained in a weakly-supervised fashion without ground-truth labels. We show that our method clearly outperforms naive aggregation (e.g., shape averaging) and some heuristic strategies. To our knowledge, this is the first attempt towards CNN-based 3D face reconstruction and aggregation from an unconstrained image set.
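The hybrid-level loss in the first contribution combines an image-level photometric term (restricted to skin pixels so occluders contribute less) with a perception-level term comparing deep identity features of the rendered and input faces. A minimal NumPy sketch of that combination — function names, weights, and the plain-vector stand-in for the face recognition features are illustrative, not the paper's implementation:

```python
import numpy as np

def photometric_loss(rendered, target, skin_mask):
    """Image-level loss: per-pixel color distance averaged over skin pixels,
    so occlusions (hair, glasses) are down-weighted by the skin mask."""
    # rendered, target: (H, W, 3) float images; skin_mask: (H, W) in {0, 1}
    diff = np.linalg.norm(rendered - target, axis=-1)  # per-pixel L2 in color space
    return (skin_mask * diff).sum() / max(skin_mask.sum(), 1)

def perception_loss(feat_rendered, feat_target):
    """Perception-level loss: cosine distance between identity features
    (extracted by a pre-trained face recognition CNN in the paper;
    plain 1-D vectors stand in for them here)."""
    cos = feat_rendered @ feat_target / (
        np.linalg.norm(feat_rendered) * np.linalg.norm(feat_target))
    return 1.0 - cos

def hybrid_loss(rendered, target, skin_mask, feat_r, feat_t,
                w_photo=1.0, w_id=0.2):
    # Weighted sum of the two levels; weights are placeholders,
    # not the paper's tuned settings.
    return (w_photo * photometric_loss(rendered, target, skin_mask)
            + w_id * perception_loss(feat_r, feat_t))
```

A perfect reconstruction (identical images and identical identity features) drives both terms, and hence the hybrid loss, to zero.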
Related Works
3D Morphable Models; CNN-based 3D face reconstruction
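The 3D Morphable Model referenced above represents a face shape as a mean shape plus linear combinations of identity and expression bases, S = S_mean + B_id·α + B_exp·β. A minimal sketch of that linear model — the basis matrices and dimensions below are toy illustrations, not an actual morphable model's data:

```python
import numpy as np

def morphable_shape(s_mean, b_id, b_exp, alpha, beta):
    """Assemble a 3DMM face shape: S = S_mean + B_id @ alpha + B_exp @ beta.
    s_mean: (3V,) mean shape (V vertices, xyz flattened);
    b_id: (3V, K_id) identity basis; b_exp: (3V, K_exp) expression basis;
    alpha, beta: low-dimensional coefficient vectors the CNN regresses."""
    return s_mean + b_id @ alpha + b_exp @ beta
```

With all coefficients at zero the model reproduces the mean face, which is why regressing the low-dimensional (α, β) pair is a much easier learning target than predicting raw vertices.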
Comparisons
VRN, 3DDFA
Overview
- a) The framework of our method, which consists of a reconstruction network for end-to-end single-image 3D reconstruction and a confidence measurement subnet designed for multi-image reconstruction;
- b) The training pipeline for single images with our proposed hybrid-level loss functions. Our method does not require any ground-truth 3D shapes for training; it leverages only weak supervision signals such as facial landmarks, a skin mask, and a pre-trained face recognition CNN;
- c) The training pipeline for multi-image reconstruction. Our confidence subnet learns to measure the reconstruction confidence for aggregation without any explicit label. The dashed arrows denote error backpropagation for network training.
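The aggregation step in (c) can be sketched as a confidence-weighted average of the per-image identity coefficients: each image's 3DMM identity vector is weighted element-wise by the confidence the subnet predicts for it. A minimal NumPy sketch, with toy shapes in place of the subnet's real outputs:

```python
import numpy as np

def aggregate_shapes(id_coeffs, confidences):
    """Confidence-weighted aggregation of per-image identity coefficients.
    id_coeffs: (N, D) 3DMM identity vectors from N images of one person;
    confidences: (N, D) element-wise non-negative confidence scores
    (predicted by the confidence subnet in the paper).
    Returns the (D,) aggregated identity vector."""
    c = np.asarray(confidences, dtype=float)
    x = np.asarray(id_coeffs, dtype=float)
    return (c * x).sum(axis=0) / c.sum(axis=0)
```

With uniform confidences this reduces to the naive shape averaging the paper compares against; non-uniform confidences let well-posed, unoccluded views dominate each coefficient independently.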