Authors
Tero Karras, Samuli Laine, Timo Aila
NVIDIA
Summary
Our generator architecture makes it possible to control image synthesis via scale-specific modifications to the styles. The generator starts from a learned constant input and adjusts the “style” of the image at each convolution layer based on the latent code, thereby directly controlling the strength of image features at different scales. Combined with noise injected directly into the network, this architectural change leads to automatic, unsupervised separation of high-level attributes (e.g., pose, identity) from stochastic variation (e.g., freckles, hair) in the generated images, and enables intuitive scale-specific mixing and interpolation operations.
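To make the style mechanism concrete, below is a minimal PyTorch sketch of adaptive instance normalization (AdaIN) as the paper describes it: a learned affine transform (the “A” block) maps the intermediate latent code to per-channel scale and bias values, which re-style each normalized feature map. The class and parameter names here are illustrative, not taken from the official implementation.

    import torch
    import torch.nn as nn

    class AdaIN(nn.Module):
        """Adaptive instance normalization: normalize each feature map,
        then apply a per-channel scale and bias derived from the style."""
        def __init__(self, latent_dim, num_channels):
            super().__init__()
            # "A" in the paper: a learned affine transform from the
            # intermediate latent w to per-channel (scale, bias) pairs.
            self.affine = nn.Linear(latent_dim, 2 * num_channels)

        def forward(self, x, w):
            # x: (batch, channels, height, width); w: (batch, latent_dim)
            style = self.affine(w)               # (batch, 2 * channels)
            scale, bias = style.chunk(2, dim=1)  # two (batch, channels) tensors
            scale = scale[:, :, None, None]
            bias = bias[:, :, None, None]
            # Instance normalization: zero mean, unit variance per feature
            # map and per sample, computed over the spatial dimensions.
            mean = x.mean(dim=(2, 3), keepdim=True)
            std = x.std(dim=(2, 3), keepdim=True) + 1e-8
            return scale * (x - mean) / std + bias

Because each convolution layer has its own AdaIN, feeding different latent codes to different layers is what enables the scale-specific mixing and interpolation operations mentioned above.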
Abstract
We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.
Contribution
- We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature.
- We have collected a new dataset of human faces, Flickr-Faces-HQ (FFHQ), consisting of 70,000 high-quality images at 1024×1024 resolution.
Overview
While a traditional generator feeds the latent code through the input layer only, we first map the input to an intermediate latent space W, which then controls the generator through adaptive instance normalization (AdaIN) at each convolution layer. Gaussian noise is added after each convolution, before evaluating the nonlinearity. Here “A” stands for a learned affine transform, and “B” applies learned per-channel scaling factors to the noise input. The mapping network f consists of 8 layers, and the synthesis network g consists of 18 layers (two for each resolution). The output of the last layer is converted to RGB using a separate 1 × 1 convolution, similar to Karras et al. Our generator has a total of 26.2M trainable parameters, compared to 23.1M in the traditional generator.
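As a concrete illustration, here is a minimal PyTorch sketch of one synthesis-network step as described above: a 3×3 convolution, noise scaled by learned per-channel factors (“B”), the nonlinearity, and then AdaIN driven by the learned affine transform (“A”). It reuses the AdaIN sketch from the Summary section; layer sizes and names are assumptions for illustration, not the official implementation.

    import torch
    import torch.nn as nn

    class StyledConvBlock(nn.Module):
        """One conv step of the synthesis network g: 3x3 convolution,
        noise scaled by learned per-channel factors ("B"), nonlinearity,
        then style modulation via AdaIN (driven by "A")."""
        def __init__(self, in_channels, out_channels, latent_dim=512):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, out_channels,
                                  kernel_size=3, padding=1)
            # "B": learned per-channel scaling factors for the noise input.
            self.noise_scale = nn.Parameter(torch.zeros(1, out_channels, 1, 1))
            self.adain = AdaIN(latent_dim, out_channels)  # sketched above
            self.act = nn.LeakyReLU(0.2)

        def forward(self, x, w):
            x = self.conv(x)
            # Gaussian noise is added after the convolution,
            # before evaluating the nonlinearity.
            noise = torch.randn(x.size(0), 1, x.size(2), x.size(3),
                                device=x.device)
            x = x + self.noise_scale * noise
            x = self.act(x)
            return self.adain(x, w)

    # Mapping network f: 8 fully connected layers taking z to w
    # (both 512-dimensional in the paper).
    latent_dim = 512
    mapping = nn.Sequential(
        *[m for _ in range(8)
          for m in (nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2))]
    )

    # The synthesis network g starts from a learned 4x4 constant input and
    # stacks 18 such conv layers, two per resolution; a separate 1x1
    # convolution converts the final activations to RGB.
    const_input = nn.Parameter(torch.randn(1, 512, 4, 4))
    block = StyledConvBlock(512, 512)
    w = mapping(torch.randn(2, latent_dim))          # batch of 2 latents
    x = block(const_input.expand(2, -1, -1, -1), w)  # (2, 512, 4, 4)

Note the design point the sketch makes explicit: the latent code never enters through an input layer; it reaches the image only through the per-layer AdaIN modulations and is therefore applied at a specific scale each time.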