Authors
Aidan Clark, Jeff Donahue, Karen Simonyan
DeepMind London
Summary
We focus on the tasks of video synthesis and video prediction, and aim to extend the strong results of generative image models to the video domain. Building upon the state-of-the-art BigGAN architecture, we introduce an efficient spatio-temporal decomposition of the discriminator which allows us to train on Kinetics-600 – a complex dataset of natural videos an order of magnitude larger than other commonly used datasets.
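The discriminator decomposition mentioned above can be illustrated at the level of what each discriminator sees: the spatial discriminator scores a few randomly sampled full-resolution frames (cost independent of clip length), while the temporal discriminator scores the whole clip at reduced spatial resolution (cost independent of full resolution). A minimal NumPy sketch, assuming the paper's defaults of k = 8 sampled frames and 2× spatial downsampling (the function names are illustrative, not from the paper's code):

```python
import numpy as np

def spatial_discriminator_input(video, k=8, rng=None):
    """video: (T, H, W, C). The spatial discriminator D_S critiques k
    randomly sampled full-resolution frames individually, so its cost
    does not grow with the video length T."""
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(video.shape[0], size=k, replace=False)
    return video[np.sort(idx)]            # shape (k, H, W, C)

def temporal_discriminator_input(video, s=2):
    """The temporal discriminator D_T sees the entire clip, but spatially
    downsampled by a factor s, so its cost does not grow with the full
    spatial resolution."""
    return video[:, ::s, ::s, :]          # shape (T, H//s, W//s, C)

# A 48-frame 64x64 RGB clip, as in the longer DVD-GAN settings.
video = np.zeros((48, 64, 64, 3))
print(spatial_discriminator_input(video).shape)   # (8, 64, 64, 3)
print(temporal_discriminator_input(video).shape)  # (48, 32, 32, 3)
```

Together the two discriminators cover both axes of realism (per-frame detail and motion) while neither ever processes the full video at full resolution, which is what makes training at 256 × 256 and 48 frames tractable.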
Abstract
Generative models of natural images have progressed towards high-fidelity samples by strongly leveraging scale. We attempt to carry this success to the field of video modeling by showing that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video samples of substantially higher complexity and fidelity than previous work. Our proposed model, Dual Video Discriminator GAN (DVD-GAN), scales to longer and higher-resolution videos by leveraging a computationally efficient decomposition of its discriminator. We evaluate on the related tasks of video synthesis and video prediction, achieving a new state-of-the-art Fréchet Inception Distance for prediction on Kinetics-600 and a state-of-the-art Inception Score for synthesis on UCF-101, alongside establishing a strong baseline for synthesis on Kinetics-600.
Contribution
- We propose DVD-GAN – a scalable generative model of natural video which produces high-quality samples at resolutions up to 256 × 256 and lengths up to 48 frames
- We achieve state-of-the-art results for video synthesis on UCF-101 and for video prediction on Kinetics-600
- We establish class-conditional video synthesis on Kinetics-600 as a new benchmark for generative video modeling, and report DVD-GAN results as a strong baseline
Related Work
- Video Synthesis and Prediction
- Generative Adversarial Networks
- Kinetics-600
- Evaluation Metrics