Authors
Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, Chen-Hsuan Lin, Towaki Takikawa, Nicholas Sharp, Tsung-Yi Lin, Ming-Yu Liu, Sanja Fidler, James Lucas
NVIDIA; University of Toronto; Vector Institute
Summary
Our method first trains a single network to output 3D objects consistent with many different text prompts. Afterward, when we receive an unseen prompt, we produce an accurate object in under 1 second on a single GPU. Existing methods re-train the entire network for every prompt, incurring a long delay while the optimization completes. Further, we can interpolate between prompts for user-guided asset generation.
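As a concrete illustration of the interpolation use case, here is a minimal sketch of blending two prompt embeddings at inference time, assuming PyTorch; `encode_text` and `att3d` are hypothetical stand-ins for the text encoder and the trained amortized model, not the paper's actual interface.

```python
# Hedged sketch: linear interpolation between two prompt embeddings.
# `encode_text` and `att3d` are hypothetical placeholders, not the real API.
import torch

def interpolate_prompts(encode_text, att3d, prompt_a, prompt_b, num_steps=8):
    """Blend two prompt embeddings and generate one 3D asset per blend weight."""
    c_a, c_b = encode_text(prompt_a), encode_text(prompt_b)
    assets = []
    for alpha in torch.linspace(0.0, 1.0, num_steps):
        c = (1 - alpha) * c_a + alpha * c_b  # blended text embedding
        assets.append(att3d(c))              # one fast feed-forward pass per blend
    return assets
```

Because each asset comes from a single feed-forward pass rather than a fresh per-prompt optimization, sweeping the blend weight is cheap enough to drive simple animations.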
Abstract
Text-to-3D modeling has seen exciting progress by combining generative text-to-image models with image-to-3D methods like Neural Radiance Fields. DreamFusion recently achieved high-quality results but requires a lengthy, per-prompt optimization to create 3D objects. To address this, we amortize optimization over text prompts by training on many prompts simultaneously with a unified model, instead of separately. With this, we share computation across a prompt set, training in less time than per-prompt optimization. Our framework - Amortized Text-to-3D (ATT3D) - enables sharing of knowledge between prompts to generalize to unseen setups and smooth interpolations between text for novel assets and simple animations.
Contribution
- Generalize to new prompts
- Interpolate between prompts
- Amortize over settings other than text prompts
- Reduce overall training time
Related Works
NeRFs for Image-to-3D; Text-to-Image Generation; Text-to-3D (TT3D) Generation; Amortized Optimization; Image-to-3D Models; Text-to-3D Animation
Overview
We show a schematic of our text-to-3D pipeline, with changes from DreamFusion’s pipeline [1] shown in red and pseudocode in Alg. 1. The text encoder (in green) provides its (potentially cached) text embedding c to the text-to-image DDM and now also to the mapping network m (in red). We use a spatial point encoder γ_{m(c)} (in blue) for our position x, whose parameters are modulations from the mapping network m(c). The final NeRF MLP ν outputs a radiance r given the point encoding: r = ν(γ_{m(c)}(x)), which we render into views. Left: At training time, the rendered views are input to the DDM to provide a training update. The NeRF network ν, the mapping network m, and (effectively) the spatial point encoding γ_{m(c)} are optimized. Right: At inference time, we use the pipeline up to the NeRF for representing the 3D object.
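To make the notation concrete, below is a minimal PyTorch sketch of the forward pass r = ν(γ_{m(c)}(x)). The module names, layer widths, and the Fourier-feature point encoding are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the amortized text-to-3D forward pass described above.
# Assumes PyTorch; module names and layer sizes are hypothetical placeholders.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """m: maps a text embedding c to modulations for the spatial point encoder."""
    def __init__(self, text_dim=512, num_features=64, feature_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, 256), nn.SiLU(),
            nn.Linear(256, num_features * feature_dim),
        )
        self.num_features, self.feature_dim = num_features, feature_dim

    def forward(self, c):
        return self.net(c).view(self.num_features, self.feature_dim)

class ModulatedPointEncoder(nn.Module):
    """gamma_{m(c)}: encodes a 3D position x using per-prompt modulations."""
    def __init__(self, num_features=64):
        super().__init__()
        # Hypothetical choice: random Fourier features of the position,
        # mixed by the prompt-conditional modulation matrix.
        self.register_buffer("freqs", torch.randn(3, num_features))

    def forward(self, x, modulations):
        feats = torch.sin(x @ self.freqs)  # (N, num_features)
        return feats @ modulations         # (N, feature_dim)

class NeRFMLP(nn.Module):
    """nu: maps a point encoding to radiance (RGB) and density."""
    def __init__(self, feature_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.SiLU(),
            nn.Linear(128, 4),  # 3 color channels + 1 density
        )

    def forward(self, enc):
        out = self.net(enc)
        rgb, density = torch.sigmoid(out[:, :3]), torch.relu(out[:, 3:])
        return rgb, density

# Usage: r = nu(gamma_{m(c)}(x)) for a batch of sample points x.
m, gamma, nu = MappingNetwork(), ModulatedPointEncoder(), NeRFMLP()
c = torch.randn(512)              # cached text embedding for one prompt
x = torch.rand(1024, 3) * 2 - 1   # sample points in the scene volume
rgb, density = nu(gamma(x, m(c))) # radiance used by the volume renderer
```

At training time, views rendered from these radiance predictions would be scored by the frozen text-to-image DDM to produce the update that jointly optimizes m, γ, and ν; at inference, only the forward pass above is needed.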