Authors
Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, Chen-Hsuan Lin, Towaki Takikawa, Nicholas Sharp, Tsung-Yi Lin, Ming-Yu Liu, Sanja Fidler, James Lucas
NVIDIA; University of Toronto; Vector Institute
Summary
Our method first trains a single network to output 3D objects consistent with many different text prompts. Afterward, when we receive an unseen prompt, we produce an accurate object in under 1 second on a single GPU. Existing methods re-train the entire network for every prompt, incurring a long delay while the optimization completes. Further, we can interpolate between prompts for user-guided asset generation.
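As a concrete illustration of the interpolation use case, here is a minimal sketch of blending two prompt embeddings at inference time, assuming PyTorch; `encode_text` and `att3d` are hypothetical stand-ins for the text encoder and the trained amortized model, not the paper's actual interface.

```python
# Hedged sketch: linear interpolation between two prompt embeddings.
# `encode_text` and `att3d` are hypothetical placeholders, not the real API.
import torch

def interpolate_prompts(encode_text, att3d, prompt_a, prompt_b, num_steps=8):
    """Blend two prompt embeddings and generate one 3D asset per blend weight."""
    c_a, c_b = encode_text(prompt_a), encode_text(prompt_b)
    assets = []
    for alpha in torch.linspace(0.0, 1.0, num_steps):
        c = (1 - alpha) * c_a + alpha * c_b  # blended text embedding
        assets.append(att3d(c))              # one fast feed-forward pass per blend
    return assets
```

Because each asset comes from a single feed-forward pass rather than a fresh per-prompt optimization, sweeping the blend weight is cheap enough to drive simple animations.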
Abstract
Text-to-3D modeling has seen exciting progress by combining generative text-to-image models with image-to-3D methods like Neural Radiance Fields. DreamFusion recently achieved high-quality results but requires a lengthy, per-prompt optimization to create 3D objects. To address this, we amortize optimization over text prompts by training on many prompts simultaneously with a unified model, instead of separately. With this, we share computation across a prompt set, training in less time than per-prompt optimization. Our framework - Amortized Text-to-3D (ATT3D) - enables sharing of knowledge between prompts to generalize to unseen setups and smooth interpolations between text for novel assets and simple animations.
Contribution
- Generalize to new prompts
- Interpolate between prompts
- Amortize over settings other than text prompts
- Reduce overall training time
Related Works
NeRFs for Image-to-3D; Text-to-Image Generation; Text-to-3D (TT3D) Generation; Amortized Optimization; Image-to-3D Models; Text-to-3D Animation
Overview
We show a schematic of our text-to-3D pipeline, with changes from DreamFusion’s pipeline [1] shown in red and pseudocode in Alg. 1. The text encoder (in green) provides its (potentially cached) text embedding c to the text-to-image DDM and now also to the mapping network m (in red). We use a spatial point encoder γ_{m(c)} (in blue) for our position x, whose parameters are modulations from the mapping network m(c). The final NeRF MLP ν outputs a radiance r given the point encoding: r = ν(γ_{m(c)}(x)), which we render into views. Left: At training time, the rendered views are input to the DDM to provide a training update. The NeRF network ν, the mapping network m, and (effectively) the spatial point encoding γ_{m(c)} are optimized. Right: At inference time, we use the pipeline up to the NeRF for representing the 3D object.
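To make the notation concrete, below is a minimal PyTorch sketch of the forward pass r = ν(γ_{m(c)}(x)). The module names, layer widths, and the Fourier-feature point encoding are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the amortized text-to-3D forward pass described above.
# Assumes PyTorch; module names and layer sizes are hypothetical placeholders.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """m: maps a text embedding c to modulations for the spatial point encoder."""
    def __init__(self, text_dim=512, num_features=64, feature_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, 256), nn.SiLU(),
            nn.Linear(256, num_features * feature_dim),
        )
        self.num_features, self.feature_dim = num_features, feature_dim

    def forward(self, c):
        return self.net(c).view(self.num_features, self.feature_dim)

class ModulatedPointEncoder(nn.Module):
    """gamma_{m(c)}: encodes a 3D position x using per-prompt modulations."""
    def __init__(self, num_features=64):
        super().__init__()
        # Hypothetical choice: random Fourier features of the position,
        # mixed by the prompt-conditional modulation matrix.
        self.register_buffer("freqs", torch.randn(3, num_features))

    def forward(self, x, modulations):
        feats = torch.sin(x @ self.freqs)  # (N, num_features)
        return feats @ modulations         # (N, feature_dim)

class NeRFMLP(nn.Module):
    """nu: maps a point encoding to radiance (RGB) and density."""
    def __init__(self, feature_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.SiLU(),
            nn.Linear(128, 4),  # 3 color channels + 1 density
        )

    def forward(self, enc):
        out = self.net(enc)
        rgb, density = torch.sigmoid(out[:, :3]), torch.relu(out[:, 3:])
        return rgb, density

# Usage: r = nu(gamma_{m(c)}(x)) for a batch of sample points x.
m, gamma, nu = MappingNetwork(), ModulatedPointEncoder(), NeRFMLP()
c = torch.randn(512)              # cached text embedding for one prompt
x = torch.rand(1024, 3) * 2 - 1   # sample points in the scene volume
rgb, density = nu(gamma(x, m(c))) # radiance used by the volume renderer
```

At training time, views rendered from these radiance predictions would be scored by the frozen text-to-image DDM to produce the update that jointly optimizes m, γ, and ν; at inference, only the forward pass above is needed.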