Authors
Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, Bernard Ghanem
KAUST; Snap Inc.; University of Oxford
Summary
Magic123 can reconstruct high-fidelity 3D content with detailed 3D geometry and high rendering resolution (1024 × 1024) from a single unposed image in the wild.
Abstract
We present Magic123, a two-stage coarse-to-fine approach for high-quality, textured 3D mesh generation from a single unposed image in the wild using both 2D and 3D priors. In the first stage, we optimize a neural radiance field to produce a coarse geometry. In the second stage, we adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing texture. In both stages, the 3D content is learned through reference-view supervision and novel views guided by a combination of 2D and 3D diffusion priors. We introduce a single trade-off parameter between the 2D and 3D priors to control exploration (more imaginative) and exploitation (more precise) of the generated geometry. Additionally, we employ textual inversion and monocular depth regularization to encourage consistent appearances across views and to prevent degenerate solutions, respectively. Magic123 demonstrates a significant improvement over previous image-to-3D techniques, as validated through extensive experiments on synthetic benchmarks and diverse real-world images.
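To make the trade-off parameter concrete, the snippet below is a minimal sketch that assumes the parameter linearly blends the guidance from the two priors; the function name and the linear blending form are illustrative assumptions, not the paper's implementation.

```python
import torch

def blend_prior_gradients(grad_2d: torch.Tensor,
                          grad_3d: torch.Tensor,
                          lam: float) -> torch.Tensor:
    """Blend guidance from the 2D prior (e.g. a text-to-image diffusion model)
    and the 3D prior (e.g. a view-conditioned diffusion model).
    Larger lam leans on the 2D prior (exploration, more imaginative);
    smaller lam leans on the 3D prior (exploitation, more precise)."""
    return lam * grad_2d + (1.0 - lam) * grad_3d

# Toy usage with random stand-ins for per-pixel guidance on a rendered novel view.
g2d = torch.randn(1, 3, 128, 128)
g3d = torch.randn(1, 3, 128, 128)
blended = blend_prior_gradients(g2d, g3d, lam=0.4)
```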
Contribution
- We introduce Magic123, a novel image-to-3D pipeline that uses a two-stage coarse-to-fine optimization process to produce high-quality, high-resolution 3D geometry and textures.
- We propose to use 2D and 3D priors simultaneously to generate faithful 3D content from any given image. A single prior-strength parameter controls the trade-off between geometry exploration and exploitation, so users can tune it to generate the desired 3D content.
- Moreover, we identify a balanced trade-off between the 2D and 3D priors that leads to reasonably realistic and detailed 3D reconstructions. Using the same set of parameters for all examples, without any per-example reconfiguration, Magic123 achieves state-of-the-art results in 3D reconstruction from single unposed images in both real-world and synthetic scenarios.
Related Works
Multi-view 3D reconstruction; In-domain single-view 3D reconstruction; Zero-shot single-view 3D reconstruction
Comparisons
Shap-E, 3DFuse, NeuralLift, RealFusion
Overview
Magic123 is a two-stage coarse-to-fine framework for high-quality 3D generation from a reference image. Magic123 is guided by the reference image, constrained by the monocular depth estimation from the image, and driven by a joint 2D and 3D diffusion prior to dream up novel views. At the coarse stage, we optimize an Instant-NGP neural radiance field (NeRF) to reconstruct a coarse geometry. At the fine stage, we initialize a DMTet mesh from the NeRF output and optimize a high-resolution mesh and texture. Textual inversion is used in both stages to generate object-preserving geometry and view-consistent textures.
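Below is a minimal sketch of how one stage of this coarse-to-fine loop could be organized, under stated assumptions: the callables `render_ref`, `render_novel`, `sds_2d`, and `sds_3d` are hypothetical stand-ins for the Instant-NGP/DMTet renderers and the 2D/3D diffusion guidance, the single `lam` blend mirrors the abstract's trade-off parameter, and a plain L2 depth term stands in for the paper's monocular depth regularization; none of this is the authors' actual API.

```python
import torch

def optimize_stage(params, render_ref, render_novel, sds_2d, sds_3d,
                   ref_rgb, ref_depth, lam, steps=1000, lr=1e-3):
    """One optimization stage (coarse NeRF or fine DMTet mesh), written against
    generic callables so the same loop can serve both stages."""
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        # Reference-view supervision keeps the identity of the input image.
        rgb, depth = render_ref(params)
        loss = torch.nn.functional.mse_loss(rgb, ref_rgb)
        # Simple L2 depth term standing in for the monocular depth
        # regularization, which discourages flat or degenerate geometry.
        loss = loss + torch.nn.functional.mse_loss(depth, ref_depth)
        # Novel views are pushed toward the joint 2D/3D prior; the textual
        # inversion token enters through the prompt used inside sds_2d.
        novel = render_novel(params)
        loss = loss + lam * sds_2d(novel) + (1.0 - lam) * sds_3d(novel)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return params
```

In this reading, the coarse stage would call `optimize_stage` with the Instant-NGP parameters and volume renderer, and the fine stage would reuse the same loop with the DMTet mesh parameters initialized from the coarse output.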