Authors
Jiahua Dong; Yu-Xiong Wang
University of Illinois Urbana-Champaign
Summary
We propose ViCA-NeRF, a 3D-aware NeRF editing method that produces more detailed edits under text guidance while also giving the user control over the edit.
Abstract
We introduce ViCA-NeRF, a view-consistency-aware method for 3D editing with text instructions. In addition to the implicit NeRF modeling, our key insight is to exploit two sources of regularization that explicitly propagate the editing information across different views, thus ensuring multi-view consistency. As geometric regularization, we leverage the depth information derived from the NeRF model to establish image correspondences between different views. As learned regularization, we align the latent codes in the 2D diffusion model between edited and unedited images, enabling us to edit key views and propagate the update to the whole scene. Incorporating these two regularizations, our ViCA-NeRF framework consists of two stages. In the initial stage, we blend edits from different views to create a preliminary 3D edit. This is followed by a second stage of NeRF training that is dedicated to further refining the scene's appearance. Experiments demonstrate that ViCA-NeRF provides more flexible and efficient (3× faster) editing with higher levels of consistency and detail, compared with the state of the art.
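To make the geometric regularization concrete, below is a minimal sketch (not the authors' code) of depth-based correspondence: each pixel of a source view is back-projected to 3D using its depth value and then reprojected into a target view. The function name and the pinhole-camera/camera-to-world pose conventions are illustrative assumptions.

```python
import numpy as np

def reproject(depth_src, K, pose_src, pose_dst):
    """Map each pixel of the source view to (u, v) coordinates in the
    destination view, given the source depth map, 3x3 intrinsics K, and
    4x4 camera-to-world poses (illustrative conventions, not the paper's)."""
    h, w = depth_src.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N homogeneous pixels

    # Back-project pixels to 3D points in the source camera frame, then to world.
    cam = np.linalg.inv(K) @ pix * depth_src.reshape(1, -1)            # 3 x N
    world = pose_src @ np.vstack([cam, np.ones((1, cam.shape[1]))])    # 4 x N

    # Transform into the destination camera frame and project with K.
    dst = np.linalg.inv(pose_dst) @ world
    proj = K @ dst[:3]
    uv = proj[:2] / np.clip(proj[2:3], 1e-8, None)                     # perspective divide
    return uv.T.reshape(h, w, 2)  # destination pixel coords per source pixel
```

Pixels that land at the same destination coordinates as the rendered geometry can then carry edits from a key view into the other views, which is the basis of the mixup dataset described in the Overview.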
Contribution
- We introduce ViCA-NeRF, a view-consistent 3D editing approach that makes 3D editing more flexible and controllable by editing key views
- We propose a blending refinement model and an efficient warm-up strategy that enable consistent editing without the need for iterative dataset updates during NeRF training
- We further introduce a refinement procedure that enhances the quality of simpler scenes
Related Works
Text-to-image diffusion models for 2D editing; Implicit 3D Representation; 3D Generation; NeRF Editing
Overview
Our method decouples NeRF editing into two stages. In the first stage, we sample several key views and edit them with the Instruct-Pix2Pix model. We then use depth maps and camera poses to project the edited keyframes onto the other views, producing a mixup dataset; these images are further refined by our blending model. In the second stage, the edited dataset is used directly to train the NeRF model. Optionally, the dataset can be refined again according to the updated NeRF. A high-level sketch of this pipeline follows.
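The sketch below illustrates the two-stage structure, assuming the publicly available InstructPix2Pix pipeline from Hugging Face diffusers for key-view editing; `warp_to_view`, `blend`, and `train_nerf` are hypothetical placeholders standing in for the depth-based projection, blending refinement model, and NeRF training described above, not the paper's actual implementation.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline

# Instruct-Pix2Pix for text-guided 2D edits of the sampled key views.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

def edit_scene(views, key_ids, instruction):
    # Stage 1a: edit a few sampled key views with Instruct-Pix2Pix.
    edited = {
        i: pipe(instruction, image=views[i].image,
                image_guidance_scale=1.5, guidance_scale=7.5).images[0]
        for i in key_ids
    }

    # Stage 1b: project the edited keyframes onto the other views using
    # depth maps and camera poses, then refine the resulting mixup images
    # with the blending model. (warp_to_view and blend are hypothetical
    # helpers for the steps described in the Overview.)
    mixup = []
    for view in views:
        warped = [warp_to_view(edited[i], views[i], view) for i in key_ids]
        mixup.append(blend(warped, view.image))

    # Stage 2: train the NeRF directly on the edited (mixup) dataset;
    # an optional refinement pass can re-edit the dataset from the updated NeRF.
    return train_nerf(mixup, poses=[v.pose for v in views])
```

Because the editing is decoupled from NeRF optimization, the mixup dataset is built once up front rather than through repeated edit-and-retrain cycles, which is where the reported speedup comes from.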