Authors
Ori Gordon, Omri Avrahami, Dani Lischinski
The Hebrew University of Jerusalem
Summary
Given a NeRF scene, our pipeline trains a NeRF generator model, guided by a similarity loss defined by a language-image model such as CLIP, to synthesize a new object inside a user-specified ROI. This is achieved by casting rays and sampling points for the rendering process only inside the ROI box. Additionally, our method introduces augmentations and priors to obtain more natural results. After training, we render the edited scene by blending the sample points generated by the original and generator models along each view ray.
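A minimal sketch of this training step, assuming a PyTorch NeRF stack: `generator_nerf`, `volume_render`, and the ray batches are hypothetical placeholders (not the official implementation), the text prompt is an arbitrary example, and only the ray/ROI-box intersection and the CLIP similarity loss (via the OpenAI `clip` package) are spelled out.

```python
# Sketch only: restrict ray samples to the ROI box and score ROI renderings with CLIP.
# `generator_nerf` and `volume_render` are assumed stand-ins for the NeRF generator/renderer.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)  # CLIP weights stay frozen
with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize(["an example text prompt"]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def ray_box_intersection(rays_o, rays_d, box_min, box_max):
    """Slab test: per-ray entry/exit depths of the axis-aligned ROI box."""
    inv_d = 1.0 / rays_d
    t0 = (box_min - rays_o) * inv_d
    t1 = (box_max - rays_o) * inv_d
    t_near = torch.minimum(t0, t1).amax(dim=-1)
    t_far = torch.maximum(t0, t1).amin(dim=-1)
    hit = (t_far > t_near) & (t_far > 0)
    return t_near.clamp(min=0), t_far, hit

def clip_similarity_loss(renders):
    """renders: (B, 3, 224, 224) ROI renderings, already resized/normalized for CLIP."""
    img_feat = clip_model.encode_image(renders)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return 1.0 - (img_feat @ text_feat.T).mean()  # maximize cosine similarity to the prompt

def training_step(rays_o, rays_d, roi_min, roi_max, optimizer, n_samples=64):
    # Sample points only between the ray/ROI-box intersection bounds.
    t_near, t_far, hit = ray_box_intersection(rays_o, rays_d, roi_min, roi_max)
    t_vals = torch.linspace(0.0, 1.0, n_samples, device=rays_o.device)
    z = t_near[hit][:, None] * (1 - t_vals) + t_far[hit][:, None] * t_vals
    pts = rays_o[hit][:, None, :] + z[..., None] * rays_d[hit][:, None, :]
    sigma, rgb = generator_nerf(pts, rays_d[hit])   # assumed NeRF generator
    renders = volume_render(sigma, rgb, z)          # assumed renderer -> (B, 3, 224, 224)
    loss = clip_similarity_loss(renders)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss
```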
Abstract
Editing a local region or a specific object in a 3D scene represented by a NeRF is challenging, mainly due to the implicit nature of the scene representation. Consistently blending a new realistic object into the scene adds an additional level of difficulty. We present Blended-NeRF, a robust and flexible framework for editing a specific region of interest in an existing NeRF scene, based on text prompts or image patches, along with a 3D ROI box. Our method leverages a pretrained language-image model to steer the synthesis towards a user-provided text prompt or image patch, along with a 3D MLP model initialized on an existing NeRF scene to generate the object and blend it into a specified region in the original scene. We allow local editing by localizing a 3D ROI box in the input scene, and seamlessly blend the content synthesized inside the ROI with the existing scene using a novel volumetric blending technique. To obtain natural looking and view-consistent results, we leverage existing and new geometric priors and 3D augmentations for improving the visual fidelity of the final result. We test our framework both qualitatively and quantitatively on a variety of real 3D scenes and text prompts, demonstrating realistic multi-view consistent results with greater flexibility and diversity than the baselines. Finally, we show the applicability of our framework for several 3D editing applications, including adding new objects to a scene, removing/replacing/altering existing objects, and texture conversion.
Contribution
- can operate on any region of a real-world scene
- modifies only the region of interest, while preserving the rest of the scene without learning a new feature space or requiring a set of two-dimensional masks
- generates natural-looking and view-consistent results that blend with the existing scene
- is not restricted to a specific class or domain
- enables complex text-guided manipulations such as object insertion/replacement, object blending, and texture conversion
Related Works
Neural Implicit Representations; NeRF 3D Generation; Editing NeRFs
Comparisons
Volumetric Disentanglement
Overview
(a) Training: Given a NeRF scene F_θ^O, our pipeline trains a NeRF generator model F_θ^G, initialized with the weights of F_θ^O and guided by a similarity loss defined by a language-image model such as CLIP [51], to synthesize a new object inside a user-specified ROI. This is achieved by casting rays and sampling points for the rendering process [42] only inside the ROI box. Our method introduces augmentations and priors to obtain more natural results. (b) Blending process: After training, we render the edited scene by blending the sample points generated by the two models along each view ray.
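A minimal sketch of the blending step in (b), assuming each model has already produced per-ray densities, colors, and sample depths (all tensor names below are hypothetical): samples from both models are merged by depth and alpha-composited with standard volume-rendering weights.

```python
# Illustrative sketch: blend sample points from the original model (outside the ROI) and
# the generator model (inside the ROI) along a batch of rays, then alpha-composite.
import torch

def blend_and_composite(sigma_o, rgb_o, t_o, sigma_g, rgb_g, t_g):
    """sigma_*: (R, N), rgb_*: (R, N, 3), t_*: (R, N) sample depths along each ray."""
    # Concatenate the two models' samples along the sample dimension.
    sigma = torch.cat([sigma_o, sigma_g], dim=-1)
    rgb = torch.cat([rgb_o, rgb_g], dim=-2)
    t = torch.cat([t_o, t_g], dim=-1)

    # Sort all samples along each ray by depth so compositing is front-to-back.
    t, order = torch.sort(t, dim=-1)
    sigma = torch.gather(sigma, -1, order)
    rgb = torch.gather(rgb, -2, order.unsqueeze(-1).expand_as(rgb))

    # Standard volume rendering: alpha from density and inter-sample distances.
    dists = torch.diff(t, dim=-1)
    dists = torch.cat([dists, torch.full_like(dists[..., :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * dists)
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1)[..., :-1]
    weights = alpha * trans
    return (weights.unsqueeze(-1) * rgb).sum(dim=-2)  # (R, 3) blended pixel colors
```

Because the merge happens at the sample level rather than on rendered pixels, occlusions between the synthesized content and the original scene are resolved along each ray, which is what keeps the blended result consistent across views.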