Authors
Dongqing Wang, Tong Zhang, Alaa Abboud, Sabine Süsstrunk
EPFL
Summary
Given a text instruction ("Remove the flowerpot and flowers") and a pre-trained NeRF scene, InpaintNeRF360 removes an arbitrary number of objects from the 3D scene and fills in the missing regions with perceptually plausible and view-consistent content.
Abstract
Neural Radiance Fields (NeRF) can generate highly realistic novel views. However, editing 3D scenes represented by NeRF across 360-degree views, particularly removing objects while preserving geometric and photometric consistency, remains a challenging problem due to NeRF's implicit scene representation. In this paper, we propose InpaintNeRF360, a unified framework that utilizes natural language instructions as guidance for inpainting NeRF-based 3D scenes. Our approach employs a promptable segmentation model, generating multi-modal prompts from the encoded text for multiview segmentation. We apply depth-space warping to enforce view consistency across the segmentations, and further refine the inpainted NeRF model using perceptual priors to ensure visual plausibility. InpaintNeRF360 can simultaneously remove multiple objects or modify object appearance based on text instructions while synthesizing view-consistent and photo-realistic 3D inpainting. Through extensive experiments on both unbounded and front-facing scenes trained with NeRF, we demonstrate the effectiveness of our approach and showcase its potential to enhance the editability of implicit radiance fields.
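To illustrate the depth-space warping idea, the sketch below propagates a segmentation mask from one view into another using depth rendered by the pre-trained NeRF. It is a minimal PyTorch sketch for illustration only: the function name, tensor layout, and the assumption of shared camera intrinsics are ours, not the paper's exact implementation.

```python
import torch

def warp_mask_with_depth(mask_src, depth_src, K, T_src2tgt, H, W):
    """Warp a binary mask from a source view to a target view via NeRF depth.

    mask_src:  (H, W) bool tensor, source-view segmentation mask
    depth_src: (H, W) float tensor, per-pixel depth rendered from the NeRF
    K:         (3, 3) camera intrinsics (assumed shared by both views)
    T_src2tgt: (4, 4) rigid transform from source camera to target camera
    """
    ys, xs = torch.nonzero(mask_src, as_tuple=True)              # masked pixels
    z = depth_src[ys, xs]                                        # their depths
    pix = torch.stack([xs.float(), ys.float(), torch.ones_like(z)], dim=0)
    cam_src = torch.linalg.inv(K) @ pix * z                      # back-project to 3D points
    cam_src_h = torch.cat([cam_src, torch.ones_like(z)[None]], dim=0)
    cam_tgt = (T_src2tgt @ cam_src_h)[:3]                        # move points to target frame
    proj = K @ cam_tgt
    uv = proj[:2] / proj[2].clamp(min=1e-6)                      # perspective divide

    # Splat the reprojected points into a target-view mask.
    mask_tgt = torch.zeros(H, W, dtype=torch.bool)
    u = uv[0].round().long().clamp(0, W - 1)
    v = uv[1].round().long().clamp(0, H - 1)
    mask_tgt[v, u] = True
    return mask_tgt
```

Comparing such a warped mask against the mask predicted independently in the target view gives a simple consistency check across viewpoints.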
Related Works
Image Inpainting; Inpainting Neural Radiance Fields; Object Segmentation with 3D Consistency; Text-Instructed 3D Editing
Comparisons
Overview
InpaintNeRF360 takes as input a pre-trained NeRF model and its source image dataset. The images are encoded, together with the text instructions, for object detection. To improve the accuracy of the generated 2D bounding boxes, we incorporate depth information from the pre-trained NeRF. Using point-based prompts, we then segment the rendered images to obtain detailed segmentation masks. With the generated masks and the corresponding rendered RGB images, we employ a 2D image inpainter to generate inpainting priors. Finally, we train a new NeRF model with a masked LPIPS loss against these inpainting priors, removing the desired objects while preserving perceptual quality.
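As a concrete illustration of the final fine-tuning step, the sketch below shows one way to combine a pixel-wise reconstruction term outside the mask with an LPIPS term against the 2D inpainting prior inside the mask, using the off-the-shelf lpips package in PyTorch. The function signature, patch handling, and loss weighting are assumptions for illustration, not the authors' exact training code.

```python
import torch
import lpips  # pip install lpips; pre-trained perceptual metric (Zhang et al.)

lpips_fn = lpips.LPIPS(net='vgg').eval()

def inpainting_loss(rendered, prior, gt, mask, lambda_lpips=0.1):
    """Hypothetical masked objective for fine-tuning the inpainted NeRF.

    rendered: (1, 3, H, W) patch rendered from the NeRF being fine-tuned, in [0, 1]
    prior:    (1, 3, H, W) corresponding patch from the 2D inpainter output, in [0, 1]
    gt:       (1, 3, H, W) original source image patch, in [0, 1]
    mask:     (1, 1, H, W) 1 inside the removed object, 0 elsewhere
    """
    # Outside the mask, stay faithful to the original observations.
    l2 = ((rendered - gt) ** 2 * (1 - mask)).mean()

    # Inside the mask, ask only for perceptual similarity to the inpainting
    # prior, since 2D inpaintings are not pixel-wise consistent across views.
    masked_render = rendered * mask
    masked_prior = prior * mask
    perc = lpips_fn(masked_render * 2 - 1, masked_prior * 2 - 1).mean()  # LPIPS expects [-1, 1]

    return l2 + lambda_lpips * perc
```

In this sketch the perceptual term is what allows slightly different, yet plausible, 2D inpaintings of the same region to be fused into a single view-consistent 3D completion.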