NeSF: Neural Semantic Fields for Generalizable Semantic Segmentation of 3D Scenes

Abstract

We present NeSF, a method for producing 3D semantic fields from posed RGB images alone. In place of classical 3D representations, our method builds on recent work in implicit neural scene representations wherein 3D structure is captured by point-wise functions. We leverage this methodology to recover 3D density fields upon which we then train a 3D semantic segmentation model supervised by posed 2D semantic maps. Despite being trained on 2D signals alone, our method is able to generate 3D-consistent semantic maps from novel camera poses and can be queried at arbitrary 3D points. Notably, NeSF is compatible with any method producing a density field, and its accuracy improves as the quality of the density field improves. Our empirical analysis demonstrates comparable quality to competitive 2D and 3D semantic segmentation baselines on complex, realistically rendered synthetic scenes. Our method is the first to offer truly dense 3D scene segmentations requiring only 2D supervision for training, and does not require any semantic input for inference on novel scenes. We encourage readers to visit the project website.
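
To illustrate how posed 2D semantic maps could supervise a model that predicts semantics at 3D points, the following is a minimal sketch, not the implementation described in this work. It assumes the standard NeRF-style volume rendering quadrature, which the abstract does not spell out: per-point class probabilities along a camera ray are composited with weights derived from the density field, and the resulting per-pixel distribution is compared against the 2D label. The function names (`render_weights`, `render_semantics`, `cross_entropy`) and the toy inputs are hypothetical.

```python
import math


def render_weights(densities, deltas):
    """Compositing weights for samples along one ray (assumed NeRF-style quadrature)."""
    weights = []
    transmittance = 1.0  # fraction of light that reaches the current sample
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def render_semantics(densities, deltas, class_probs):
    """Composite per-point class probabilities into one per-pixel vector."""
    weights = render_weights(densities, deltas)
    num_classes = len(class_probs[0])
    pixel = [0.0] * num_classes
    for w, probs in zip(weights, class_probs):
        for c in range(num_classes):
            pixel[c] += w * probs[c]
    return pixel


def cross_entropy(pixel_probs, label, eps=1e-8):
    """2D supervision signal: negative log-likelihood of the labelled class."""
    return -math.log(pixel_probs[label] + eps)


# Toy ray: empty space first, then two samples inside a dense object.
densities = [0.0, 50.0, 50.0]          # hypothetical density field values
deltas = [0.1, 0.1, 0.1]               # spacing between consecutive samples
class_probs = [[1.0, 0.0],             # hypothetical 3D model outputs per sample
               [0.0, 1.0],
               [0.0, 1.0]]

pixel = render_semantics(densities, deltas, class_probs)
loss = cross_entropy(pixel, label=1)
print(pixel, loss)
```

In this toy ray the first sample lies in empty space and receives zero weight, so the rendered pixel is dominated by the class predicted inside the dense region. Because the weights depend only on the density field, a sketch of this kind works with any method that supplies densities, and a more accurate density field places the weights more precisely.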