ConceptFusion: Open-set multimodal 3D mapping

Abstract

modalities such as natural language, images, and audio. We demonstrate that pixel-aligned open-set features can be fused into 3D maps via traditional SLAM and multi-view fusion approaches. This enables effective zero-shot spatial reasoning without any additional training or finetuning, and retains long-tailed concepts better than supervised approaches, outperforming them by a margin of more than 40% on 3D IoU. We extensively evaluate ConceptFusion on a number of real-world datasets, simulated home environments, a real-world tabletop manipulation task, and an autonomous driving platform. We showcase new avenues for blending foundation models with 3D open-set multimodal mapping. We encourage the reader to view the demos on our project page: https://concept-fusion.github.io/
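
To make the fusion idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a SLAM front end has already associated each pixel with a 3D point, so every observation arrives as a hypothetical `(point_id, feature, weight)` triple, where `weight` stands in for some per-view confidence. The function names `fuse_features`, `cosine_similarity`, and `rank_points` are illustrative only. Features seen from multiple views are combined by a weighted average, and a query embedding (from text, an image, or audio, assuming it lives in the same feature space) is scored against the map by cosine similarity.

```python
import math


def fuse_features(observations):
    """Fuse per-pixel features into per-point features by weighted averaging.

    observations: iterable of (point_id, feature, weight) triples.
    The pixel-to-point association is assumed to be given (e.g. by SLAM).
    """
    sums = {}     # point_id -> running weighted sum of features
    totals = {}   # point_id -> running sum of weights
    for point_id, feature, weight in observations:
        if weight <= 0:
            continue  # ignore observations that carry no confidence
        if point_id not in sums:
            sums[point_id] = [0.0] * len(feature)
            totals[point_id] = 0.0
        acc = sums[point_id]
        for i, value in enumerate(feature):
            acc[i] += weight * value
        totals[point_id] += weight
    return {pid: [v / totals[pid] for v in acc] for pid, acc in sums.items()}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_points(fused, query):
    """Return (similarity, point_id) pairs, most similar to the query first."""
    scored = [(cosine_similarity(f, query), pid) for pid, f in fused.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Point 0 is seen in two views, point 1 in one view (toy 2-D features).
    observations = [
        (0, [1.0, 0.0], 1.0),
        (0, [0.8, 0.2], 1.0),
        (1, [0.0, 1.0], 2.0),
    ]
    fused = fuse_features(observations)
    for similarity, point_id in rank_points(fused, [1.0, 0.0]):
        print(point_id, round(similarity, 3))
```

Because the query step only compares embeddings, no training or finetuning is involved in this sketch, which mirrors the zero-shot setting described above. A real system would use high-dimensional foundation-model features and a dense map rather than toy two-dimensional vectors.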
