Concepedia

Publication | Open Access

Digging Into Self-Supervised Monocular Depth Estimation

Citations: 212
References: 55
Year: 2019

TLDR

Ground-truth depth data are hard to obtain at scale, so self-supervised learning has become a promising alternative for monocular depth estimation, with recent advances in architectures, loss functions, and image formation models narrowing the gap to fully supervised methods. The paper proposes a set of improvements that together yield quantitatively and qualitatively better depth maps than competing self-supervised methods: a minimum reprojection loss for occlusion robustness, a full-resolution multi-scale sampling scheme that reduces visual artifacts, and an auto-masking loss that discards training pixels violating camera motion assumptions. The effectiveness of each component is demonstrated in isolation, and a surprisingly simple model achieves state-of-the-art results on the KITTI benchmark.

Abstract

Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.
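The minimum reprojection loss (i) and the auto-masking loss (iii) can be sketched together on per-pixel photometric error maps. The sketch below is a minimal NumPy illustration, not the paper's implementation: `reproj_errors` and `identity_errors` are assumed to be precomputed photometric errors between the target frame and each warped / unwarped source frame, and the function name and shapes are hypothetical.

```python
import numpy as np

def min_reprojection_loss(reproj_errors, identity_errors):
    """Hypothetical sketch of minimum reprojection with auto-masking.

    reproj_errors:   (S, H, W) photometric error of each *warped* source frame
    identity_errors: (S, H, W) photometric error of each *unwarped* source frame
    """
    # (i) Minimum reprojection: per pixel, keep only the source frame with the
    # lowest error, so a pixel occluded in one source view can still be
    # explained by another view instead of being penalized.
    min_reproj = reproj_errors.min(axis=0)

    # (iii) Auto-masking: drop pixels where the *unwarped* source frame already
    # matches the target better than any warped frame -- these violate the
    # moving-camera / static-scene assumption (e.g. the camera is stationary,
    # or an object moves at the same speed as the camera).
    min_identity = identity_errors.min(axis=0)
    mask = (min_reproj < min_identity).astype(np.float32)

    # Average the minimum reprojection error over the unmasked pixels only.
    return float((mask * min_reproj).sum() / max(mask.sum(), 1.0))
```

In a training loop this scalar would be combined with the paper's edge-aware smoothness term and averaged across the full-resolution multi-scale error maps of component (ii).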
