Pulling Things out of Perspective

TLDR

The limitations of current single‑view depth estimation and semantic segmentation methods stem from perspective geometry, where perceived object size scales inversely with distance. Exploiting this property, we simplify pixel‑wise depth classification to predicting the likelihood of a pixel being at a fixed canonical depth. The approach trains a single canonical‑depth likelihood classifier, applies it to transformed images for other depths, and conditions depth and semantic labels on each other to align data to physical scale and resolve ambiguities. On the KITTI and NYU2 datasets, the method generalizes to multiple semantic classes and significantly outperforms state‑of‑the‑art methods for depth estimation and semantic segmentation.

Abstract

The limitations of current state-of-the-art methods for single-view depth estimation and semantic segmentation are closely tied to a property of perspective geometry: the perceived size of an object scales inversely with its distance. In this paper, we show that this property can be used to reduce the learning of a pixel-wise depth classifier to a much simpler classifier that predicts only the likelihood of a pixel being at an arbitrarily fixed canonical depth. The likelihoods for any other depth can be obtained by applying the same classifier after appropriate image manipulations. Transforming the problem to the canonical depth in this way removes the training data bias towards certain depths and the effect of perspective. The approach generalizes straightforwardly to multiple semantic classes, improving both depth estimation and semantic segmentation performance by directly targeting the weaknesses of independent approaches. Conditioning the semantic label on the depth provides a way to align the data to their physical scale, which allows a more discriminative classifier to be learned. Conditioning depth on the semantic class helps the classifier resolve ambiguities of the otherwise ill-posed problem. We tested our algorithm on the KITTI road scene dataset and the NYU2 indoor dataset and obtained results that significantly outperform the current state of the art in both single-view depth estimation and semantic segmentation.
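
The canonical-depth idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the nearest-neighbour resampling, and the `classifier` callable (assumed to map an image to a per-pixel score map of the same height and width) are all assumptions made for illustration. The sketch relies only on the property stated in the abstract: since apparent size scales inversely with distance, rescaling the image by d / d_c makes content at depth d appear as it would at the canonical depth d_c.

```python
import numpy as np


def scale_for_depth(depth, canonical_depth):
    """Rescaling factor that makes content at `depth` look canonical.

    Apparent size is proportional to 1 / depth, so an object at twice the
    canonical depth appears half as large and must be enlarged by 2.
    """
    return depth / canonical_depth


def resize_nearest(arr, out_h, out_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    in_h, in_w = arr.shape[:2]
    rows = np.minimum((np.arange(out_h) * in_h / out_h).astype(int), in_h - 1)
    cols = np.minimum((np.arange(out_w) * in_w / out_w).astype(int), in_w - 1)
    return arr[rows[:, None], cols[None, :]]


def depth_likelihoods(image, classifier, depths, canonical_depth):
    """Score every pixel for every candidate depth with one classifier.

    `classifier` is a hypothetical callable trained only for the canonical
    depth. For each candidate depth the image is rescaled, scored, and the
    score map is resampled back to the original resolution.
    Returns an array of shape (len(depths), H, W).
    """
    h, w = image.shape[:2]
    maps = []
    for d in depths:
        s = scale_for_depth(d, canonical_depth)
        sh, sw = max(1, round(h * s)), max(1, round(w * s))
        scaled = resize_nearest(image, sh, sw)
        scores = classifier(scaled)  # assumed output shape: (sh, sw)
        maps.append(resize_nearest(scores, h, w))
    return np.stack(maps)


# Usage with a stand-in classifier that simply returns pixel intensity.
image = np.full((4, 6), 0.5)
likelihoods = depth_likelihoods(
    image, lambda img: img.astype(float), depths=[5.0, 10.0, 20.0],
    canonical_depth=10.0,
)
best_depth_index = likelihoods.argmax(axis=0)  # per-pixel most likely depth
```

Taking the per-pixel argmax over the depth axis, as in the last line, is only one simple way to turn the likelihoods into a depth map; the abstract does not specify how the likelihoods are combined.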
