Publication | Open Access
Semantically-Guided Representation Learning for Self-Supervised Monocular Depth
107 Citations · 2020
Self-supervised learning is showing great promise for monocular depth estimation, using geometry as the only source of supervision. Depth networks are indeed capable of learning representations that relate visual appearance to 3D properties by implicitly leveraging category-level patterns. In this work we investigate how to leverage this semantic structure more directly to guide geometric representation learning, while remaining in the self-supervised regime. Instead of using semantic labels and proxy losses in a multi-task approach, we propose a new architecture that leverages fixed pretrained semantic segmentation networks to guide self-supervised representation learning via pixel-adaptive convolutions. Furthermore, we propose a two-stage training process that overcomes a common semantic bias on dynamic objects via resampling. Our method improves upon the state of the art for self-supervised monocular depth prediction over all pixels, on fine-grained details, and per semantic category.
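As a rough illustration of the pixel-adaptive convolution idea the abstract refers to, the sketch below modulates a shared convolution kernel, per pixel, with a Gaussian affinity computed from guidance features. This follows the common Gaussian-kernel form of pixel-adaptive convolutions (Su et al., CVPR 2019); the class and argument names (`PixelAdaptiveConv`, `guidance`) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a pixel-adaptive convolution, assuming a Gaussian
# affinity kernel over guidance features. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAdaptiveConv(nn.Module):
    """Convolution whose shared kernel is reweighted at every pixel by
    guidance features (e.g. activations of a frozen segmentation network)."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k, self.pad = k, k // 2
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x, guidance):
        B, C, H, W = x.shape
        k2 = self.k * self.k
        # Unfold input and guidance into (B, C, k*k, H*W) neighborhoods.
        xu = F.unfold(x, self.k, padding=self.pad).view(B, C, k2, H * W)
        gu = F.unfold(guidance, self.k, padding=self.pad)
        gu = gu.view(B, guidance.shape[1], k2, H * W)
        # Gaussian affinity between each pixel (patch center) and its
        # neighbors in guidance space: exp(-0.5 * ||f_i - f_j||^2).
        center = gu[:, :, k2 // 2 : k2 // 2 + 1]            # (B, Cg, 1, HW)
        kernel = torch.exp(-0.5 * ((gu - center) ** 2).sum(1, keepdim=True))
        # Reweight neighborhoods by the affinity, then apply the shared
        # weight; this makes the effective filter vary spatially.
        w = self.weight.view(1, self.weight.shape[0], C, k2, 1)
        out = (w * (kernel.unsqueeze(1) * xu.unsqueeze(1))).sum(dim=(2, 3))
        return out.view(B, -1, H, W) + self.bias.view(1, -1, 1, 1)
```

In the setting the abstract describes, `guidance` would be feature maps from the fixed pretrained segmentation network, so the depth network's filters adapt to semantic context while the segmentation weights stay frozen.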