Publication | Closed Access
Towards unified depth and semantic prediction from a single image
Citations: 490
References: 29
Year: 2015
Venue: Unknown
Scene Analysis, Engineering, Machine Learning, Depth Prediction, Depth Map, Joint Depth, Image Analysis, Data Science, Pattern Recognition, Semantic Segmentation, Machine Vision, Deep Learning, Computer Vision, Image Understanding, 3D Vision, Scene Interpretation, Scene Understanding, Semantic Prediction, Scene Modeling
Depth estimation and semantic segmentation are fundamental image‑understanding tasks that are strongly correlated but typically solved separately or sequentially. The authors propose a unified framework for joint depth and semantic prediction, motivated by the complementary properties of the two tasks. The method first uses a CNN to jointly predict a global depth–semantic layout, then refines it by segmenting the image into local regions for region‑level depth and semantic prediction guided by the global layout, and finally integrates these predictions in a two‑layer hierarchical conditional random field to produce the final depth and semantic maps. The joint network yields more accurate depth predictions than a state‑of‑the‑art depth‑only CNN and achieves state‑of‑the‑art results on both depth and semantic tasks, as shown by the experiments.
Depth estimation and semantic segmentation are two fundamental problems in image understanding. While the two tasks are strongly correlated and mutually beneficial, they are usually solved separately or sequentially. Motivated by the complementary properties of the two tasks, we propose a unified framework for joint depth and semantic prediction. Given an image, we first use a trained Convolutional Neural Network (CNN) to jointly predict a global layout composed of pixel-wise depth values and semantic labels. By allowing for interactions between the depth and semantic information, the joint network provides more accurate depth prediction than a state-of-the-art CNN trained solely for depth prediction [6]. To further obtain fine-level details, the image is decomposed into local segments for region-level depth and semantic prediction under the guidance of the global layout. Using the pixel-wise global prediction and the region-wise local prediction, we formulate the inference problem as a two-layer Hierarchical Conditional Random Field (HCRF) that produces the final depth and semantic maps. As demonstrated in the experiments, our approach effectively leverages the advantages of both tasks and achieves state-of-the-art results.
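To make the relationship between the pixel-wise global layout and the region-level prediction concrete, the following is a minimal sketch, not the authors' method. It assumes the global depth map, per-pixel class probabilities, and a segmentation into integer region ids are already available, and it replaces the learned region-level predictor and the HCRF with simple per-segment pooling (mean depth, highest mean class probability). The function name `pool_global_layout` and all parameter names are hypothetical.

```python
import numpy as np


def pool_global_layout(depth, sem_prob, segments):
    """Pool pixel-wise predictions into one depth and one label per segment.

    Simplified stand-in for region-level prediction (assumption: plain
    pooling, no learned local model and no CRF inference).

    depth    : (H, W) float array, global pixel-wise depth
    sem_prob : (H, W, C) float array, global per-pixel class probabilities
    segments : (H, W) int array, segment ids in 0..S-1
    """
    n_seg = int(segments.max()) + 1
    flat = segments.ravel()

    # Number of pixels per segment; guard against unused ids.
    counts = np.bincount(flat, minlength=n_seg).astype(float)
    counts[counts == 0] = 1.0

    # Mean depth per segment.
    seg_depth = np.bincount(flat, weights=depth.ravel(), minlength=n_seg) / counts

    # Mean class probability per segment, one column per class.
    n_cls = sem_prob.shape[-1]
    seg_prob = np.stack(
        [
            np.bincount(flat, weights=sem_prob[..., c].ravel(), minlength=n_seg)
            for c in range(n_cls)
        ],
        axis=1,
    ) / counts[:, None]
    seg_label = seg_prob.argmax(axis=1)

    # Broadcast the region-level values back to pixel resolution.
    return seg_depth[segments], seg_label[segments]
```

In the framework described above, the region-level predictions come from a model guided by the global layout and are combined with the pixel-wise predictions through HCRF inference; the pooling here only illustrates how the two resolutions relate.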