Publication | Closed Access
Learning structured output representation using deep conditional generative models
Citations: 2.1K
References: 33
Year: 2015
Venue: Unknown
Keywords: Natural Language Processing, Structured Output Prediction, Structured Prediction, Machine Vision, Machine Learning, Data Science, Engineering, Feature Learning, Generative Adversarial Network, Autoencoders, Generative Models, Computer Science, Robot Learning, Deep Learning, Gaussian Latent Variables, Computer Vision, Output Representation
Supervised deep learning has been successfully applied to many recognition problems. Although it can approximate a complex many-to-one function well when a large amount of training data is provided, it is still challenging to model complex structured output representations that effectively perform probabilistic inference and make diverse predictions. In this work, we develop a deep conditional generative model for structured output prediction using Gaussian latent variables. The model is trained efficiently in the framework of stochastic gradient variational Bayes, and allows for fast prediction using stochastic feed-forward inference. In addition, we provide novel strategies to build robust structured prediction algorithms, such as input noise injection and a multi-scale prediction objective at training. In experiments, we demonstrate the effectiveness of our proposed algorithm in comparison to the deterministic deep neural network counterparts in generating diverse but realistic structured output predictions using stochastic inference. Furthermore, the proposed training methods are complementary, which leads to strong pixel-level object segmentation and semantic labeling performance on Caltech-UCSD Birds 200 and a subset of the Labeled Faces in the Wild dataset.
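The training objective described in the abstract is a conditional variational lower bound: a recognition network q(z|x,y) and a conditional prior p(z|x) are both Gaussian, and the bound combines a reconstruction term with a KL divergence between them, estimated via the reparameterization trick. The following is a minimal NumPy sketch of that objective, not the paper's implementation; the callables `recognition`, `prior`, and `generator` are hypothetical placeholders standing in for the networks.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over latent dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def cvae_elbo(x, y, recognition, prior, generator, rng):
    """One-sample Monte Carlo estimate of the conditional ELBO:
       E_{q(z|x,y)}[ log p(y|x,z) ] - KL( q(z|x,y) || p(z|x) ).
    `recognition`, `prior`, and `generator` are placeholder callables
    standing in for neural networks."""
    mu_q, logvar_q = recognition(x, y)   # q(z | x, y)
    mu_p, logvar_p = prior(x)            # p(z | x)
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    eps = rng.standard_normal(mu_q.shape)
    z = mu_q + np.exp(0.5 * logvar_q) * eps
    log_lik = generator(x, z, y)         # log p(y | x, z)
    return log_lik - gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
```

At test time, prediction proceeds by sampling z from the conditional prior p(z|x) and decoding through the generator, which is the fast stochastic feed-forward inference the abstract refers to; drawing multiple z samples yields the diverse predictions.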