Learning Representations for Multimodal Data with Deep Belief Nets

Abstract

We propose a Deep Belief Network architecture for learning a joint representation of multimodal data. The model defines a probability distribution over the space of multimodal inputs and allows sampling from the conditional distributions over each data modality. This makes it possible for the model to create a multimodal representation even when some data modalities are missing. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBN can learn a good generative model of the joint space of image and text inputs that is useful for filling in missing data, so it can be used for both image annotation and image retrieval. We further demonstrate that, using the representation discovered by the Multimodal DBN, our model can significantly outperform SVMs and LDA on discriminative tasks.
