Multimodal Learning with Deep Boltzmann Machines

TLDR

A multimodal Deep Boltzmann Machine (DBM) extracts a unified representation that fuses multiple input modalities. It learns a joint probability density over multimodal inputs, uses the states of its latent variables as the representation, and fills in missing modalities by sampling from their conditional distribution. In experiments on image and text data, the multimodal DBM improves classification and information retrieval, outperforming SVMs, LDA, and other deep learning baselines such as autoencoders and deep belief networks.

Abstract

A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains.
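To make the fill-in idea concrete, here is a minimal sketch of inferring a missing text modality by Gibbs sampling while the image is held fixed. It is not the authors' implementation. It assumes a heavily simplified model: one shared hidden layer instead of a deep stack, binary units for both modalities, and random untrained weights. The names `infer_missing_text`, `W_img`, `W_txt`, `b_hid`, and `b_txt` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def infer_missing_text(v_img, W_img, W_txt, b_hid, b_txt, n_steps=50, seed=0):
    """Fill in a missing text vector by alternating Gibbs sampling.

    Assumption: a single shared hidden layer with binary units, used as a
    toy stand-in for the deeper model described in the abstract.
    """
    rng = np.random.default_rng(seed)
    # Start the missing modality from a random binary state.
    v_txt = rng.integers(0, 2, size=W_txt.shape[0]).astype(float)
    for _ in range(n_steps):
        # Sample the shared hidden units given both modalities.
        p_hid = sigmoid(v_img @ W_img + v_txt @ W_txt + b_hid)
        h = (rng.random(p_hid.shape) < p_hid).astype(float)
        # Resample only the text units; the observed image stays clamped.
        p_txt = sigmoid(h @ W_txt.T + b_txt)
        v_txt = (rng.random(p_txt.shape) < p_txt).astype(float)
    # Hidden activation probabilities serve as the fused representation.
    joint_rep = sigmoid(v_img @ W_img + v_txt @ W_txt + b_hid)
    return v_txt, joint_rep


# Toy usage with random (untrained) parameters, for shape checking only.
rng = np.random.default_rng(1)
n_img, n_txt, n_hid = 6, 4, 3
W_img = 0.1 * rng.standard_normal((n_img, n_hid))
W_txt = 0.1 * rng.standard_normal((n_txt, n_hid))
b_hid = np.zeros(n_hid)
b_txt = np.zeros(n_txt)
v_img = rng.integers(0, 2, size=n_img).astype(float)

v_txt, joint_rep = infer_missing_text(v_img, W_img, W_txt, b_hid, b_txt)
```

With trained weights, `joint_rep` would play the role of the fused representation passed to a classifier or retrieval system, and `v_txt` would be one sample of the missing modality. With the random weights used here, both outputs are meaningless beyond their shapes.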
