Multimodal Representation Learning via Maximization of Local Mutual Information

Abstract

We propose and demonstrate a representation learning approach that maximizes the mutual information between local features of images and text. The goal of this approach is to learn useful image representations by taking advantage of the rich information contained in the free text that describes the findings in the image. Our method trains image and text encoders by encouraging the resulting representations to exhibit high local mutual information. We make use of recent advances in mutual information estimation with neural network discriminators. We argue that the sum of local mutual information is typically a lower bound on the global mutual information. Our experimental results on downstream image classification tasks demonstrate the advantages of using local features for image-text representation learning.
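The objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not name a specific estimator, so this sketch assumes an InfoNCE-style lower bound with a plain dot-product critic standing in for a learned discriminator. The function name `local_infonce_bound`, the tensor shapes, and the random stand-in features are all hypothetical.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def local_infonce_bound(image_local, text_feat):
    """InfoNCE-style lower-bound estimate of MI, one value per image region.

    image_local: (B, M, D) array, M local image features per example
                 (assumed shape, e.g. a flattened feature map).
    text_feat:   (B, D) array, one text feature per example (assumed).
    Returns an (M,) array of per-region estimates, each at most log(B).
    """
    batch = image_local.shape[0]
    # Critic scores for every (image region, text) pairing in the batch:
    # scores[i, m, j] = <image_local[i, m], text_feat[j]>.
    scores = np.einsum("imd,jd->imj", image_local, text_feat)
    # Scores of the matched pairs (image i with its own text i).
    pos = np.einsum("imd,id->im", image_local, text_feat)
    # Log-normaliser over all texts in the batch; the other texts in the
    # batch act as negatives.
    log_norm = logsumexp(scores, axis=2)
    # Average over the batch, one estimate per region.
    return np.log(batch) + np.mean(pos - log_norm, axis=0)

# Random stand-ins for encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
image_local = rng.normal(size=(8, 5, 16))
text_feat = rng.normal(size=(8, 16))

per_region = local_infonce_bound(image_local, text_feat)
# The quantity to maximise in this sketch: the sum of local estimates.
objective = per_region.sum()
print(per_region.shape, float(objective))
```

In a training setting, the two encoders would produce `image_local` and `text_feat`, and `objective` would be maximized with an automatic differentiation framework; the sketch only shows how the local terms are formed and summed.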
