Cross-media Retrieval by Learning Rich Semantic Embeddings of Multimedia

Abstract

Cross-media retrieval seeks semantic associations between different media types. Most existing methods focus on learning mapping functions or finding optimal spaces, but neglect how people actually perceive and understand images and texts. This paper proposes a brain-inspired cross-media retrieval framework that learns rich semantic embeddings of multimedia. Rather than directly using off-the-shelf image features, we combine the visual and descriptive senses of an image, from the viewpoint of human perception, in a joint model called the multi-sensory fusion network (MSFN). A topic-model-based TextNet maps texts into the same semantic space as images according to their shared ground-truth labels. Moreover, to overcome the shortage of data for training neural networks and the limited complexity of text forms, we introduce a large-scale image-text dataset, called the Britannica dataset. Extensive experiments show the effectiveness of our framework for texts of different lengths on three benchmark datasets as well as on the Britannica dataset. Most notably, we report the best known average results for Img2Text and Text2Img compared with several state-of-the-art methods.
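
As a rough illustration of what retrieval in a shared semantic space involves, the sketch below ranks text embeddings against an image embedding (the Img2Text direction). It is not the MSFN or TextNet model: the toy three-dimensional vectors, the choice of cosine similarity, and the function names `cosine` and `rank_by_similarity` are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [(cosine(query, c), i) for i, c in enumerate(candidates)]
    # Sort by descending similarity; ties are broken by the lower index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores]


# Hypothetical embeddings: each dimension stands in for one semantic label.
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [
    [0.1, 0.8, 0.1],  # text 0: mostly the second label
    [0.8, 0.2, 0.0],  # text 1: mostly the first label, like the image
    [0.0, 0.1, 0.9],  # text 2: mostly the third label
]

ranking = rank_by_similarity(image_embedding, text_embeddings)
print(ranking)
```

Text2Img retrieval would use the same ranking with the roles swapped: a text embedding as the query and image embeddings as the candidates.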
