
TLDR

The study addresses joint modeling of the text and image components of multimedia documents, applied to cross-modal retrieval: finding the text that best matches a query image, and vice versa. It tests whether explicitly modeling cross-modal correlations, and doing so in more abstract semantic feature spaces, improves cross-modal document retrieval. Text is represented by latent Dirichlet allocation topics and images by bags of SIFT features; cross-modal correlations are learned with canonical correlation analysis, and abstraction is obtained by representing both modalities at a semantic level. Accounting for cross-modal correlations and semantic abstraction improves retrieval accuracy, and the cross-modal model also outperforms state-of-the-art image retrieval systems on a unimodal retrieval task.

Abstract

The problem of jointly modeling the text and image components of multimedia documents is studied. The text component is represented as a sample from a hidden topic model, learned with latent Dirichlet allocation, and images are represented as bags of visual (SIFT) features. Two hypotheses are investigated: 1) that there is a benefit to explicitly modeling correlations between the two components, and 2) that this modeling is more effective in feature spaces with higher levels of abstraction. Correlations between the two components are learned with canonical correlation analysis. Abstraction is achieved by representing text and images at a more general, semantic level. The two hypotheses are studied in the context of cross-modal document retrieval, which includes retrieving the text that most closely matches a query image and retrieving the images that most closely match a query text. Accounting for cross-modal correlations and using semantic abstraction are both shown to improve retrieval accuracy. The cross-modal model is also shown to outperform state-of-the-art image retrieval systems on a unimodal retrieval task.
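
As a rough illustration of the correlation-matching step described above, the following is a minimal sketch of cross-modal retrieval with canonical correlation analysis, written with NumPy only. It is not the authors' implementation. The synthetic `text` and `image` matrices stand in for LDA topic proportions and SIFT bag-of-words histograms, and the function names `fit_cca` and `rank_gallery`, the ridge term `reg`, and the use of cosine similarity for ranking are all assumptions made for the example.

```python
import numpy as np


def fit_cca(X, Y, n_components, reg=1e-3):
    """Fit linear CCA between paired rows of X and Y.

    Returns the two feature means, the two projection matrices, and the
    leading canonical correlations. `reg` is a small ridge term (an
    assumption of this sketch) that keeps the covariances invertible.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    Wx = Kx @ U[:, :n_components]
    Wy = Ky @ Vt.T[:, :n_components]
    return mx, my, Wx, Wy, s[:n_components]


def rank_gallery(query, mq, Wq, gallery, mg, Wg):
    """Rank gallery items of one modality for a query from the other modality.

    Both sides are projected into the shared CCA subspace and compared by
    cosine similarity; the returned indices are ordered best match first.
    """
    q = (query - mq) @ Wq
    G = (gallery - mg) @ Wg
    q = q / np.linalg.norm(q)
    G = G / np.linalg.norm(G, axis=1, keepdims=True)
    return np.argsort(-(G @ q))


# Synthetic paired data: a shared latent factor drives both modalities.
rng = np.random.default_rng(0)
n, k = 200, 5
Z = rng.normal(size=(n, k))
text = Z @ rng.normal(size=(k, 10)) + 0.01 * rng.normal(size=(n, 10))    # stand-in for topic features
image = Z @ rng.normal(size=(k, 20)) + 0.01 * rng.normal(size=(n, 20))   # stand-in for visual-word features

mt, mi, Wt, Wi, corr = fit_cca(text, image, n_components=k)

# Image-to-text retrieval: does each image rank its own paired text first?
hits = sum(rank_gallery(image[i], mi, Wi, text, mt, Wt)[0] == i for i in range(n))
accuracy = hits / n
print("canonical correlations:", np.round(corr, 3))
print("top-1 image-to-text accuracy:", accuracy)
```

Swapping the roles of the two modalities in `rank_gallery` gives text-to-image retrieval. The semantic-abstraction step from the abstract would correspond to replacing the raw feature matrices with higher-level representations before matching; that step is not shown here.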
