Publication | Open Access
Grounded Compositional Semantics for Finding and Describing Images with Sentences
Citations: 825 · References: 36 · Year: 2014
Topics: Language Grounding · Engineering · Machine Learning · Vector Space Semantics · Corpus Linguistics · Word Embeddings · Natural Language Processing · Multimodal LLM · Text-to-Image Retrieval · Compositional Semantics · Visual Grounding · Computational Linguistics · Visual Question Answering · Language Studies · Constituency Trees · Machine Translation · Recursive Neural Networks · Vision Language Model · Deep Learning · Computer Vision · Linguistics
Previous work on Recursive Neural Networks (RNNs) shows that these models can produce compositional feature vectors for accurately representing and classifying sentences or images. However, the sentence vectors of previous models cannot accurately represent visually grounded meaning. We introduce the DT-RNN model, which uses dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences. Unlike previous RNN-based models, which use constituency trees, DT-RNNs naturally focus on the action and agents in a sentence and are better able to abstract from the details of word order and syntactic expression. DT-RNNs outperform other recursive and recurrent neural networks, kernelized CCA, and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa. They also give more similar representations to sentences that describe the same image.
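The core idea above, composing a sentence vector bottom-up over a dependency tree and scoring images in the same space, can be sketched as follows. This is a minimal illustration, not the paper's exact model: the full DT-RNN uses weight matrices that depend on the dependency relation and the number of children, whereas this sketch uses a single shared child matrix. The names `compose`, `W_word`, and `W_dep`, and all the random toy vectors, are assumptions for illustration only.

```python
import numpy as np

def compose(node, W_word, W_dep, word_vecs):
    """Embed a dependency-tree node: transform the head word's vector,
    add transformed child embeddings, average over contributors, and
    squash with tanh. Summing children makes the result insensitive
    to their surface order, mirroring the abstraction from word order."""
    total = W_word @ word_vecs[node["word"]]
    count = 1
    for child in node.get("children", []):
        total += W_dep @ compose(child, W_word, W_dep, word_vecs)
        count += 1
    return np.tanh(total / count)

# Toy setup: random vectors stand in for trained parameters.
rng = np.random.default_rng(0)
dim = 50
word_vecs = {w: rng.standard_normal(dim) for w in ["man", "rides", "bike"]}
W_word = 0.1 * rng.standard_normal((dim, dim))
W_dep = 0.1 * rng.standard_normal((dim, dim))

# "A man rides a bike": the verb heads the tree; its arguments are children,
# so the action and agents dominate the composition.
tree = {"word": "rides",
        "children": [{"word": "man"}, {"word": "bike"}]}
sent_vec = compose(tree, W_word, W_dep, word_vecs)

# Retrieval sketch: score candidate images by inner product with the
# sentence vector, assuming image features were mapped into the same space.
image_vecs = rng.standard_normal((3, dim))
best = int(np.argmax(image_vecs @ sent_vec))
```

Because child contributions are summed, a tree with the same head and the same set of dependents yields the same vector regardless of the order in which the dependents appear, which is one concrete sense in which the model abstracts from syntactic variation.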