Deep Modular Co-Attention Networks for Visual Question Answering

TLDR

VQA demands fine-grained, simultaneous understanding of visual and textual content, so co-attention models that link key question words to key image objects are central to performance. Yet existing deep co-attention approaches offer little improvement over shallow ones. This work introduces a deep Modular Co-Attention Network (MCAN) that stacks Modular Co-Attention (MCA) layers in depth. Each MCA layer jointly models question self-attention, image self-attention, and question-guided image attention through a modular composition of two basic attention units. The network is evaluated on VQA-v2 with extensive ablation studies; the best single model reaches 70.63% overall accuracy on the test-dev set, significantly outperforming the previous state of the art.

Abstract

Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective 'co-attention' model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have used shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer jointly models the self-attention of questions and images, as well as the question-guided attention of images, using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN's effectiveness. Experimental results demonstrate that MCAN significantly outperforms the previous state-of-the-art. Our best single model delivers 70.63% overall accuracy on the test-dev set.
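
To make the composition described above more concrete, the following is a minimal NumPy sketch of how two attention units could be combined into one layer and stacked in depth. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the internals of the units, so plain scaled dot-product attention is assumed, and learned projections, multiple heads, feed-forward sublayers, residual connections, and normalization are all omitted. The names `self_attention`, `guided_attention`, `mca_layer`, and `mcan_stack`, the feature sizes, and the simple layer-by-layer stacking are hypothetical choices made for this sketch.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Scaled dot-product attention (assumed form of the basic unit)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values, weights


def self_attention(x):
    """Hypothetical unit 1: features attend to themselves."""
    out, _ = attention(x, x, x)
    return out


def guided_attention(x, guide):
    """Hypothetical unit 2: features x attend to guiding features."""
    out, _ = attention(x, guide, guide)
    return out


def mca_layer(question, image):
    """One layer: self-attention on both inputs, then
    question-guided attention on the image features."""
    question = self_attention(question)
    image = self_attention(image)
    image = guided_attention(image, question)
    return question, image


def mcan_stack(question, image, num_layers=2):
    """Cascade several layers in depth (stacking scheme is an assumption)."""
    for _ in range(num_layers):
        question, image = mca_layer(question, image)
    return question, image


# Toy inputs: 5 question-word features and 3 image-region features, dim 8.
rng = np.random.default_rng(0)
question_feats = rng.normal(size=(5, 8))
image_feats = rng.normal(size=(3, 8))

q_out, v_out = mcan_stack(question_feats, image_feats, num_layers=2)
_, attn_weights = attention(image_feats, question_feats, question_feats)
print(q_out.shape, v_out.shape, attn_weights.shape)
```

In this sketch the attended outputs keep the shape of their inputs, which is what allows layers to be cascaded; the guided unit differs from the self-attention unit only in where the keys and values come from.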
