Publication | Closed Access
Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering
Citations: 703
References: 39
Year: 2017
Unknown Venue
Natural Language Processing, Visual Content, Multimodal LLM, Engineering, Machine Learning, Text-to-image Retrieval, Visual Grounding, Vision Language Model, Multimodal Learning, Multi-modal Factorized Bilinear, Visual Question Answering, Computer Science, Textual Content, Attention, Deep Learning, Co-attention Learning, Computer Vision
Visual question answering demands simultaneous understanding of image and question content, and effective multimodal representation and fusion are critical, yet bilinear pooling models, though superior to linear ones, suffer from high dimensionality and computational cost. This work proposes a Multi‑modal Factorized Bilinear (MFB) pooling method and a co‑attention mechanism to efficiently fuse visual and textual features and enhance VQA performance. The authors integrate MFB pooling with co‑attention in an end‑to‑end deep network, yielding a unified architecture that jointly learns image and question attentions. Experiments show that the single MFB‑co‑attention model attains new state‑of‑the‑art results on the real‑world VQA dataset. Code is available at https://github.com/yuzcccc/mfb.
Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both the visual content of images and the textual content of questions. The approaches used to represent the images and questions in a fine-grained manner and to fuse these multimodal features play key roles in performance. Bilinear pooling based models have been shown to outperform traditional linear models for VQA, but their high-dimensional representations and high computational complexity may seriously limit their applicability in practice. For multimodal feature fusion, here we develop a Multi-modal Factorized Bilinear (MFB) pooling approach to efficiently and effectively combine multi-modal features, which results in superior performance for VQA compared with other bilinear pooling approaches. For fine-grained image and question representation, we develop a "co-attention" mechanism using an end-to-end deep network architecture to jointly learn both the image and question attentions. Combining the proposed MFB approach with co-attention learning in a new network architecture provides a unified model for VQA. Our experimental results demonstrate that the single MFB with co-attention model achieves new state-of-the-art performance on the real-world VQA dataset. Code available at https://github.com/yuzcccc/mfb.
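The abstract does not spell out the pooling formula, so the following is only a minimal sketch of one way a factorized bilinear fusion can be written. It assumes the common low-rank formulation in which each output element is x^T U_i V_i^T y, computed as two linear projections, an element-wise product, and a sum over non-overlapping windows of size k. The power and L2 normalization steps, the function names `project` and `mfb_pool`, and all dimensions are assumptions made for illustration; they are not taken from the authors' repository.

```python
import math
import random


def project(vec, mat):
    """Return vec @ mat, where mat is an (len(vec) x d) matrix stored as a list of rows."""
    d = len(mat[0])
    return [sum(vec[r] * mat[r][c] for r in range(len(vec))) for c in range(d)]


def mfb_pool(x, y, U, V, k):
    """Fuse x (length m) and y (length n) into a vector of length o.

    U is m x (k*o) and V is n x (k*o); k is the assumed factor rank.
    """
    # Project both modalities into a shared (k*o)-dimensional space.
    px = project(x, U)
    py = project(y, V)
    # Element-wise product captures pairwise feature interactions.
    joint = [a * b for a, b in zip(px, py)]
    # Sum pooling over non-overlapping windows of size k gives o outputs.
    pooled = [sum(joint[i:i + k]) for i in range(0, len(joint), k)]
    # Signed square root (power normalization), assumed here for stability.
    powered = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    # L2 normalization, skipped if the vector is all zeros.
    norm = math.sqrt(sum(v * v for v in powered))
    return [v / norm for v in powered] if norm > 0 else powered


# Toy dimensions chosen only for illustration.
random.seed(0)
m, n, k, o = 6, 4, 3, 5
x = [random.gauss(0, 1) for _ in range(m)]          # stand-in image feature
y = [random.gauss(0, 1) for _ in range(n)]          # stand-in question feature
U = [[random.gauss(0, 1) for _ in range(k * o)] for _ in range(m)]
V = [[random.gauss(0, 1) for _ in range(k * o)] for _ in range(n)]

z = mfb_pool(x, y, U, V, k)
print(len(z))
```

Under these assumptions the fusion needs (m + n) * k * o parameters, whereas a full bilinear map would need m * n * o, which is the efficiency argument the abstract makes against high-dimensional bilinear pooling.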