Concepedia

TLDR

Bilinear models provide richer representations than linear models and have achieved state‑of‑the‑art results in tasks such as object recognition, segmentation, and visual question answering, but their high dimensionality limits practical use. This work proposes a low‑rank bilinear pooling scheme that employs the Hadamard product to enable an efficient attention mechanism for multimodal learning. The method implements low‑rank bilinear pooling via element‑wise multiplication, reducing dimensionality while preserving expressive power. On the VQA dataset, the proposed model surpasses compact bilinear pooling, attaining state‑of‑the‑art performance with improved parsimony.

Abstract

Bilinear models provide rich representations compared with linear models. They have been applied in various visual tasks, such as object recognition, segmentation, and visual question-answering, to get state-of-the-art performances taking advantage of the expanded representations. However, bilinear representations tend to be high-dimensional, limiting the applicability to computationally complex tasks. We propose low-rank bilinear pooling using Hadamard product for an efficient attention mechanism of multimodal learning. We show that our model outperforms compact bilinear pooling in visual question-answering tasks with the state-of-the-art results on the VQA dataset, having a better parsimonious property.