
TLDR

Visual question answering seeks to answer questions about images or videos; most research has focused on images, while video VQA additionally requires modeling temporal visual features and the associated subtitles. The authors propose to use BERT, a Transformer-based sequential model, to encode the complex semantics of video clips. Their model jointly captures visual and language information by encoding the subtitles and a sequence of visual concepts with a pretrained language-based Transformer. Experiments on TVQA and Pororo show substantial improvements over previous methods.

Abstract

Visual question answering (VQA) aims at answering questions about the visual content of an image or a video. Currently, most work on VQA focuses on image-based question answering, and less attention has been paid to answering questions about videos. However, VQA on video presents some unique challenges that are worth studying: it not only requires modeling a sequence of visual features over time, but it often also requires reasoning about the associated subtitles. In this work, we propose to use BERT, a sequential modelling technique based on Transformers, to encode the complex semantics of video clips. Our proposed model jointly captures the visual and language information of a video scene by encoding not only the subtitles but also a sequence of visual concepts with a pretrained language-based Transformer. In our experiments, we exhaustively study the performance of our model with different input arrangements, showing substantial improvements over previous work on two well-known video VQA datasets: TVQA and Pororo.
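
The abstract does not specify the exact input layout, but the general idea of packing subtitles and detected visual concepts into a single pretrained BERT input can be illustrated with a minimal sketch. The snippet below uses the Hugging Face `transformers` library; the subtitle text, concept words, question, and answer candidate are made-up placeholders, and the scoring head mentioned in the comments is an assumption rather than the authors' exact architecture.

```python
# Minimal sketch (not the authors' released code): encode a question/answer
# pair together with the clip context (subtitles + visual concept words)
# using a pretrained BERT, then read off the [CLS] representation.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Hypothetical inputs for one video clip and one answer candidate.
subtitles = "Sheldon: I never said that. Leonard: Yes, you did."
visual_concepts = "man glasses couch laptop living room"  # detected object/attribute labels
question = "Where are Sheldon and Leonard talking?"
answer_candidate = "In the living room"

# Segment A: question paired with one answer candidate.
# Segment B: clip context, i.e. subtitles followed by visual concept words.
text_a = f"{question} {answer_candidate}"
text_b = f"{subtitles} {visual_concepts}"

encoded = tokenizer(text_a, text_b, truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)

# The [CLS] embedding could then be passed to a small scoring head, one score
# per answer candidate, with a softmax over candidates during training.
cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding.shape)  # torch.Size([1, 768])
```

In practice one such forward pass would be run for each answer candidate, and the candidate with the highest score would be selected as the predicted answer.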
