Publication | Open Access
BERT Representations for Video Question Answering
Citations: 107
References: 30
Year: 2020
Topics: Engineering, Machine Learning, Video Summarization, Video Retrieval, Corpus Linguistics, Video Interpretation, Natural Language Processing, Visual Content, Multimodal LLM, Image Analysis, Visual Grounding, Computational Linguistics, Visual Features, Visual Question Answering, Language Studies, Machine Translation, Vision Language Model, Computer Science, Deep Learning, Computer Vision, BERT Representations, Linguistics
TL;DR: Visual question answering seeks to answer questions about images or videos, but most research has focused on images, while video VQA additionally requires modeling temporal visual features and associated subtitles. The authors propose to use BERT, a Transformer-based sequential model, to encode the complex semantics of video clips. Their model jointly captures visual and language information by encoding both subtitles and a sequence of visual concepts with a pretrained language-based Transformer. Experiments on TVQA and Pororo show substantial improvements over previous methods.
Abstract: Visual question answering (VQA) aims at answering questions about the visual content of an image or a video. Currently, most work on VQA is focused on image-based question answering, and less attention has been paid to answering questions about videos. However, VQA in video presents some unique challenges that are worth studying: it not only requires modeling a sequence of visual features over time, but it often also needs to reason about associated subtitles. In this work, we propose to use BERT, a sequential modeling technique based on Transformers, to encode the complex semantics of video clips. Our proposed model jointly captures the visual and language information of a video scene by encoding not only the subtitles but also a sequence of visual concepts with a pretrained language-based Transformer. In our experiments, we exhaustively study the performance of our model under different input arrangements, showing substantial improvements over previous work on two well-known video VQA datasets: TVQA and Pororo.
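The abstract describes feeding a pretrained language Transformer both subtitles and a sequence of visual concepts. A minimal sketch of one such input arrangement is shown below; the exact segment layout, helper name `build_bert_input`, and the example strings are illustrative assumptions, not the authors' precise configuration:

```python
def build_bert_input(question, answer, subtitle, visual_concepts,
                     max_len=128):
    """Flatten one QA pair plus video context into a BERT-style
    token sequence: [CLS] question + answer [SEP] context [SEP].

    `visual_concepts` is a list of concept labels (e.g. detected
    objects), treated as plain words so a pretrained language-based
    Transformer can encode them alongside the subtitle words.
    This layout is an illustrative assumption.
    """
    segment_a = question.split() + answer.split()
    # Video context: subtitle words followed by visual-concept labels.
    segment_b = subtitle.split() + list(visual_concepts)
    tokens = ["[CLS]"] + segment_a + ["[SEP]"] + segment_b + ["[SEP]"]
    # Segment ids distinguish the QA pair (0) from the context (1).
    first_sep = tokens.index("[SEP]")
    segment_ids = ([0] * (first_sep + 1)
                   + [1] * (len(tokens) - first_sep - 1))
    return tokens[:max_len], segment_ids[:max_len]

# Hypothetical example in the style of the Pororo dataset.
tokens, seg = build_bert_input(
    "What is Pororo holding?", "a fish",
    "Pororo walks to the lake.", ["penguin", "lake", "fish"])
```

In practice the token list would be passed through a WordPiece tokenizer and the pretrained BERT encoder; this sketch only shows how the two modalities can share one text-like input sequence.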