EgoVQA - An Egocentric Video Question Answering Benchmark Dataset

Abstract

Recently, much effort and attention have been devoted to Visual Question Answering (VQA) on static images and Video Question Answering (VideoQA) on third-person videos. Meanwhile, first-person question answering has more natural use cases, yet it remains seldom studied. A typical scenario is an intelligent agent that assists handicapped people in perceiving the environment through queries, localizing objects and persons based on descriptions, and identifying the intentions of surrounding people to guide their reactions (e.g., shake hands or avoid punches). However, due to the lack of first-person video datasets, few studies have been carried out on the first-person VideoQA task. To address this issue, we collected a novel egocentric VideoQA dataset called EgoVQA, with 600 question-answer pairs with visual content across 5,000 frames from 16 first-person videos. Various types of queries such as "Who", "What", and "How many" are provided to form a semantically rich corpus. We use this dataset to evaluate four mainstream third-person VideoQA methods and illustrate their performance gap between first-person related questions and third-person related questions. We believe that the EgoVQA dataset will facilitate future research on the imperative task of first-person VideoQA.
