Publication | Closed Access
EgoVQA - An Egocentric Video Question Answering Benchmark Dataset
Citations: 29
References: 45
Year: 2019
Unknown Venue
Artificial Intelligence, Video Question Answering, Engineering, Machine Learning, Video Retrieval, EgoVQA Dataset, Video Interpretation, Natural Language Processing, Visual Grounding, Data Science, Affective Computing, Visual Question Answering, Machine Vision, Vision Language Model, Computer Science, Video Understanding, Deep Learning, Computer Vision, Eye Tracking, Human-computer Interaction
Recently, much effort and attention have been devoted to Visual Question Answering (VQA) on static images and Video Question Answering (VideoQA) on third-person videos. Meanwhile, first-person question answering has more natural use cases, yet the topic remains seldom studied. A typical scenario is an intelligent agent that assists handicapped people in perceiving the environment through queries, localizing objects and persons based on descriptions, and identifying the intentions of surrounding people to guide their reactions (e.g., shake hands or avoid punches). However, due to the lack of first-person video datasets, few studies have been carried out on the first-person VideoQA task. To address this issue, we collected a novel egocentric VideoQA dataset called EgoVQA, with 600 question-answer pairs with visual contents across 5,000 frames from 16 first-person videos. Various types of queries, such as "Who", "What", and "How many", are provided to form a semantically rich corpus. We use this dataset to evaluate four mainstream third-person VideoQA methods and illustrate their performance gap between first-person related questions and third-person related questions. We believe that the EgoVQA dataset will facilitate future research on the imperative task of first-person VideoQA.