How2: A Large-scale Dataset for Multimodal Language Understanding

TLDR

Human information processing is inherently multimodal, and language is best understood in a situated context, yet few multimodal datasets exist to support research, limiting cross‑community collaboration. This paper introduces How2, a large multimodal, multilingual instructional video dataset of 80,000 clips (~2,000 h) with word‑level time‑aligned English subtitles, aiming to encourage collaboration across language, speech, and vision communities. How2 provides multimodal and multilingual data, including crowdsourced Portuguese translations of the subtitles, enabling joint processing of visual, audio, and textual modalities. Baseline experiments demonstrate that multimodal models outperform monomodal ones on several language processing tasks, offering insights into the utility of different modalities.

Abstract

Human information processing is inherently multimodal, and language is best understood in a situated context. To achieve human-like language processing capabilities, machines should be able to jointly process multimodal data, and not just text, images, or speech in isolation. Nevertheless, there are very few multimodal datasets to support such research, resulting in limited interaction among different research communities. In this paper, we introduce How2, a large-scale dataset of instructional videos covering a wide variety of topics across 80,000 clips (about 2,000 hours), with word-level time alignments to the ground-truth English subtitles. In addition to being multimodal, How2 is multilingual: we crowdsourced Portuguese translations of the subtitles. We present results for monomodal and multimodal baselines on several language processing tasks, with interesting insights on the utility of different modalities. We hope that by making the How2 dataset and baselines available we will encourage collaboration across language, speech, and vision communities.
