Publication | Closed Access
ActBERT: Learning Global-Local Video-Text Representations
Citations: 401
References: 49
Year: 2020
Venue: Unknown
Keywords: Artificial Intelligence, Natural Language Processing, Joint Video-text Representations, Multimodal LLM, Engineering, Machine Learning, Pattern Recognition, Vision Language Model, Video Summarization, Action Step Localization, Video Understanding, Deep Learning, Linguistics, Video Interpretation, Computer Vision, Representation Learning
In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects. This uncovers global and local visual clues from paired video sequences and text descriptions for detailed modeling of visual and text relations. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious extraction of clues from contextual information. This enforces the joint video-text representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and-language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperforms state-of-the-art methods, demonstrating its superiority in video-text representation learning.
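The abstract does not spell out the internals of the ENT block, so the following is only a toy sketch of the general idea of letting a global action stream mediate the exchange between region and text streams. It is not the authors' implementation: the function names `attend` and `entangled_step` are hypothetical, and the wiring shown (action tokens query each local stream, then each local stream attends to the cues gathered from the other) is an assumption made for illustration. Learned projections, multiple heads, and normalization are all omitted.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    # Plain scaled dot-product attention: one output vector per query.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def add(a, b):
    # Element-wise residual sum of two equally shaped lists of vectors.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def entangled_step(action, regions, text):
    # ASSUMPTION: illustrative wiring only, not taken from the paper.
    # 1) Global action tokens query both local streams to gather cues.
    action_from_text = attend(action, text, text)
    action_from_regions = attend(action, regions, regions)
    # 2) Each local stream attends to the cues gathered from the other one,
    #    so text and regions interact through the global action context.
    text_out = add(text, attend(text, action_from_regions, action_from_regions))
    regions_out = add(regions, attend(regions, action_from_text, action_from_text))
    # 3) The action stream keeps a residual of everything it gathered.
    action_out = add(action, add(action_from_text, action_from_regions))
    return action_out, regions_out, text_out


# Made-up 2-dimensional features: 1 action token, 2 regions, 3 words.
action = [[1.0, 0.0]]
regions = [[0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
a_out, r_out, t_out = entangled_step(action, regions, text)
print(len(a_out), len(r_out), len(t_out))
```

Each stream keeps its own length and feature size, so a step like this could in principle be stacked. With a single action token, every word receives the same region-derived cue, which makes the mediating role of the global stream easy to see in this toy setting.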