Publication | Open Access
Explore Inter-contrast between Videos via Composition for Weakly Supervised Temporal Sentence Grounding
Citations: 25 | References: 21 | Year: 2022
Engineering, Machine Learning, Video Summarization, Corpus Linguistics, Video Interpretation, Natural Language Processing, Image Analysis, Visual Grounding, Computational Linguistics, Temporal Sentence Grounding, Temporal Annotations, Language Studies, Composed Video, Machine Translation, Machine Vision, Vision Language Model, Video Understanding, Deep Learning, Computer Vision, Multi-modal Summarization, Temporal Sentence, Linguistics
Weakly supervised temporal sentence grounding aims to temporally localize the target segment corresponding to a given natural language query, where only video-query pairs without temporal annotations are provided during training. Most existing methods use the fused visual-linguistic feature to reconstruct the query, and the segment with the least reconstruction error is taken as the target segment. This work introduces a novel approach that explores the inter-contrast between videos in a composed video by selecting components from two different videos and fusing them into a single video. This straightforward yet effective composition strategy provides temporal annotations at multiple composed positions, resulting in numerous videos with temporal ground truths for training the temporal sentence grounding task. A transformer framework is introduced with multi-task training to learn a compact but efficient visual-linguistic space. Experimental results on the public Charades-STA and ActivityNet-Caption datasets demonstrate the effectiveness of the proposed method, which achieves performance comparable to state-of-the-art weakly supervised baselines. The code is available at https://github.com/PPjmchen/Composition_WSTG.
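The composition idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes each video is already represented as a per-clip feature array, and the function name `compose_pair`, its parameters, and the simple "crop and concatenate" rule are all hypothetical choices made for illustration. The point it shows is that once two videos are joined, the boundary between them is known, so each query gets a pseudo temporal annotation for free.

```python
import numpy as np


def compose_pair(feats_a, feats_b, len_a, len_b, a_first=True, rng=None):
    """Compose one video from two feature sequences (illustrative only).

    feats_a, feats_b: arrays of shape (num_clips, feat_dim).
    len_a, len_b: number of consecutive clips to take from each video.
    Returns the composed features plus the normalized (start, end) span
    occupied by each source video, usable as pseudo ground truth for
    the query paired with that source video.
    """
    rng = np.random.default_rng() if rng is None else rng

    def crop(feats, n):
        # Take n consecutive clips starting at a random offset.
        if not 0 < n <= len(feats):
            raise ValueError("crop length must be in 1..len(feats)")
        start = int(rng.integers(0, len(feats) - n + 1))
        return feats[start:start + n]

    clip_a = crop(feats_a, len_a)
    clip_b = crop(feats_b, len_b)
    total = len_a + len_b

    if a_first:
        composed = np.concatenate([clip_a, clip_b], axis=0)
        span_a = (0.0, len_a / total)
        span_b = (len_a / total, 1.0)
    else:
        composed = np.concatenate([clip_b, clip_a], axis=0)
        span_b = (0.0, len_b / total)
        span_a = (len_b / total, 1.0)

    return composed, span_a, span_b
```

Varying `len_a`, `len_b`, and `a_first` places the boundary at different positions, which is one simple way to obtain many composed videos with known spans from the same pair of sources.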