Temporally Grounding Natural Sentence in Video

TLDR

The study introduces an efficient method for localizing natural sentences in long, untrimmed videos. The authors develop Temporal GroundNet (TGN), which scores the temporal candidates ending at each frame using frame-by-word interactions, aggregates historical context, and produces the grounding result in a single pass. Experiments on three public datasets show that TGN outperforms existing methods, and an ablation study and a runtime test confirm its effectiveness and efficiency.

Abstract

We introduce an effective and efficient method that grounds (i.e., localizes) natural sentences in long, untrimmed video sequences. Specifically, a novel Temporal GroundNet (TGN) is proposed to temporally capture the evolving fine-grained frame-by-word interactions between video and sentence. TGN sequentially scores a set of temporal candidates ending at each frame based on the exploited frame-by-word interactions, and finally grounds the segment corresponding to the sentence. Unlike traditional methods that treat overlapping segments separately in a sliding-window fashion, TGN aggregates the historical information and generates the final grounding result in a single pass. We extensively evaluate the proposed TGN on three public datasets and obtain significant improvements over the state of the art. We further show the consistent effectiveness and efficiency of TGN through an ablation study and a runtime test.
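
To make the candidate scheme concrete, the following is a minimal sketch of the final selection step only, not the authors' implementation. It assumes the per-frame scores have already been produced by some model, and that the K candidates ending at frame t have lengths delta, 2*delta, ..., K*delta frames. The function name `best_segment`, the `scores` layout, and the `delta` parameter are hypothetical and chosen for illustration.

```python
import numpy as np


def best_segment(scores: np.ndarray, delta: int) -> tuple[int, int]:
    """Pick the highest-scoring candidate segment.

    scores[t, k] is assumed to be the confidence that the candidate
    ending at frame t with length (k + 1) * delta matches the sentence.
    Returns (start_frame, end_frame), both inclusive.
    """
    num_frames, num_scales = scores.shape
    ends = np.arange(num_frames)[:, None]                  # shape (T, 1)
    lengths = (np.arange(num_scales)[None, :] + 1) * delta  # shape (1, K)
    starts = ends - lengths + 1                             # shape (T, K)

    # Candidates that would begin before the first frame are not valid.
    masked = np.where(starts >= 0, scores, -np.inf)

    t, k = np.unravel_index(np.argmax(masked), masked.shape)
    return int(starts[t, k]), int(t)


# Toy usage: 6 frames, 2 candidate lengths (2 and 4 frames).
scores = np.zeros((6, 2))
scores[0, 0] = 1.0   # invalid: a 2-frame segment cannot end at frame 0
scores[4, 1] = 0.9   # 4-frame segment ending at frame 4
print(best_segment(scores, delta=2))
```

Because every frame contributes all of its candidates in one score matrix, the selection needs only one scan over the video, in contrast to re-encoding each overlapping sliding window separately.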
