GLNet: Global Local Network for Weakly Supervised Action Localization

Abstract

In this paper, we address the challenging problem of weakly supervised spatio-temporal action localization for which only video-level action labels are available during training. To solve this problem, we propose an end-to-end Global Local Network (GLNet) to predict the probability distribution simultaneously in both spatial and temporal space. The proposed GLNet model includes two key components: a local spatial module and a global temporal module. The local spatial module aims to predict the frame-level spatial distribution by encoding short-term temporal information. In particular, we propose a Region Actionness Network (RAN) to select the target region boxes from the precomputed exhaustive proposals. The global temporal module can predict temporal distribution by a long-term temporal structure modelling. Specifically, we design a temporal fusion-and-excitation architecture on the top of several clips, and trained by a sparse loss function. Therefore, the proposed GLNet model can perform spatio-temporal action localization in an end-to-end manner. We evaluate the performance of GLNet on the J-HMDB and UCF101-24 datasets. The experimental results demonstrate GLNet achieves a significant margin against other state-of-the-art weakly supervised methods and even some fully supervised methods in terms of frame mean Average Precision (mAP) and the video mAP (called frame-mAP and video-mAP, respectively).

References

Page 1

	Year	Citations

Page 1