Learning Hierarchical Self-Attention for Video Summarization

Abstract

Video summarization remains a challenging task. Because of the abundance of video data on the Internet, the task draws significant attention in the vision community and benefits a wide range of applications, e.g., video retrieval and search. To summarize a video effectively by deriving the keyframes that represent it, we propose a novel framework named Hierarchical Multi-Attention Network (H-MAN), which comprises a shot-level reconstruction model and a multi-head attention model. Our attention model has a two-stage hierarchical structure that produces various attention maps, and we are among the first to apply a multi-attention mechanism to the video summarization task, which improves performance. Quantitative and qualitative results demonstrate the effectiveness of our model, which performs favorably against state-of-the-art approaches.
