Action Recognition in Video Sequences using Deep Bi-Directional LSTM With CNN Features

TLDR

Recurrent neural networks and long short‑term memory models have achieved state‑of‑the‑art results in processing sequential multimedia data such as speech, video, and text. The authors propose a novel action recognition approach that processes video data with convolutional neural network features and a deep bidirectional LSTM network. The method extracts deep CNN features from every sixth frame and feeds them into a multi‑layer deep bidirectional LSTM that learns long‑term temporal dependencies, enabling the analysis of lengthy videos. Experiments on UCF‑101, YouTube‑11 Actions, and HMDB51 demonstrate significant performance gains over state‑of‑the‑art action recognition methods.

Abstract

Recurrent neural networks (RNNs) and long short-term memory (LSTM) networks have achieved great success in processing sequential multimedia data and have yielded state-of-the-art results in speech recognition, digital signal processing, video processing, and text data analysis. In this paper, we propose a novel action recognition method that processes video data using a convolutional neural network (CNN) and a deep bidirectional LSTM (DB-LSTM) network. First, deep features are extracted from every sixth frame of the videos, which helps reduce redundancy and complexity. Next, the sequential information among frame features is learned using the DB-LSTM network, where multiple layers are stacked in both the forward pass and the backward pass to increase its depth. The proposed method is capable of learning long-term sequences and can process lengthy videos by analyzing features over a certain time interval. Experimental results on three benchmark data sets, UCF-101, YouTube 11 Actions, and HMDB51, show significant improvements in action recognition compared with state-of-the-art action recognition methods.
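
The frame-sampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the choice to start sampling at frame 0, and the interval length of 4 sampled frames are all assumptions made for illustration, since the abstract only states that every sixth frame is used and that features are analyzed over "a certain time interval". Integers stand in for decoded frames; in the described method each kept frame would be passed through a CNN and the resulting feature sequence fed to the DB-LSTM.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_every_nth(frames: Sequence[T], step: int = 6) -> List[T]:
    """Keep frames 0, step, 2*step, ... (step=6 follows the abstract;
    starting at index 0 is an assumption)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return list(frames[::step])


def split_into_intervals(items: Sequence[T], interval_len: int) -> List[List[T]]:
    """Group sampled frames into consecutive chunks of at most
    interval_len items (interval_len is a hypothetical parameter)."""
    if interval_len < 1:
        raise ValueError("interval_len must be >= 1")
    return [list(items[i:i + interval_len])
            for i in range(0, len(items), interval_len)]


# Stand-in for a 60-frame video: frame indices instead of images.
frame_indices = list(range(60))

sampled = sample_every_nth(frame_indices)            # every sixth frame
chunks = split_into_intervals(sampled, interval_len=4)

print(sampled)  # [0, 6, 12, 18, 24, 30, 36, 42, 48, 54]
print(chunks)   # [[0, 6, 12, 18], [24, 30, 36, 42], [48, 54]]
```

Sampling before feature extraction means the CNN runs on roughly one sixth of the frames, which is the redundancy and complexity reduction the abstract refers to.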
