SCNN: Sequential convolutional neural network for human action recognition in videos

Abstract

Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are two typical kinds of neural networks. CNN models have achieved great success in image recognition thanks to their strong ability to abstract spatial information at multiple levels. RNN models, although they can inherently model temporal dependencies in videos, have not achieved comparable progress in video analysis tasks such as action recognition. In this work, we propose a Sequential Convolutional Neural Network, denoted SCNN, that extracts effective spatial-temporal features from videos by combining the strengths of convolutional and recurrent operations. SCNN extends the RNN to process feature maps directly, rather than vectors flattened from feature maps, so that the spatial structure of the inputs is preserved. It replaces the fully connected layers of the RNN with convolutional connections, which reduces the number of parameters, the computational cost, and the risk of over-fitting. Moreover, we introduce asymmetric convolutional layers into SCNN to further reduce the number of parameters and the computational cost. Our final deep SCNN architecture for action recognition achieves strong performance on two challenging benchmarks, UCF-101 and HMDB-51, outperforming many state-of-the-art methods.
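The parameter savings described in the abstract can be illustrated with a rough count. The sketch below is not the authors' implementation: the function names, the feature-map size (512 x 7 x 7), the hidden size, the 3 x 3 kernel, and the reading of "asymmetric convolution" as a 1 x k layer followed by a k x 1 layer are all assumptions made for illustration.

```python
# Rough parameter counts for one recurrent step under three designs.
# All shapes and the asymmetric factorization are illustrative assumptions,
# not values taken from the paper.

def fc_rnn_params(c, h, w, hidden):
    """Vanilla RNN step on a flattened C x H x W feature map."""
    in_dim = c * h * w
    # input-to-hidden weights + hidden-to-hidden weights + bias
    return in_dim * hidden + hidden * hidden + hidden

def conv_rnn_params(c_in, c_hidden, k):
    """Recurrent step whose two connections are k x k convolutions."""
    # input-to-hidden conv + hidden-to-hidden conv + per-channel bias
    return k * k * c_in * c_hidden + k * k * c_hidden * c_hidden + c_hidden

def asym_conv_rnn_params(c_in, c_hidden, k):
    """As above, with each k x k conv replaced by a 1 x k then k x 1 pair."""
    # input path: (1 x k, c_in -> c_hidden) then (k x 1, c_hidden -> c_hidden)
    input_pair = k * c_in * c_hidden + k * c_hidden * c_hidden
    # hidden path: both layers map c_hidden -> c_hidden
    hidden_pair = 2 * k * c_hidden * c_hidden
    # a single per-channel bias is counted, for simplicity
    return input_pair + hidden_pair + c_hidden

if __name__ == "__main__":
    # Hypothetical 512-channel 7 x 7 feature map with 512 hidden units/channels.
    print("fully connected:", fc_rnn_params(512, 7, 7, 512))
    print("convolutional:  ", conv_rnn_params(512, 512, 3))
    print("asymmetric conv:", asym_conv_rnn_params(512, 512, 3))
```

Under these assumed shapes the convolutional step needs fewer weights than the fully connected one even though its hidden state is a full 512 x 7 x 7 map rather than a 512-dimensional vector, and the asymmetric variant is smaller again. The exact ratios depend on the real layer sizes, which the abstract does not give.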
