Unsupervised Learning of Video Representations using LSTMs

TLDR

The authors train an LSTM encoder–decoder that maps video sequences to fixed‑length representations, which are decoded to reconstruct or predict future frames using both raw pixel patches and pretrained CNN percepts, and then fine‑tune these representations for action recognition. The learned representations boost action‑recognition accuracy, especially with limited training data, and pretrained models on unrelated 300‑hour YouTube videos also improve performance.

Abstract

We use Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence, or predicting the future sequence. We experiment with two kinds of input sequences - patches of image pixels and high-level representations (percepts) of video frames extracted using a pretrained convolutional net. We explore different design choices such as whether the decoder LSTMs should condition on the generated output. We analyze the outputs of the model qualitatively to see how well the model can extrapolate the learned video representation into the future and into the past. We further evaluate the representations by finetuning them for a supervised learning problem - human action recognition on the UCF-101 and HMDB-51 datasets. We show that the representations help improve classification accuracy, especially when there are only few training examples. Even models pretrained on unrelated datasets (300 hours of YouTube videos) can help action recognition performance.

References

Page 1

	Year	Citations

Page 1