Large-Scale Video Classification with Convolutional Neural Networks

TLDR

Convolutional Neural Networks (CNNs) have proven to be powerful models for image recognition tasks. The study evaluates CNNs on large‑scale video classification using a new dataset of 1 million YouTube videos spanning 487 classes. The authors extend CNN connectivity into the temporal domain, propose a multiresolution foveated architecture to accelerate training, and fine‑tune the top layers on the UCF‑101 dataset to test generalization. Spatio‑temporal CNNs reach 63.9% accuracy, an 8.6‑point improvement over feature‑based baselines but only a 1.6‑point gain over single‑frame models, and fine‑tuning on UCF‑101 boosts accuracy from 43.9% to 63.3%.

Abstract

Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).

References

Page 1

	Year	Citations

Page 1