Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks

TLDR

Most existing video categorization methods fuse multiple features with simple strategies and ignore inter-class semantic relationships. This paper addresses high-level video categorization with a unified framework that estimates both feature relationships and class relationships and uses them to regularize the training of a deep neural network. The resulting regularized DNN (rDNN) outperforms several state-of-the-art approaches and achieves competitive results on the Hollywood2 and Columbia Consumer Video benchmarks. The authors also release FCVID, a dataset of 91,223 Internet videos with 239 manually annotated categories.

Abstract

In this paper, we study the challenging problem of categorizing videos according to high-level semantics such as the existence of a particular human action or a complex event. Although extensive effort has been devoted to this problem in recent years, most existing works combine multiple video features using simple fusion strategies and neglect inter-class semantic relationships. This paper proposes a novel unified framework that jointly exploits feature relationships and class relationships for improved categorization performance. Specifically, these two types of relationships are estimated and utilized by imposing regularizations in the learning process of a deep neural network (DNN). By equipping the DNN to better harness both the feature and the class relationships, the proposed regularized DNN (rDNN) is more suitable for modeling video semantics. We show that rDNN outperforms several state-of-the-art approaches. Competitive results are reported on the well-known Hollywood2 and Columbia Consumer Video benchmarks. In addition, to stimulate future research on large-scale video categorization, we collect and release a new benchmark dataset, called FCVID, which contains 91,223 Internet videos and 239 manually annotated categories.
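
The abstract does not spell out the form of the regularizers, so the following is only a minimal sketch of the general idea of adding relationship-aware penalties to a classification loss. Everything in it is an assumption made for illustration, not the authors' formulation: the nuclear norm on a fusion-layer weight matrix stands in for a feature-relationship penalty, a graph-Laplacian term over output-layer weights stands in for a class-relationship penalty, and the names `w_fusion`, `w_out`, `class_sim`, `lam_feat`, and `lam_class` are hypothetical.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def feature_penalty(w_fusion):
    """Assumed stand-in: nuclear norm of the fusion-layer weights,
    which favors low-rank (shared) structure across input features."""
    return np.linalg.norm(w_fusion, ord="nuc")


def class_penalty(w_out, class_sim):
    """Assumed stand-in: graph-Laplacian penalty tr(W L W^T).

    w_out has shape (hidden, n_classes); class_sim is a symmetric,
    non-negative (n_classes, n_classes) similarity matrix. The penalty
    is small when similar classes have similar weight columns."""
    laplacian = np.diag(class_sim.sum(axis=1)) - class_sim
    return np.trace(w_out @ laplacian @ w_out.T)


def regularized_loss(logits, labels, w_fusion, w_out, class_sim,
                     lam_feat=1e-3, lam_class=1e-3):
    """Classification loss plus the two hypothetical relationship penalties."""
    return (softmax_cross_entropy(logits, labels)
            + lam_feat * feature_penalty(w_fusion)
            + lam_class * class_penalty(w_out, class_sim))
```

In this sketch, setting `lam_feat` and `lam_class` to zero recovers the plain cross-entropy loss, which makes it easy to compare a regularized model against an unregularized baseline.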
