Omnivore: A Single Model for Many Visual Modalities

Abstract

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our ‘Omnivore’ model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs on par with or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.
