Omnivore: A Single Model for Many Visual Modalities

Year: 2022
Citations: 174
References: 69

Abstract

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our ‘Omnivore’ model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs on par with or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.

