Publication | Closed Access
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Citations: 306
References: 33
Year: 2020
Engineering, Machine Learning, Visual-Linguistic BERT, Corpus Linguistics, Natural Language Processing, Multimodal LLM, Text-to-image Retrieval, Visual Grounding, Generic Representation, Computational Linguistics, Visual Question Answering, Language Studies, Machine Translation, Vision Language Model, Deep Learning, Computer Vision, VCR Benchmark, Generic Visual-linguistic Representations, Linguistics
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone and extends it to take both visual and linguistic embedded features as input. Each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. The model is designed to fit most visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset together with a text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure better aligns visual-linguistic clues and benefits downstream tasks such as visual commonsense reasoning, visual question answering, and referring expression comprehension. It is worth noting that VL-BERT achieved first place among single models on the leaderboard of the VCR benchmark.
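The core idea in the abstract, a single Transformer running over one joint sequence of word tokens and RoI features, can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the authors' implementation: the class name `ToyVLBert`, all dimensions, and the `roi_proj` projection are assumptions, and the real model additionally uses position and segment embeddings and the pre-training objectives, which are omitted here.

```python
# Minimal sketch (not the authors' code) of the VL-BERT input scheme:
# words and image regions-of-interest (RoIs) are embedded into a single
# sequence and processed by one Transformer backbone.
import torch
import torch.nn as nn

class ToyVLBert(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, roi_dim=2048,
                 layers=2, heads=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)   # linguistic embedding
        self.roi_proj = nn.Linear(roi_dim, hidden)         # visual feature embedding (assumed pooled RoI features)
        self.type_emb = nn.Embedding(2, hidden)            # 0 = word element, 1 = RoI element
        enc_layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)

    def forward(self, token_ids, roi_features):
        # token_ids: (B, T) word indices; roi_features: (B, R, roi_dim)
        words = self.word_emb(token_ids)                   # (B, T, hidden)
        rois = self.roi_proj(roi_features)                 # (B, R, hidden)
        seq = torch.cat([words, rois], dim=1)              # one joint input sequence
        types = torch.cat([
            torch.zeros_like(token_ids),                   # mark word positions
            torch.ones(roi_features.shape[:2], dtype=torch.long),  # mark RoI positions
        ], dim=1)
        return self.encoder(seq + self.type_emb(types))    # (B, T+R, hidden)

model = ToyVLBert()
out = model(torch.randint(0, 30522, (2, 8)), torch.randn(2, 4, 2048))
print(out.shape)  # torch.Size([2, 12, 256])
```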