Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training

TLDR

Cross‑lingual pre‑trained models such as XLM and Unicoder inspire the design of a vision‑language encoder. The authors introduce Unicoder‑VL, a universal encoder that learns joint vision‑language representations through pre‑training. Unicoder‑VL feeds visual and linguistic inputs into a multi‑layer Transformer and trains it on three tasks—Masked Language Modeling, Masked Object Classification, and Visual‑Linguistic Matching—to produce context‑aware joint representations. After pre‑training on large image‑caption datasets, Unicoder‑VL achieves state‑of‑the‑art or comparable performance on caption‑based image‑text retrieval and visual commonsense reasoning with only a single output layer.

Abstract

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language through pre-training. Borrowing ideas from cross-lingual pre-trained models such as XLM (Lample and Conneau 2019) and Unicoder (Huang et al. 2019), we feed both visual and linguistic content into a multi-layer Transformer (Vaswani et al. 2017) for cross-modal pre-training, which employs three pre-training tasks: Masked Language Modeling (MLM), Masked Object Classification (MOC), and Visual-Linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual content jointly. The last task predicts whether an image and a text describe each other. After pre-training on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer. We achieve state-of-the-art or comparable results on both tasks, showing the power of cross-modal pre-training.
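
As a rough illustration of how the three pre-training tasks could shape a single training example, the sketch below builds masked inputs and targets from a caption and a list of detected objects. It is not the authors' implementation. The function name `build_example`, the 15% masking rate, the 50% chance of swapping in a mismatched caption, and the use of object label strings in place of region features are all assumptions made for the example.

```python
import random

MASK = "[MASK]"  # placeholder for a hidden token or region (assumed convention)


def build_example(caption, regions, other_caption, mask_prob=0.15, rng=None):
    """Build one illustrative pre-training example (hypothetical helper).

    caption:       list of text tokens paired with the image
    regions:       list of object labels standing in for region features
    other_caption: tokens of a caption that belongs to a different image
    """
    rng = rng or random.Random()

    # VLM: keep the true caption or swap in a mismatched one (assumed 50/50).
    matched = rng.random() < 0.5
    tokens = list(caption if matched else other_caption)
    region_inputs = list(regions)

    # MLM: hide some text tokens; the target is the original token.
    mlm_targets = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            mlm_targets[i] = tok
            tokens[i] = MASK

    # MOC: hide some regions; the target is the object label of that region.
    moc_targets = [None] * len(region_inputs)
    for j, label in enumerate(region_inputs):
        if rng.random() < mask_prob:
            moc_targets[j] = label
            region_inputs[j] = MASK

    return {
        "tokens": tokens,
        "regions": region_inputs,
        "mlm_targets": mlm_targets,   # None where the token was not masked
        "moc_targets": moc_targets,   # None where the region was not masked
        "vlm_label": int(matched),    # 1 if caption and image belong together
    }


if __name__ == "__main__":
    example = build_example(
        caption=["a", "dog", "runs", "on", "grass"],
        regions=["dog", "grass", "sky"],
        other_caption=["two", "cats", "sleep"],
        rng=random.Random(0),
    )
    print(example)
```

In a full model, the masked tokens and regions would pass through the shared Transformer, and each task would add its own prediction head on the resulting representations; the sketch covers only how the inputs and targets are prepared.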
