What Does BERT with Vision Look At?
Year: 2020
Venue: ACL 2020
Citations: 115
References: 28
Topics: Language Grounding, Engineering, Vision Look, Corpus Linguistics, Natural Language Processing, Multimodal LLM, Image Analysis, Visual Grounding, Computational Linguistics, Visual Question Answering, Language Studies, Entity Grounding, Machine Translation, Machine Vision, Ophthalmology, Vision Language Model, Pre-trained Models, Vision Research, Deep Learning, Computer Vision, Visual Function, Eye Tracking, Syntactic Grounding, Linguistics
Pre-trained visually grounded language models such as ViLBERT, LXMERT, and UNITER have achieved significant performance improvements on vision-and-language tasks, but what they learn during pre-training remains unclear. In this work, we demonstrate that certain attention heads of a visually grounded language model actively ground elements of language to image regions. Specifically, some heads can map entities to image regions, performing the task known as entity grounding. Some heads can even detect the syntactic relations between non-entity words and image regions, tracking, for example, associations between verbs and regions corresponding to their arguments. We refer to this ability as syntactic grounding. We verify grounding both quantitatively and qualitatively, using Flickr30K Entities as a testbed.
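To make the probing idea concrete, below is a minimal sketch of how one attention head could be scored for entity grounding: the image region receiving the highest attention from an entity token is taken as that head's prediction and compared against a Flickr30K Entities gold box by IoU. The function names, argument shapes, and the 0.5 IoU threshold are illustrative assumptions, not the paper's exact evaluation code.

```python
import torch

def box_iou(box_a: torch.Tensor, box_b: torch.Tensor) -> torch.Tensor:
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    x1 = torch.max(box_a[0], box_b[0])
    y1 = torch.max(box_a[1], box_b[1])
    x2 = torch.min(box_a[2], box_b[2])
    y2 = torch.min(box_a[3], box_b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def grounding_accuracy(
    attn: torch.Tensor,          # (num_text_tokens, num_regions) weights for one head
    entity_to_gold: dict,        # entity token index -> gold box tensor (x1, y1, x2, y2)
    region_boxes: torch.Tensor,  # (num_regions, 4) proposal boxes fed to the model
    iou_threshold: float = 0.5,  # assumed matching criterion
) -> float:
    """Fraction of entity tokens whose most-attended region matches the gold box."""
    hits = 0
    for tok_idx, gold_box in entity_to_gold.items():
        # The head's "grounding" prediction: the region this token attends to most.
        pred_region = attn[tok_idx].argmax().item()
        if box_iou(region_boxes[pred_region], gold_box) >= iou_threshold:
            hits += 1
    return hits / max(len(entity_to_gold), 1)
```

Sweeping this measurement over every layer and head would identify which heads, if any, ground well; the same scoring extends to syntactic grounding by using a verb token's index paired with the gold region of its argument.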