Transitional Adaptation of Pretrained Models for Visual Storytelling

TLDR

Vision-to-language models typically pretrain a visual encoder and a language generator separately and then fine-tune them jointly, but this direct transfer can cause a mismatch between visual specificity and language fluency because the two modules share no common training ground. The study argues that a transitional adaptation step is needed between pretraining and fine-tuning to align the visual encoder and the language model for challenging tasks such as visual storytelling. The authors introduce Transitional Adaptation of Pre-trained Model (TAPM), which aligns the multimodal modules through a simpler alignment task on visual inputs only, without requiring text labels. Experiments show that TAPM improves caption quality across multiple language models, achieves state-of-the-art results on LSMDC 2019 and VIST, and that the gains do not depend on the specific language model used.

Abstract

Previous models for vision-to-language generation tasks usually pretrain a visual encoder and a language generator in their respective domains and jointly finetune them on the target task. However, this direct transfer practice may suffer from discord between visual specificity and language fluency, since the two modules are often trained separately on large corpora of visual and text data with no common ground. In this work, we claim that a transitional adaptation task is required between pretraining and finetuning to harmonize the visual encoder and the language model for challenging downstream target tasks like visual storytelling. We propose a novel approach named Transitional Adaptation of Pre-trained Model (TAPM) that adapts the multi-modal modules to each other with a simpler alignment task that uses visual inputs only, with no need for text labels. Through extensive experiments, we show that the adaptation step significantly improves the performance of multiple language models on sequential video and image captioning tasks. We achieve new state-of-the-art performance on both language metrics and human evaluation in the multi-sentence description task of LSMDC 2019 [50] and the image storytelling task of VIST [18]. Our experiments reveal that this improvement in caption quality does not depend on the specific choice of language model.
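
To make the idea of an alignment task with no text labels more concrete, the sketch below shows one possible form such an objective could take. It is an illustrative assumption, not the loss defined in the paper: an InfoNCE-style contrastive loss in which the context vector the language model produces at position i (given visual inputs only) should score highest against the visual feature at the same position, with the other positions acting as negatives. The function names `dot` and `alignment_loss` and the toy vectors are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def alignment_loss(contexts, visuals):
    """Illustrative contrastive alignment loss (an assumption, not TAPM's exact loss).

    contexts[i]: vector produced by the language model at position i
                 when it is fed visual inputs only (no text).
    visuals[i]:  visual-encoder feature for position i.
    The loss is low when contexts[i] scores highest against visuals[i].
    """
    assert contexts and len(contexts) == len(visuals)
    total = 0.0
    for i, h in enumerate(contexts):
        scores = [dot(h, v) for v in visuals]
        # Numerically stable log-sum-exp over all candidate visual features.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of picking the matching visual feature.
        total += log_z - scores[i]
    return total / len(contexts)


# Toy data: two positions with orthogonal visual features.
visuals = [[4.0, 0.0], [0.0, 4.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # contexts point at their own visuals
shuffled = [aligned[1], aligned[0]]  # contexts point at the wrong visuals

loss_aligned = alignment_loss(aligned, visuals)
loss_shuffled = alignment_loss(shuffled, visuals)
```

Under this assumed objective, the aligned contexts give a smaller loss than the shuffled ones, so minimizing it would pull the language model's representations toward the visual encoder's features using visual inputs alone.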
