Publication | Closed Access
Self-Supervised Vision-Language Pretraining for Medical Visual Question Answering
Citations: 63
References: 17
Year: 2023
Unknown Venue
Natural Language Processing, Multimodal LLM, Image Text Alignment, Image Analysis, Machine Learning, Engineering, Text-to-image Retrieval, Pattern Recognition, Visual Grounding, Medical Image Computing, Vision Language Model, Masked Language Modeling, Visual Question Answering, Self-supervised Vision-language Pretraining, Deep Learning, Radiographic Image, Computer Vision, Machine Translation
Medical visual question answering (VQA) is the task of answering clinical questions about a given radiographic image. It is challenging because the model must integrate both vision and language information. Because training data for medical VQA tasks is limited, the pretrain-finetune paradigm is commonly used to improve model generalization. In this paper, we propose a self-supervised method that applies Masked image modeling, Masked language modeling, Image text matching and Image text alignment via contrastive learning (M2I2) for pretraining on a medical image caption dataset, and then finetunes on downstream medical VQA tasks. The proposed method achieves state-of-the-art performance on all three public medical VQA datasets. Our code and models are available at https://github.com/pengfeiliHEU/M2I2.
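Of the four pretraining objectives named in the abstract, image text alignment via contrastive learning is the easiest to illustrate in isolation. The following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired image and caption embeddings, written in plain Python. It is not taken from the authors' repository: the function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for illustration, and the embeddings are assumed to be non-zero vectors.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(row, index):
    """Log-probability of row[index] under a softmax, computed stably."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return row[index] - log_sum


def contrastive_alignment_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical helper, not the authors' code).

    Pair i (image i, caption i) is treated as the positive; every other
    caption or image in the batch serves as a negative.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # sim[i][j]: cosine similarity of image i and caption j, divided by temperature.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-image: the same, taken over columns.
    t2i = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (i2t + t2i)


# Toy check: matched pairs should score a lower loss than swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_alignment_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = contrastive_alignment_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

In a full pretraining setup a term like this would be combined with the masked image modeling, masked language modeling, and image text matching losses; how those terms are weighted is not stated in the abstract and is left out of the sketch.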