Sniffer: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection

Abstract

Misinformation is a prevalent societal issue due to its potential high risks. Out-Of-Context (OOC) misinformation, where authentic images are repurposed with false text, is one of the easiest and most effective ways to mislead audiences. Current methods focus on assessing image-text consistency but lack convincing explanations for their judgments, which are essential for debunking misinformation. While Multimodal Large Language Models (MLLMs) have rich knowledge and innate capability for visual reasoning and explanation generation, they still lack sophistication in understanding and discovering the subtle cross-modal differences. In this paper, we introduce Sniffer, a novel multimodal large language model specifically engineered for OOC misinformation detection and explanation. Sniffer employs two-stage instruction tuning on InstructBLIP. The first stage refines the model's concept alignment of generic objects with news-domain entities and the second stage leverages OOC-specific instruction data generated by language-only GPT-4 to fine-tune the model's discriminatory powers. Enhanced by external tools and retrieval, Sniffer not only detects inconsistencies between text and image but also utilizes external knowledge for contextual verification. Our experiments show that Sniffer surpasses the original MLLM by over 40% and outperforms state-of-the-art methods in detection accuracy. Sniffer also provides accurate and persuasive explanations as validated by quantitative and human evaluations.
