Enhancing multimodal deepfake detection with local–global feature integration and diffusion models

Abstract

Deepfake detection has become a critical challenge with the rise of sophisticated generative techniques that manipulate audio-visual data. Existing methods primarily focus on lip movement synchronization using audio and visual features, often relying on local feature extraction with Convolutional Neural Networks (CNNs). In this work, we propose an enhanced multimodal framework that integrates local and global features for deepfake detection. Our approach extends traditional pipelines by introducing additional visual features, such as eye movement and facial regions, and combining them with audio features to model cross-modal dependencies. While CNNs capture local features, Vision Transformers (ViTs) extract global contextual relationships from both the visual and audio modalities. Diffusion models are incorporated as pre-processors to refine noisy data and generate realistic augmentations, improving the quality of the feature representations. The proposed framework achieves state-of-the-art performance, with accuracy scores of 0.9987, 0.9825, 0.9915, and 0.9812 on the FakeAVCeleb, AV-Deepfake1M, TVIL, and LAV-DF datasets, respectively. These results demonstrate significant improvements over existing methods, highlighting the framework’s superior generalization and robustness in detecting subtle inconsistencies across manipulated audio-visual data.
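
The local–global integration described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify layer sizes, fusion strategy, or the diffusion pre-processor, so everything here is an assumption made for illustration. Average pooling over small patches stands in for the CNN branch, a single-head self-attention with identity projections over scalar tokens stands in for the ViT branch, and the fusion step is assumed to be simple concatenation. The function names `local_features`, `global_features`, and `fuse` are hypothetical.

```python
import math


def local_features(frame, k=2):
    """Local-branch stand-in (assumption): average-pool non-overlapping
    k x k patches, i.e. a fixed-weight convolution with stride k."""
    h, w = len(frame), len(frame[0])
    feats = []
    for i in range(0, h - k + 1, k):
        for j in range(0, w - k + 1, k):
            patch = [frame[i + di][j + dj] for di in range(k) for dj in range(k)]
            feats.append(sum(patch) / len(patch))
    return feats


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def global_features(tokens):
    """Global-branch stand-in (assumption): single-head self-attention with
    identity query/key/value projections, so each output mixes all tokens."""
    out = []
    for q in tokens:
        weights = softmax([q * key for key in tokens])
        out.append(sum(w * v for w, v in zip(weights, tokens)))
    return out


def fuse(visual_frame, audio_tokens):
    """Concatenate local visual features with globally attended features
    computed over the joint visual + audio token sequence (assumed fusion)."""
    local = local_features(visual_frame)
    joint = local + list(audio_tokens)
    return local + global_features(joint)
```

For a 4 x 4 frame and two audio tokens, `fuse` returns four local features followed by six attended features. A real system would replace each stand-in with a trained network and feed the fused vector to a classifier.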
