Publication | Closed Access
Recurrent Convolutional Structures for Audio Spoof and Video Deepfake Detection
Citations: 197
References: 56
Year: 2020
Keywords: Engineering, Machine Learning, Information Forensics, Speech Recognition, Image Analysis, Data Science, Pattern Recognition, Deepfakes, Adversarial Machine Learning, Video Transformer, Visual Deepfake Detection, Computer Science, Video Understanding, Recurrent Convolutional Structures, Human Image Synthesis, Deep Learning, Computer Vision, Public Figure, Deepfake Detection, Recurrent Framework
Deepfakes, generated by generative adversarial networks, can defame public figures or sway public opinion, and the realism now achievable on ordinary desktop GPUs makes their detection increasingly vital for reporters, social media platforms, and the general public. This study introduces simple yet highly effective digital forensic methods for detecting audio spoofing and visual deepfakes. The methods employ convolutional latent representations that capture semantically rich audio and video features, which are then processed by bidirectional recurrent structures and trained with entropy‑based cost functions to identify spatial and temporal deepfake signatures. Evaluated on the FaceForensics++, Celeb‑DF, and ASVspoof 2019 datasets, the entropy‑based cost functions achieve new benchmarks both alone and combined with traditional cost functions, and extensive cross‑domain tests confirm strong generalization and provide insight into the architecture's effectiveness.
Deepfakes, or artificially generated audiovisual renderings, can be used to defame a public figure or influence public opinion. With the recent advent of generative adversarial networks, an attacker using a normal desktop computer fitted with an off-the-shelf graphics processing unit can make renditions realistic enough to easily fool a human observer. Detecting deepfakes is thus becoming important for reporters, social media platforms, and the general public. In this work, we introduce simple, yet surprisingly efficient digital forensic methods for audio spoof and visual deepfake detection. Our methods combine convolutional latent representations with bidirectional recurrent structures and entropy-based cost functions. The latent representations for both audio and video are carefully chosen to extract semantically rich information from the recordings. By feeding these into a recurrent framework, we can detect both spatial and temporal signatures of deepfake renditions. The entropy-based cost functions work well in isolation as well as in combination with traditional cost functions. We demonstrate our methods on the FaceForensics++ and Celeb-DF video datasets and the ASVspoof 2019 Logical Access audio dataset, achieving new benchmarks in all categories. We also perform extensive studies to demonstrate generalization to new domains and gain further insight into the effectiveness of the new architectures.
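To make the described pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the general idea: a per-frame convolutional encoder producing latent vectors, a bidirectional GRU over the frame sequence to capture temporal signatures, and a classification head. The module name, layer sizes, and the use of standard cross-entropy as a stand-in for the paper's entropy-based cost functions are all illustrative assumptions, not the authors' released architecture.

```python
# Hypothetical sketch (not the authors' code): CNN latents -> bidirectional GRU -> real/fake logits.
import torch
import torch.nn as nn

class RecurrentConvDetector(nn.Module):
    def __init__(self, latent_dim=128, hidden_dim=64, num_classes=2):
        super().__init__()
        # Frame-level convolutional encoder producing a latent vector per frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Bidirectional recurrence over the frame sequence captures temporal artifacts.
        self.rnn = nn.GRU(latent_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, clips):
        # clips: (batch, time, channels, height, width)
        b, t, c, h, w = clips.shape
        latents = self.encoder(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
        rnn_out, _ = self.rnn(latents)          # (batch, time, 2 * hidden_dim)
        return self.classifier(rnn_out[:, -1])  # logits: real vs. fake

# Minimal usage with dummy data; cross-entropy stands in for the entropy-based cost.
model = RecurrentConvDetector()
clips = torch.randn(4, 16, 3, 64, 64)           # 4 clips of 16 frames each
labels = torch.randint(0, 2, (4,))
loss = nn.CrossEntropyLoss()(model(clips), labels)
loss.backward()
```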