The Creation and Detection of Deepfakes: A Survey

Authors: Yisroel Mirsky, Wenke Lee
Venue: ACM Computing Surveys, Vol. 1, No. 1, Article 1 (January 2020)
DOI: 10.1145/3425780
arXiv: 2004.11138

TL;DR

A comprehensive 38-page survey of how deepfakes are created and detected. It reviews the generative neural network architectures involved (GANs, autoencoders, CNNs, RNNs) and the technical approaches to reenactment, replacement, editing, and synthesis. On the detection side, it catalogs artifact-specific methods (blending artifacts, forensics) and undirected approaches (classification, anomaly detection). It frames the field as an arms race between increasingly sophisticated generation and detection, and identifies current limitations and future research directions.

Contributions

Unified taxonomy of deepfake creation methods: reenactment (expression/mouth/pose/gaze/body manipulation), replacement (face swapping, transfer), editing (attribute/content modification), and synthesis (from-scratch generation)
Technical foundations systematically explaining GANs, VAEs, encoder-decoders, and other generative architectures used in deepfakes
Comprehensive detection methods survey categorizing artifact-specific approaches (blending, spatial, environmental, forensic) and undirected classifiers (CNN, anomaly)
Challenge identification: generalization across identities/conditions, paired training requirements, identity leakage, occlusion handling, temporal coherence
Current state of deepfake technologies and their practical limitations (quality/speed tradeoffs, execution complexity)
Prevention and mitigation strategies including data provenance, counter-attacks via adversarial perturbations, and proactive defense through crypto/blockchain approaches

Method

The survey organizes creation through attack models and generation techniques:

Reenactment: Face attributes are driven by another identity or signal. Expression reenactment uses eye/mouth/face regions; mouth reenactment drives mouth via audio signals or speech; pose/gaze/body reenactment manipulate spatial characteristics. Methods range from 3D morphable models to deep learning approaches (CycleGAN, pix2pix variants, recurrent networks for temporal coherence).

Replacement: Identity is swapped while preserving context. Transfer replaces the target's content with the source's, for example facial transfer to show a person in different outfits; Swap (face-swap) replaces identity while keeping the target's expression. Technical approaches use encoder-decoder networks with identity disentanglement, variational autoencoders for attribute separation, and adversarial training to prevent identity leakage.

Editing & Synthesis: Content is added/altered/removed independently, or faces synthesized de novo. Editing modifies attributes (age, hair, ethnicity) or removes objects. Synthesis creates novel faces using StyleGAN, ProGAN, or other generative models. Approaches include facial action coding systems (FACS), 3D morphable models, and learned intermediate representations.

Technical background: Networks are constructed from building blocks: encoder-decoder networks, GANs (vanilla, conditional, progressive), recurrent architectures (LSTM, GRU), and specialized losses (perceptual, adversarial, L1/L2). Loss functions measure discrepancy at multiple levels: pixel-level (L1, L2), perceptual (VGG feature maps), adversarial (discriminator feedback), cycle-consistency.
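
As a rough illustration of how these loss terms are typically combined into one generator objective, here is a minimal pure-Python sketch. The weights (`w_pix`, `w_perc`, `w_adv`), the toy `features` function standing in for VGG feature maps, and the flat-list image representation are all assumptions made for illustration; none of them come from the survey.

```python
import math

def l1_loss(a, b):
    """Mean absolute error between two equal-length pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def l2_loss(a, b):
    """Mean squared error between two equal-length pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def features(img):
    """Hypothetical stand-in for a pretrained feature extractor (e.g. VGG):
    here just the differences between neighbouring pixels."""
    return [img[i + 1] - img[i] for i in range(len(img) - 1)]

def perceptual_loss(generated, target):
    """L2 distance measured in feature space rather than pixel space."""
    return l2_loss(features(generated), features(target))

def adversarial_loss(d_score):
    """Non-saturating generator loss -log D(G(x)), with D's output in (0, 1)."""
    return -math.log(d_score)

def cycle_consistency_loss(original, cycled):
    """L1 distance between an input and its round-trip reconstruction."""
    return l1_loss(original, cycled)

def generator_loss(generated, target, d_score,
                   w_pix=1.0, w_perc=0.5, w_adv=0.1):
    """Weighted sum of pixel, perceptual, and adversarial terms.
    The weights are assumed values, not ones reported in the survey."""
    return (w_pix * l1_loss(generated, target)
            + w_perc * perceptual_loss(generated, target)
            + w_adv * adversarial_loss(d_score))

# Toy 4-pixel "images" and a discriminator that is undecided (score 0.5).
generated = [0.0, 0.5, 1.0, 0.5]
target = [0.0, 0.5, 1.0, 1.0]
total = generator_loss(generated, target, d_score=0.5)
```

In practice each term would be computed on image tensors by a deep learning framework, and the relative weights are tuned per method; the sketch only shows how the levels of discrepancy add up.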

Detection

The survey catalogs two broad detection strategies:

Artifact-specific approaches exploit imperfections in deepfakes:
- Blending artifacts (spatial): discontinuities at face boundaries detected via edge detectors, frequency analysis, or neural learning
- Forensic artifacts (spatial): GANs leave fingerprints in frequency domain, unique sensor noise patterns; analyzed via Laplacian pyramids, frequency-domain anomalies
- Physiological signals (temporal): absent eye blinking, irregular heart rate variations, inconsistent head poses
- Behavioral anomalies (temporal): lip-sync mismatches, unnatural facial motion, temporal flickering; detected via RNN/LSTM on frame sequences, optical flow prediction, synchronization metrics
- Coherence artifacts (temporal): global inconsistencies in background, lighting, or head position during scene transitions; detected via spatial-temporal networks
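
To make the physiological-signal idea concrete, here is a minimal sketch of a blink-rate check. It assumes a per-frame eye-openness score (for example, an eye aspect ratio from some landmark detector) is already available. The function names and threshold values are hypothetical, and this is not the method of any specific paper covered by the survey.

```python
def count_blinks(openness, closed_thresh=0.2):
    """Count open-to-closed transitions in a per-frame eye-openness series."""
    blinks = 0
    eye_closed = False
    for value in openness:
        if value < closed_thresh and not eye_closed:
            blinks += 1          # eye just closed: start of a blink
            eye_closed = True
        elif value >= closed_thresh:
            eye_closed = False   # eye is open again
    return blinks

def blink_rate_suspicious(openness, fps, min_blinks_per_min=5.0,
                          closed_thresh=0.2):
    """Flag a clip whose blink rate falls below an assumed lower bound."""
    minutes = len(openness) / fps / 60.0
    if minutes == 0:
        return False  # no frames, nothing to judge
    rate = count_blinks(openness, closed_thresh) / minutes
    return rate < min_blinks_per_min

# Toy 10-second clips at 30 fps (300 frames each).
natural = ([0.3] * 95 + [0.1] * 5) * 3   # three short eye closures
no_blink = [0.3] * 300                   # eyes never close
```

With these toy inputs, `blink_rate_suspicious(natural, fps=30)` is false and `blink_rate_suspicious(no_blink, fps=30)` is true. A real detector would need longer clips and a bound calibrated on genuine footage, since short real videos can also contain no blinks.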

Undirected approaches train classifiers without specifying artifacts:
- Classification: CNNs (XceptionNet, EfficientNet) on individual frames; Siamese networks contrasting faces; hierarchical memory networks across temporal windows; ensemble methods combining multiple classifiers
- Anomaly detection: Training on "normal" (real) data and flagging statistical outliers; one-class VAEs; reconstruction error-based scoring; attribute-based confidence metrics
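
The anomaly-detection idea (fit on real data only, then flag outliers) can be sketched as a simple z-score test on reconstruction errors. This is an illustrative simplification: the error values below are made up, the `z_thresh` cutoff is an assumption, and a real system would obtain the errors from a trained model such as a one-class VAE.

```python
import statistics

def fit_real_scores(real_errors):
    """Summarise reconstruction errors observed on real (non-fake) samples."""
    return statistics.mean(real_errors), statistics.stdev(real_errors)

def is_anomalous(error, mean, std, z_thresh=3.0):
    """Flag a sample whose error lies more than z_thresh standard
    deviations above the mean error seen on real data."""
    return (error - mean) / std > z_thresh

# Hypothetical reconstruction errors measured on real faces only.
real_errors = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.11, 0.12]
mean, std = fit_real_scores(real_errors)

typical_sample = is_anomalous(0.12, mean, std)   # within the real range
outlier_sample = is_anomalous(0.30, mean, std)   # far above the real range
```

The appeal of this design is that no fake examples are needed at training time, so the detector is not tied to the artifacts of any one generation method; the cost is sensitivity to how well the real-data statistics cover legitimate variation.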

Connections

Tolosana et al. (2020) — DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection — similar-scope survey emphasizing face-forensics datasets and detection cross-domain robustness
Rana et al. (2022) — Deepfake Detection: A Systematic Literature Review — systematic literature review of detection methods with meta-analysis of detection performance and feature types
Rössler et al. (2019) — FaceForensics++: Learning to Detect Manipulated Facial Images — foundational benchmark dataset with 1.8M+ images across four manipulation methods; key baseline for detection evaluation
Vaccari & Chadwick (2020) — Deepfakes and Disinformation: Exploring the Impact of Synthetic Political Video on Deception, Uncertainty, and Trust in News — experimental study of deepfakes' epistemic effects on political trust and information credibility
Dolhansky et al. (2020) — The DeepFake Detection Challenge (DFDC) Dataset — largest detection benchmark with 128K videos from diverse generation methods; public Kaggle competition
Yang et al. (2018) — Exposing Deep Fakes Using Inconsistent Head Poses — forensic approach exploiting geometric inconsistencies invisible to humans
Li et al. (2018) — In Ictu Oculi: Exposing AI Generated Fake Face Videos by Detecting Eye Blinking — physiological signal-based detection via absent blink patterns in training data
Deepfakes (topic) — broader deepfakes research synthesis across creation, detection, ethical implications

Notes

Strengths:
- Comprehensive coverage of both creation and detection with equal technical depth
- Clear taxonomy and schematics of neural network architectures
- Early survey (2020) capturing the state of the field before rapid 2021+ advances in face synthesis quality
- Identifies open challenges (generalization, paired training, temporal coherence) that remain relevant

Weaknesses/Gaps:
- Limited discussion of face-forensics datasets (FaceForensics++ mentioned but not deeply analyzed)
- Detection methods surveyed are relatively shallow, with limited coverage of state-of-the-art ensemble approaches or transformer-based detectors
- Deepfakes in audio/speech and non-facial domains are covered only briefly; the focus is primarily on faces
- No systematic evaluation/ranking of detection methods by performance across datasets
- Published early in the deepfake timeline; methods and datasets have proliferated significantly since 2020

Relevance to fake news wiki: This survey is central to understanding deepfakes as a vector for misinformation and disinformation. The wiki previously covered individual deepfake detection papers and datasets; this combined creation and detection survey provides the systematic foundation for understanding how adversaries create convincing synthetic media and how defenders can detect it. It is critical for researchers building countermeasures and for policymakers who need to understand the technology's capabilities and limitations.