The Creation and Detection of Deepfakes: A Survey

Authors: Yisroel Mirsky, Wenke Lee
Venue: ACM Computing Surveys, Vol. 1, No. 1, Article 1 (January 2020)
DOI: 10.1145/3425780
arXiv: 2004.11138

TL;DR

Comprehensive 38-page survey covering both the creation and detection of deepfakes. Systematically reviews the generative neural network architectures involved (GANs, autoencoders, CNNs, RNNs) and the technical approaches to reenactment, replacement, editing, and synthesis. Catalogs both artifact-specific detection methods (blending artifacts, forensics) and undirected approaches (classification, anomaly detection). Frames the field as an arms race between increasingly sophisticated generation and detection, and identifies current limitations and future research directions.

Contributions

  • Unified taxonomy of deepfake creation methods: reenactment (expression/mouth/pose/gaze/body manipulation), replacement (face swapping, transfer), editing (attribute/content modification), and synthesis (from-scratch generation)
  • Technical foundations systematically explaining GANs, VAEs, encoder-decoders, and other generative architectures used in deepfakes
  • Comprehensive detection methods survey categorizing artifact-specific approaches (blending, spatial, environmental, forensic) and undirected classifiers (CNN, anomaly)
  • Challenge identification: generalization across identities/conditions, paired training requirements, identity leakage, occlusion handling, temporal coherence
  • Current state of deepfake technologies and their practical limitations (quality/speed tradeoffs, execution complexity)
  • Prevention and mitigation strategies including data provenance, counter-attacks via adversarial perturbations, and proactive defense through crypto/blockchain approaches

Method

The survey organizes creation through attack models and generation techniques:

Reenactment: A source identity or signal drives the target's face or body attributes. Expression reenactment manipulates the eye, mouth, and face regions; mouth reenactment drives the mouth from audio signals or speech; pose, gaze, and body reenactment manipulate spatial characteristics. Methods range from 3D morphable models to deep learning approaches (CycleGAN, pix2pix variants, recurrent networks for temporal coherence).

Replacement: Identity is swapped while preserving context. Transfer involves face transfer across outfits; Swap (face-swap) replaces identity while keeping expression. Technical approaches use encoder-decoder networks with identity disentanglement, variational autoencoders for attribute separation, and adversarial training to prevent identity leakage.

Editing & Synthesis: Content is added/altered/removed independently, or faces synthesized de novo. Editing modifies attributes (age, hair, ethnicity) or removes objects. Synthesis creates novel faces using StyleGAN, ProGAN, or other generative models. Approaches include facial action coding systems (FACS), 3D morphable models, and learned intermediate representations.

Technical background: Networks are constructed from building blocks: encoder-decoder networks, GANs (vanilla, conditional, progressive), and recurrent architectures (LSTM, GRU). They are trained with loss functions that measure discrepancy at multiple levels: pixel-level (L1, L2), perceptual (VGG feature maps), adversarial (discriminator feedback), and cycle-consistency.
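
As a rough illustration of how such loss terms can be combined, the sketch below computes pixel-level L1/L2, a cycle-consistency term, and a non-saturating adversarial term on plain Python lists. The function names, the toy forward/backward mappings, and the weights are assumptions made for illustration and are not taken from the survey. A perceptual term is omitted because it would need a pretrained feature network.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def l1_loss(pred: Vector, target: Vector) -> float:
    """Mean absolute error between two equally sized vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cycle_consistency_loss(
    x: Vector,
    forward: Callable[[Vector], Vector],
    backward: Callable[[Vector], Vector],
) -> float:
    """L1 distance between x and its round trip backward(forward(x))."""
    return l1_loss(backward(forward(x)), x)


def adversarial_generator_loss(d_fake: float, eps: float = 1e-12) -> float:
    """Non-saturating generator loss: -log D(G(z)).

    d_fake is the discriminator's probability that the fake is real.
    """
    return -math.log(max(d_fake, eps))


def total_generator_loss(
    pred: Vector,
    target: Vector,
    forward: Callable[[Vector], Vector],
    backward: Callable[[Vector], Vector],
    d_fake: float,
    w_pixel: float = 1.0,   # hypothetical weight
    w_cycle: float = 10.0,  # hypothetical weight
    w_adv: float = 1.0,     # hypothetical weight
) -> float:
    """Weighted sum of pixel, cycle-consistency, and adversarial terms."""
    return (
        w_pixel * l1_loss(pred, target)
        + w_cycle * cycle_consistency_loss(target, forward, backward)
        + w_adv * adversarial_generator_loss(d_fake)
    )
```

In practice the weights are tuned per method; the values above are placeholders that only show where each term enters the sum.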

Detection

The survey catalogs two broad detection strategies:

Artifact-specific approaches exploit imperfections in deepfakes:
  • Blending artifacts (spatial): discontinuities at face boundaries detected via edge detectors, frequency analysis, or neural learning
  • Forensic artifacts (spatial): GANs leave fingerprints in the frequency domain, unique sensor noise patterns; analyzed via Laplacian pyramids, frequency-domain anomalies
  • Physiological signals (temporal): absent eye blinking, irregular heart rate variations, inconsistent head poses
  • Behavioral anomalies (temporal): lip-sync mismatches, unnatural facial motion, temporal flickering; detected via RNN/LSTM on frame sequences, optical flow prediction, synchronization metrics
  • Coherence artifacts (temporal): global inconsistencies in background, lighting, or head position during scene transitions; detected via spatial-temporal networks
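
To make the physiological-signal idea concrete, the following minimal sketch flags a clip whose blink rate is implausibly low. It assumes a per-frame "eye openness" score is already available from some upstream landmark detector. The function names and all thresholds are hypothetical and are not values reported in the survey.

```python
from typing import Sequence


def count_blinks(
    openness: Sequence[float],
    closed_threshold: float = 0.2,  # hypothetical: below this the eye counts as closed
    min_closed_frames: int = 2,     # hypothetical: shortest run treated as a blink
) -> int:
    """Count runs of consecutive 'closed' frames that are long enough to be a blink."""
    blinks = 0
    run = 0
    for value in openness:
        if value < closed_threshold:
            run += 1
        else:
            if run >= min_closed_frames:
                blinks += 1
            run = 0
    if run >= min_closed_frames:  # clip ends while the eye is still closed
        blinks += 1
    return blinks


def blinks_per_minute(openness: Sequence[float], fps: float) -> float:
    """Blink rate normalized by the clip duration."""
    minutes = len(openness) / fps / 60.0
    return count_blinks(openness) / minutes


def low_blink_rate(
    openness: Sequence[float],
    fps: float,
    min_rate: float = 2.0,  # hypothetical lower bound, blinks per minute
) -> bool:
    """True when the clip blinks less often than the assumed lower bound."""
    return blinks_per_minute(openness, fps) < min_rate
```

A check like this targets one specific artifact, so it says nothing about fakes that reproduce natural blinking.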

Undirected approaches train classifiers without specifying artifacts:
  • Classification: CNNs (XceptionNet, EfficientNet) on individual frames; Siamese networks contrasting faces; hierarchical memory networks across temporal windows; ensemble methods combining multiple classifiers
  • Anomaly detection: Training on "normal" (real) data and flagging statistical outliers; one-class VAEs; reconstruction error-based scoring; attribute-based confidence metrics
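
The anomaly-detection idea of fitting only on real data and flagging outliers can be sketched as follows. This toy version replaces a one-class VAE with simple per-feature statistics, purely for illustration; the feature vectors, function names, and the 0.95 quantile are assumptions.

```python
import statistics
from typing import List, Sequence, Tuple

Model = Tuple[List[float], List[float]]  # (per-feature means, per-feature std devs)


def fit_real(real_samples: Sequence[Sequence[float]]) -> Model:
    """Estimate per-feature mean and standard deviation from real data only."""
    columns = list(zip(*real_samples))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 when a feature is constant, to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def anomaly_score(x: Sequence[float], model: Model) -> float:
    """Mean squared z-score: larger means further from the real-data statistics."""
    means, stds = model
    return statistics.fmean(((v - m) / s) ** 2 for v, m, s in zip(x, means, stds))


def fit_threshold(
    real_samples: Sequence[Sequence[float]],
    model: Model,
    quantile: float = 0.95,  # hypothetical operating point
) -> float:
    """Pick the score at the given quantile of the real training data."""
    scores = sorted(anomaly_score(x, model) for x in real_samples)
    index = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[index]


def is_outlier(x: Sequence[float], model: Model, threshold: float) -> bool:
    """Flag a sample whose score exceeds the threshold learned from real data."""
    return anomaly_score(x, model) > threshold
```

Because no fake samples are needed for fitting, this style of detector does not depend on which generation method produced the fake, at the cost of false alarms on unusual but genuine inputs.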

Connections

Notes

Strengths:
  • Comprehensive coverage of both creation and detection with equal technical depth
  • Clear taxonomy and schematics of neural network architectures
  • Early survey (2020) capturing the state of the field before rapid 2021+ advances in face synthesis quality
  • Identifies open challenges (generalization, paired training, temporal coherence) that remain relevant

Weaknesses/Gaps:
  • Limited discussion of face-forensics datasets (FaceForensics++ mentioned but not deeply analyzed)
  • Detection methods surveyed are relatively shallow: limited coverage of state-of-the-art ensemble approaches or transformer-based detectors
  • Deepfakes in audio/speech and non-facial domains covered briefly; primarily faces
  • No systematic evaluation/ranking of detection methods by performance across datasets
  • Published early in the deepfake timeline: methods and datasets have proliferated significantly since 2020

Relevance to fake news wiki: This survey is central to understanding deepfakes as a misinformation and disinformation vector. While the wiki previously covered deepfake detection papers and individual datasets, this combined creation and detection survey provides the systematic foundation for understanding how adversaries create convincing synthetic media and how defenders can detect it. It is critical reading for researchers building countermeasures and for policymakers who need to understand the technology's capabilities and limitations.