DFDC: DeepFake Detection Challenge Dataset

Full name: The DeepFake Detection Challenge (DFDC) Dataset

Authors: Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, Cristian Canton Ferrer

Source: Facebook AI

Paper: Dolhansky et al. (2020), CVPR '20

Access: https://ai.facebook.com/datasets/dfdc

Description

The DFDC Dataset was, at its release, the largest publicly available face-swap video dataset for deepfake detection. It is designed to advance detection research through large-scale, ethically constructed video data recorded with explicit participant consent.

The dataset addresses the limitations of prior deepfake datasets (FaceForensics++, Celeb-DF, UADFV, etc.), which suffered from:
- Small number of unique identities (limiting generalization)
- Lack of explicit participant consent and agreement
- Limited representation of face-swap generation methods
- Unclear ethical status and uncertain use rights

In contrast, DFDC provides a large, consented dataset with diversity in generation methods and quality levels.

Dataset scale and composition

Metric	Value
Total videos	128,154
Unique subjects/actors	3,426 paid actors (consented)
Total clips	100,000+ face-swapped clips
Raw data volume	>25 TB
Average video length	68.88 seconds
Total footage	38.4 days
Frame resolution	1080p H.264

Dataset splits and statistics

Training set:
- 119,154 ten-second video clips
- 486 unique subjects
- 100,000 clips contain synthetic Deepfakes, about 83.9% of the set; the remaining 16.1% are real
- Generation methods: DFAE, MM/NN, NTH, FSGAN, StyleGAN
- No augmentations applied

Validation set (public test):
- 4,000 ten-second video clips
- 50% Deepfakes, 50% real videos
- 214 unique subjects
- Generation methods not seen in training; no augmentations applied
- Used for the public leaderboard of the Kaggle competition

Test set (private, final evaluation):
- 10,000 ten-second video clips
- 50% Deepfakes, 50% real videos
- 260 unique subjects (half never seen in training)
- Augmentations applied to approximately 79% of videos
- Used for final Kaggle competition scoring
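
The split figures above imply that the training set is heavily skewed toward fakes (100,000 of 119,154 clips), while both evaluation sets are balanced. The following is a minimal sketch of that arithmetic. The `real_class_weight` helper is a hypothetical illustration of one common way to compensate for the skew during training, not something the paper prescribes.

```python
# Split sizes and fake counts as quoted in the list above.
SPLITS = {
    "train": {"clips": 119_154, "fake": 100_000},
    "validation": {"clips": 4_000, "fake": 2_000},  # 50% of 4,000
    "test": {"clips": 10_000, "fake": 5_000},       # 50% of 10,000
}


def fake_fraction(split):
    """Share of clips in a split that are Deepfakes."""
    return split["fake"] / split["clips"]


def real_class_weight(split):
    """Hypothetical loss weight for real clips that gives both classes equal total weight."""
    real = split["clips"] - split["fake"]
    return split["fake"] / real


for name, split in SPLITS.items():
    print(f"{name}: {fake_fraction(split):.1%} fake, "
          f"real-class weight {real_class_weight(split):.2f}")
```

A detector tuned on the skewed training distribution may need threshold recalibration before it is scored on the balanced validation and test sets.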

Deepfake generation methods

The dataset includes videos created with multiple face-swap and generative methods:

DFAE (Deep Face Auto-Encoder): Standard convolutional autoencoder architecture with separate decoders per identity; pixel-shuffling operators for spatial dimension alignment; trained on in-distribution studio video data.
MM/NN (Morphable Model / Nearest Neighbor): Facial landmarks extracted via manual annotation; nearest-neighbor face-morphing to match target identity landmarks; post-processing with spherical harmonics to transfer illumination for realistic shading.
NTH (Neural Talking Head): Meta-parameter learning on low-resolution input (128×128); replicates talking head models trained on limited data; NTH training requires minimal GPU resources.
FSGAN (Face-Swapping Generative Adversarial Network): Fully described in the cited work; uses GANs for pose/expression variation adaptation; adversarial loss on appearance and technical quality.
StyleGAN-based refinement: Leverages StyleGAN architecture for high-fidelity face generation; applied selectively to post-process deepfake output, increasing perceptual quality.
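
The DFAE entry mentions pixel-shuffling operators. As an illustration only, here is a pure-Python sketch of the standard sub-pixel rearrangement, which trades channels for spatial resolution. The channel ordering follows the convention used by common deep learning libraries, which is an assumption here; the layer configuration actually used for DFDC is not specified in this page.

```python
def pixel_shuffle(channels, r):
    """Rearrange (C*r*r, H, W) nested lists into (C, H*r, W*r).

    Input channel c*r*r + i*r + j supplies the output pixel at
    row h*r + i, column w*r + j of output channel c.
    """
    if len(channels) % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    out_c = len(channels) // (r * r)
    h, w = len(channels[0]), len(channels[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(out_c)]
    for c in range(out_c):
        for i in range(r):
            for j in range(r):
                src = channels[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out


# Four 1x2 channels become one 2x4 channel when r = 2.
x = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
y = pixel_shuffle(x, 2)
print(y)
```

In a decoder, a convolution typically produces the `C*r*r` channels first, and the shuffle then upsamples by a factor of `r` without interpolation.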

Video recording and augmentation

Recording conditions:
- 3,426 subjects filmed in natural settings (outdoor and indoor)
- High-resolution cameras
- Pre-processing: face tracking and alignment
- Raw video resolution: HD; downsampled to 256×256 pixels for deepfake generation

Augmentation strategy:
- Distractors: overlaid images, shapes, and text (social media logos, dog/flower crown filters)
- Augmenters: geometric transforms, color adjustments, frame-rate changes, grayscale conversion, horizontal flipping, audio removal, additive noise, encoder quality reduction, rotation, image overlay with transparency
- Augmentation applied to ~79% of final test set videos
- Purpose: test model robustness to distribution shift without requiring large-scale computing
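
Three of the augmenters listed above (horizontal flipping, grayscale conversion, additive noise) can be sketched in pure Python as follows. This is a simplified illustration, not the pipeline used to build the test set. The frame representation, the noise level `sigma`, and the choice to apply one augmenter per video are all assumptions; only the 0.79 probability comes from the text.

```python
import random

# A frame is a list of rows; each pixel is an (r, g, b) tuple of ints in 0..255.


def horizontal_flip(frame):
    """Mirror each row left to right."""
    return [row[::-1] for row in frame]


def to_grayscale(frame):
    """Replace each pixel with its Rec. 601 luma, repeated on all three channels."""
    out = []
    for row in frame:
        new_row = []
        for r, g, b in row:
            luma = round(0.299 * r + 0.587 * g + 0.114 * b)
            new_row.append((luma, luma, luma))
        out.append(new_row)
    return out


def additive_noise(frame, rng, sigma=8.0):
    """Add Gaussian noise to every channel and clamp to the valid range."""
    def clamp(value):
        return max(0, min(255, round(value)))
    return [[tuple(clamp(c + rng.gauss(0.0, sigma)) for c in px) for px in row]
            for row in frame]


def augment_video(frames, rng, p_augment=0.79):
    """With probability p_augment, apply one randomly chosen augmenter to every frame."""
    if rng.random() >= p_augment:
        return frames, None
    name = rng.choice(["flip", "gray", "noise"])
    if name == "flip":
        return [horizontal_flip(f) for f in frames], name
    if name == "gray":
        return [to_grayscale(f) for f in frames], name
    return [additive_noise(f, rng) for f in frames], name


rng = random.Random(0)
video = [[[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (10, 20, 30)]]]  # one 2x2 frame
augmented, applied = augment_video(video, rng)
```

Passing a seeded `random.Random` keeps the augmentation choice reproducible, which helps when comparing detectors on the same perturbed clips.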

Ethical design

All videos were created with:
- Explicit informed consent from 3,426 paid actors
- Agreements authorizing use of participants' likenesses in face-manipulated videos
- Clear documentation of which videos contain manipulated faces (no deception)

The dataset is intended for research and detection only, not to deceive or to be shared on social media.

Benchmark results and findings

The paper presents two major findings from Kaggle competition submissions:

Detection is difficult: Despite state-of-the-art computer vision and deepfake detection methods, the best Kaggle submissions achieve only moderate accuracy on held-out test videos, showing that deepfake detection remains an unsolved problem.
Augmentation challenge: Models trained without augmentation show significant performance degradation when evaluated on augmented test videos, highlighting the challenge of building robust detectors that generalize to distribution shifts.
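
One simple way to quantify the augmentation finding is to compare a detector's accuracy on clean and augmented clips. The sketch below assumes a hypothetical record format (`score`, `is_fake`, `augmented`) and a 0.5 decision threshold. The records are made-up toy values that only exercise the functions; they are not results from the paper or the competition.

```python
def accuracy(records):
    """Fraction of records whose thresholded score matches the label."""
    correct = sum((r["score"] >= 0.5) == r["is_fake"] for r in records)
    return correct / len(records)


def robustness_gap(records):
    """Accuracy on clean clips minus accuracy on augmented clips."""
    clean = [r for r in records if not r["augmented"]]
    augmented = [r for r in records if r["augmented"]]
    return accuracy(clean) - accuracy(augmented)


# Toy, made-up predictions purely to exercise the functions.
records = [
    {"score": 0.9, "is_fake": True, "augmented": False},
    {"score": 0.2, "is_fake": False, "augmented": False},
    {"score": 0.4, "is_fake": True, "augmented": True},
    {"score": 0.7, "is_fake": True, "augmented": True},
]
gap = robustness_gap(records)
print(f"clean-minus-augmented accuracy gap: {gap:.2f}")
```

A large positive gap would indicate the kind of degradation described above; a gap near zero would suggest the detector is insensitive to the augmentations tested.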

Connections

Vaccari & Chadwick (2020) — Studies the psychological impact of deepfakes on trust and uncertainty; DFDC dataset enables research into detecting the videos covered in that work.
Deepfakes (topic page) — Provides broader context on deepfake detection methods and challenges; DFDC is the largest detection benchmark in this domain.
Synthetic media (topic page) — Places deepfakes within the broader context of AI-generated content.
Fagni et al. (2020) — TweepFake — Parallel detection benchmark for machine-generated text; similar challenge structure and goal.

Prior deepfake datasets, for comparison:
FaceForensics++ (FF++): 1,000 videos, 500K frames; dominated by YouTube sources with potential consent issues
Celeb-DF: 5,639 fake videos out of 6,229 total (590 real); limited to celebrity subjects; less diverse generation methods
Google DFD Dataset: over 3,000 fake videos generated from 363 real videos; limited subject diversity
UADFV: 49 fake, 49 real videos; first-generation dataset; too small for learning

Limitations and bias

Source data: All subjects were consenting paid actors recorded specifically for the dataset, so it does not cover the full diversity of real-world video production (news broadcasts, surveillance, mobile video)
Augmentation: Represents only a subset of possible post-processing and corruption scenarios (compression, transcoding, display artifacts)
Generation methods: Limited to methods available as of 2020; newer face-swap and diffusion-based methods not represented
Scale concern: With ~128K videos and 3.4K identities, the average identity has ~37 videos; may lead to overfitting on identity features rather than forgery artifacts
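
A common way to probe the identity-overfitting concern is to evaluate on subjects that never appear in training, as the test set already does for half of its subjects. The sketch below shows a subject-disjoint split over a hypothetical index of `(clip_id, subject_id)` pairs. The index format and the function name are assumptions for illustration, not part of the dataset's tooling.

```python
import random
from collections import defaultdict


def subject_disjoint_split(clips, holdout_fraction, seed=0):
    """Split (clip_id, subject_id) pairs so that no subject appears on both sides."""
    by_subject = defaultdict(list)
    for clip_id, subject_id in clips:
        by_subject[subject_id].append(clip_id)
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    n_holdout = max(1, round(len(subjects) * holdout_fraction))
    holdout_subjects = set(subjects[:n_holdout])
    train, holdout = [], []
    for subject_id, clip_ids in by_subject.items():
        (holdout if subject_id in holdout_subjects else train).extend(clip_ids)
    return train, holdout, holdout_subjects


# Average videos per identity from the figures quoted above.
avg_videos_per_identity = 128_154 / 3_426

# Hypothetical toy index: 6 subjects with 3 clips each.
clips = [(f"clip_{s}_{k}", f"subject_{s}") for s in range(6) for k in range(3)]
train, holdout, holdout_subjects = subject_disjoint_split(clips, holdout_fraction=0.33)
```

If accuracy on held-out subjects is much lower than on seen subjects, the detector is likely relying on identity cues rather than forgery artifacts.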

Usage and access

The DFDC dataset is hosted on the Facebook AI website and requires registration. Data is released for research and detection purposes only. The paper includes instructions for downloading, decompressing, and organizing the dataset.

Notes

The DFDC Dataset contributes to deepfake detection research by providing (1) a large dataset built with participant consent, (2) explicit diversity in face-swap generation methods, (3) a public Kaggle competition for benchmarking, and (4) a systematic treatment of augmentation robustness. However, it contains only footage of paid actors recorded for the dataset and may not fully capture the characteristics of adversarial deepfakes designed to evade detection in unconstrained settings.