VidTIMIT Deepfake Dataset¶

Full name: The first publicly available database of GAN-based Deepfake videos

Authors: Pavel Korshunov, Sébastien Marcel (Idiap Research Institute)

Source: Idiap Research Institute

Paper: Korshunov & Marcel (2018)

Access: https://www.idiap.ch/dataset/deepfaketimit

Description¶

The VidTIMIT Deepfake dataset (distributed by Idiap under the name DeepfakeTIMIT) is the first publicly available database of videos with faces swapped using a GAN-based approach. It was created to support the development and evaluation of Deepfake detection methods at a time when no GAN-generated Deepfake videos were publicly available for research.

Dataset scale and composition¶

Metric	Value
Total videos	620 (as reported; 2 quality versions × 320 videos would give 640)
Subject pairs	16 pairs manually selected for visual similarity
Total subjects	32 (from 43 in VidTIMIT database)
Videos per subject	10 (from VidTIMIT)
Quality versions	2 (low quality, high quality)
Videos per quality	320
Face resolution (LQ)	64×64 pixels
Face resolution (HQ)	128×128 pixels

Quality versions and generation parameters¶

Low Quality (LQ) Deepfakes:
- Input/output size: 64×64 facial regions
- Training data: ~200 frames per subject extracted at 4 fps
- Training iterations: 100,000
- Training time: ~4 hours per model on Tesla P40 GPU
- Blending: CNN-based face segmentation mask + histogram normalization
- Visual quality: Noticeably synthetic; easier to detect

High Quality (HQ) Deepfakes:
- Input/output size: 128×128 facial regions
- Training data: ~400 frames per subject extracted at 8 fps
- Training iterations: 200,000
- Training time: ~12 hours per model on Tesla P40 GPU
- Blending: Facial landmarks alignment + histogram normalization
- Visual quality: Highly realistic; detection significantly harder
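The training-frame counts above follow from the sampling rate and the amount of footage per subject. Below is a minimal sketch of that relationship. It assumes 25 fps source video and clips of about 5 seconds (125 frames); both values are illustrative assumptions, not figures stated on this page. `sampled_frame_indices` is a hypothetical helper and is not part of the dataset's released code.

```python
def sampled_frame_indices(n_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Return indices of the frames to keep when subsampling a clip to target_fps."""
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    if target_fps >= native_fps:
        # Cannot sample faster than the source; keep every frame.
        return list(range(n_frames))
    step = native_fps / target_fps  # source frames per kept frame
    indices = []
    k = 0
    while int(k * step) < n_frames:
        indices.append(int(k * step))
        k += 1
    return indices


# Illustrative assumptions (not from the dataset documentation).
NATIVE_FPS = 25.0
FRAMES_PER_CLIP = 125      # about 5 seconds at 25 fps
CLIPS_PER_SUBJECT = 10     # videos per subject, as listed in the table

for label, fps in (("LQ", 4.0), ("HQ", 8.0)):
    per_clip = len(sampled_frame_indices(FRAMES_PER_CLIP, NATIVE_FPS, fps))
    print(label, "frames per subject:", per_clip * CLIPS_PER_SUBJECT)
```

Under these assumptions the sketch yields 200 frames per subject at 4 fps and 400 at 8 fps, in line with the approximate figures listed for the two models.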

Generation methodology¶

The dataset was generated with a GAN-based face-swapping algorithm built on publicly available code that follows the autoencoder-based Deepfake approach. For each of the 16 subject pairs, videos were generated in both directions (subject 1→2 and subject 2→1), so every subject serves as both source and target. Audio tracks were left unchanged (no speech synthesis).
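The bidirectional generation scheme can be sketched as a simple enumeration of swap jobs. This is only an illustration of the bookkeeping: the subject identifiers, the `face_from` / `video_of` field names, and the `swap_jobs` helper are hypothetical and do not reflect the naming used in the released files.

```python
def swap_jobs(pairs, quality, videos_per_subject=10):
    """List every face-swap job for one quality version.

    Each pair (a, b) is processed in both directions: the face of `a` is
    placed onto the videos of `b`, and the face of `b` onto the videos of `a`.
    """
    jobs = []
    for a, b in pairs:
        for face_from, video_of in ((a, b), (b, a)):
            for video_index in range(videos_per_subject):
                jobs.append(
                    {
                        "quality": quality,
                        "face_from": face_from,
                        "video_of": video_of,
                        "video_index": video_index,
                    }
                )
    return jobs


# Hypothetical placeholder identifiers for the 16 manually selected pairs.
pairs = [(f"subj{2 * i:02d}", f"subj{2 * i + 1:02d}") for i in range(16)]

jobs_lq = swap_jobs(pairs, "LQ")
jobs_hq = swap_jobs(pairs, "HQ")
print(len(jobs_lq), len(jobs_hq))
```

Each quality version comes out at 16 pairs × 2 directions × 10 videos = 320 jobs, which matches the "Videos per quality" row of the table.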

Subject selection: All 16 pairs were manually selected from the 43 VidTIMIT subjects for visual similarity in prominent features (such as mustaches and hair styles), which makes the face swaps more visually plausible and, in turn, harder to detect.

Evaluation setup¶

Face Recognition Vulnerability Assessment:
- Tested on state-of-the-art VGG and FaceNet systems
- Protocol: 2 genuine videos per subject for enrollment and 8 for probes (licit scenario); 10 Deepfake videos per subject as probes (tampered scenario)
- Metric: False Acceptance Rate (FAR) on the tampered scenario, measured at the Equal Error Rate (EER) threshold computed on the licit scenario
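The metric above can be sketched in a few lines: the threshold is fixed where FAR and FRR are closest on the licit scores, and that same threshold is then applied to the Deepfake probes. This is a minimal illustration, assuming similarity scores where higher means a closer match; the function names and the toy scores are invented for the example and are not the paper's data or code.

```python
def far(impostor_scores, threshold):
    """Fraction of impostor (or Deepfake) probes accepted at this threshold."""
    return sum(s >= threshold for s in impostor_scores) / len(impostor_scores)


def frr(genuine_scores, threshold):
    """Fraction of genuine probes rejected at this threshold."""
    return sum(s < threshold for s in genuine_scores) / len(genuine_scores)


def eer_threshold(genuine_scores, impostor_scores):
    """Pick the observed score at which FAR and FRR are closest (licit scenario)."""
    candidates = sorted(set(genuine_scores) | set(impostor_scores))
    return min(
        candidates,
        key=lambda t: abs(far(impostor_scores, t) - frr(genuine_scores, t)),
    )


# Toy scores, invented for illustration only.
licit_genuine = [0.9, 0.8, 0.85, 0.7, 0.95]
licit_impostor = [0.1, 0.2, 0.3, 0.75, 0.15]
deepfake_probes = [0.88, 0.92, 0.6, 0.81, 0.79]

threshold = eer_threshold(licit_genuine, licit_impostor)
print("EER threshold:", threshold)
print("FAR on Deepfake probes:", far(deepfake_probes, threshold))
```

A high FAR on the Deepfake probes means the recognition system accepts the swapped face as the enrolled identity, which is the vulnerability the assessment measures.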

Deepfake Detection Evaluation:
- Binary classification task: original vs. Deepfake
- Dataset split: Train/test split arranged so that the same subject does not appear in both sets
- Metrics: Equal Error Rate (EER) and False Rejection Rate (FRR) at FAR = 10%
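A subject-disjoint split and the FRR at a fixed FAR can be sketched as follows. The sketch makes several assumptions: each video is mapped to a group identifier (a subject, or a subject pair for swapped videos), detector scores are higher for videos judged more likely genuine, and FAR counts Deepfakes accepted as genuine. The helper names, the 50/50 split, and the toy data are hypothetical and are not taken from the paper's protocol files.

```python
import random


def subject_disjoint_split(video_groups, test_fraction=0.5, seed=0):
    """Split videos so that no group (subject) appears in both train and test.

    video_groups maps a video id to its group id.
    """
    groups = sorted(set(video_groups.values()))
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [v for v, g in video_groups.items() if g not in test_groups]
    test = [v for v, g in video_groups.items() if g in test_groups]
    return train, test


def frr_at_far(genuine_scores, fake_scores, target_far=0.10):
    """Return (threshold, FRR) at the lowest threshold whose FAR <= target_far."""
    candidates = sorted(set(genuine_scores) | set(fake_scores))
    candidates.append(candidates[-1] + 1.0)  # above every score, so FAR is 0
    for t in candidates:
        current_far = sum(s >= t for s in fake_scores) / len(fake_scores)
        if current_far <= target_far:
            current_frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
            return t, current_frr


# Toy example: 4 subjects with 3 videos each.
videos = {f"s{s}_v{v}": f"s{s}" for s in range(4) for v in range(3)}
train, test = subject_disjoint_split(videos)
print(len(train), "train videos,", len(test), "test videos")
```

Splitting by group rather than by video keeps a detector from scoring well simply because it has already seen the same face during training.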

Key findings from the paper¶

Face Recognition Vulnerability: The VGG and FaceNet systems, which achieve 98.95% and 99.63% accuracy on the LFW dataset respectively, fail dramatically on Deepfakes:
- VGG: 85.62% FAR on HQ Deepfakes
- FaceNet: 95.00% FAR on HQ Deepfakes
- Paradoxically, the more accurate system (FaceNet) is also the more vulnerable one

Detection Performance:
- Audio-visual lip-sync approach: 41.8% EER on LQ; fails completely on HQ
- Image quality metrics (IQM) + SVM: 3.33% EER on LQ, 8.97% EER on HQ
- These results indicate that lip movements in the generated videos stay well synchronized with the audio, so lip-sync inconsistency is a weak cue, and that detection becomes harder as Deepfake quality increases

Limitations¶

Scale: 620 videos from 32 subjects is relatively small by modern standards
Subject selection: Manually selected similar-looking pairs; doesn't represent arbitrary face-swap scenarios
Generation methods: Single GAN architecture; doesn't cover other manipulation approaches (e.g., Face2Face facial reenactment)
Compression: Videos were not subjected to social media compression, which affects detection in real-world scenarios
Resolution: Maximum 128×128; lower than what modern face-swap methods produce
Source data: VidTIMIT was recorded in a controlled environment (people facing the camera, predetermined phrases); doesn't cover diverse video production conditions

Historical significance¶

As the first public GAN-based Deepfake dataset, VidTIMIT-Deepfake was instrumental in establishing a benchmark for the field. It demonstrated that:
- State-of-the-art face recognition systems are vulnerable to deepfakes
- Visual quality is a critical factor in detectability
- Audio-visual synchronization alone is insufficient for detection
- An arms race exists between generation quality and detection capability

Later datasets expanded on it in scale and variety:
FaceForensics++ (2019): 1,000 original videos, each manipulated with four methods including deepfakes; much larger and more comprehensive
Celeb-DF (2019): 5,639 fake videos of celebrities; higher visual quality than VidTIMIT-Deepfake
DFDC (2020): 128,154 videos from 3,426 actors; the largest public deepfake detection benchmark at its release, and far more diverse

Connections¶

DeepFakes: a New Threat to Face Recognition? Assessment and Detection — The paper introducing this dataset
Deepfakes (topic) — Broader context on deepfake detection and vulnerability
Synthetic media detection (topic) — Detection methods and challenges
Face recognition (topic) — Vulnerability of face recognition systems

Access and usage¶

The dataset is freely available from the Idiap Research Institute website. A complete Python implementation and benchmark scores are provided as open source.

Notes¶

VidTIMIT-Deepfake occupies a unique historical position: it was the first dataset to publicly release GAN-generated deepfakes and comprehensively evaluate their impact on face recognition systems. While newer, larger datasets have superseded it in scale and diversity, the insights from this paper about the arms race between generation quality and detection difficulty remain foundational.