VidTIMIT Deepfake Dataset

Full name: The first publicly available database of GAN-based Deepfake videos

Authors: Pavel Korshunov, Sébastien Marcel (Idiap Research Institute)

Source: Idiap Research Institute

Paper: Korshunov & Marcel (2018)

Access: https://www.idiap.ch/dataset/deepfaketimit

Description

The VidTIMIT Deepfake dataset is the first publicly available database of videos with faces swapped using GAN-based approaches. It was created to enable the development and evaluation of Deepfake detection methods, filling the gap left by the lack of publicly available GAN-generated Deepfake videos for research.

Dataset scale and composition

Metric | Value
Total videos | 620 as reported (note: 2 quality versions × 320 videos would give 640)
Subject pairs | 16 pairs, manually selected for visual similarity
Total subjects | 32 (of the 43 in the VidTIMIT database)
Videos per subject | 10 (from VidTIMIT)
Quality versions | 2 (low quality, high quality)
Videos per quality version | 320
Face resolution (LQ) | 64×64 pixels
Face resolution (HQ) | 128×128 pixels

Quality versions and generation parameters

Low Quality (LQ) Deepfakes:

  • Input/output size: 64×64 facial regions
  • Training data: ~200 frames per subject extracted at 4 fps
  • Training iterations: 100,000
  • Training time: ~4 hours per model on Tesla P40 GPU
  • Blending: CNN-based face segmentation mask + histogram normalization
  • Visual quality: Noticeably synthetic; easier to detect

High Quality (HQ) Deepfakes:

  • Input/output size: 128×128 facial regions
  • Training data: ~400 frames per subject extracted at 8 fps
  • Training iterations: 200,000
  • Training time: ~12 hours per model on Tesla P40 GPU
  • Blending: Facial landmarks alignment + histogram normalization
  • Visual quality: Highly realistic; detection significantly harder
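The two parameter sets can be written down as plain records, which makes derived quantities easy to check. The following is a minimal sketch only: the class name `GenerationConfig` and its field names are hypothetical and do not come from the dataset's released code. It shows that both settings draw on roughly the same amount of footage per subject (about 50 seconds), sampled at different rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    """Hypothetical record of the parameters listed above (not the authors' code)."""
    name: str
    face_size: int            # side length in pixels of the square face region
    frames_per_subject: int   # approximate number of training frames
    extraction_fps: int       # frame sampling rate used for extraction
    iterations: int           # training iterations
    train_hours: float        # approximate training time per model (Tesla P40)

    def footage_seconds(self) -> float:
        # Seconds of source video implied by the frame count and sampling rate
        return self.frames_per_subject / self.extraction_fps


LQ = GenerationConfig("LQ", 64, 200, 4, 100_000, 4.0)
HQ = GenerationConfig("HQ", 128, 400, 8, 200_000, 12.0)

for cfg in (LQ, HQ):
    print(cfg.name, cfg.footage_seconds(), "seconds of footage per subject")
```

Because the frame counts and training times are approximate in the source, the derived values are approximate as well.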

Generation methodology

The videos were generated with a GAN-based face-swapping algorithm built on publicly available code (derived from the autoencoder-based Deepfake approach). For each of the 16 subject pairs, videos were generated in both directions (subject 1→2 and subject 2→1), creating bidirectional swaps. Audio tracks were left unchanged (no speech synthesis).

Subject selection: All 16 pairs were manually selected from the 43 VidTIMIT subjects based on visual similarity in prominent features (mustaches, hair styles) to increase the visual plausibility of face-swaps and make generation more challenging.
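The bidirectional design determines the video counts, and the arithmetic can be sketched in a few lines. The subject IDs below are placeholders, not real VidTIMIT identifiers. With 16 pairs, two directions, and 10 videos per subject, this yields 320 videos per quality version and 640 across both, which is worth comparing against the reported total of 620.

```python
from itertools import permutations

NUM_PAIRS = 16
VIDEOS_PER_SUBJECT = 10
QUALITIES = ("LQ", "HQ")

# Placeholder subject IDs, two per pair (assumption for illustration only)
pairs = [(f"subj{2 * i:02d}", f"subj{2 * i + 1:02d}") for i in range(NUM_PAIRS)]

# Each pair (a, b) produces two directed swaps: a->b and b->a
directed_swaps = [swap for pair in pairs for swap in permutations(pair, 2)]

videos_per_quality = len(directed_swaps) * VIDEOS_PER_SUBJECT
total_videos = videos_per_quality * len(QUALITIES)

print(len(directed_swaps), videos_per_quality, total_videos)
```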

Evaluation setup

Face Recognition Vulnerability Assessment:

  • Tested on state-of-the-art VGG and FaceNet systems
  • Protocol: 2 genuine videos per subject for enrollment, 8 for probes (licit scenario); 10 Deepfake videos per subject as probes (tampered scenario)
  • Metric: False Acceptance Rate (FAR) at Equal Error Rate (EER) threshold from licit scenario
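The metric in this protocol can be illustrated with a small sketch: the decision threshold is fixed at the EER point of the licit scenario, and the same threshold is then applied to the Deepfake probes. All scores below are made up, the function names are hypothetical, and the code assumes that a higher score means a closer match. It is not the paper's evaluation code.

```python
def far_frr(genuine, impostor, threshold):
    """Return (FAR, FRR) when scores >= threshold are accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def eer_threshold(genuine, impostor):
    """Pick the observed score at which FAR and FRR are closest to each other."""
    def gap(t):
        far, frr = far_frr(genuine, impostor, t)
        return abs(far - frr)
    return min(sorted(set(genuine) | set(impostor)), key=gap)


# Made-up similarity scores for illustration only
licit_genuine = [0.91, 0.88, 0.95, 0.79, 0.85, 0.93, 0.90, 0.60]
licit_impostor = [0.20, 0.35, 0.15, 0.42, 0.30, 0.25, 0.65, 0.10]
deepfake_probes = [0.82, 0.77, 0.90, 0.58, 0.86, 0.71, 0.88, 0.80]

# Threshold comes from the licit scenario only
threshold = eer_threshold(licit_genuine, licit_impostor)
licit_far, licit_frr = far_frr(licit_genuine, licit_impostor, threshold)

# Tampered scenario: share of Deepfake probes accepted at that threshold
deepfake_far = sum(s >= threshold for s in deepfake_probes) / len(deepfake_probes)

print(threshold, licit_far, licit_frr, deepfake_far)
```

With these toy numbers the system looks accurate on licit data (12.5% error at the EER point) yet accepts most Deepfake probes, which mirrors the kind of gap the protocol is designed to expose.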

Deepfake Detection Evaluation:

  • Binary classification task: original vs. Deepfake
  • Dataset split: Train/test split arranged so that the same subject does not appear in both
  • Metrics: Equal Error Rate (EER) and FRR at FAR=10%
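A subject-disjoint split and the FRR at FAR=10% metric can both be sketched briefly. This is an illustrative sketch under stated assumptions: each sample carries a single hypothetical `subject` key (a real Deepfake video involves two subjects, so an actual split would need to account for both), higher scores mean "more likely genuine", FAR is taken as the fraction of Deepfakes accepted as genuine, and all data are made up.

```python
import random


def subject_disjoint_split(samples, test_fraction=0.5, seed=0):
    """Split samples so that no subject ID occurs in both train and test."""
    subjects = sorted({s["subject"] for s in samples})
    random.Random(seed).shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    held_out = set(subjects[:n_test])
    train = [s for s in samples if s["subject"] not in held_out]
    test = [s for s in samples if s["subject"] in held_out]
    return train, test


def frr_at_far(genuine_scores, fake_scores, target_far=0.10):
    """Lowest FRR over observed thresholds whose FAR does not exceed target_far."""
    best = 1.0  # rejecting everything always satisfies the FAR target
    for t in sorted(set(genuine_scores) | set(fake_scores)):
        far = sum(s >= t for s in fake_scores) / len(fake_scores)
        if far <= target_far:
            frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
            best = min(best, frr)
    return best


# Hypothetical samples: 8 subjects, one real and one fake video each
samples = [{"subject": f"s{i:02d}", "label": label}
           for i in range(8) for label in ("real", "fake")]
train, test = subject_disjoint_split(samples)
train_subjects = {s["subject"] for s in train}
test_subjects = {s["subject"] for s in test}

# Made-up detector scores
genuine_scores = [0.90, 0.80, 0.70, 0.60, 0.95, 0.85, 0.75, 0.65, 0.55, 0.40]
fake_scores = [0.10, 0.20, 0.30, 0.15, 0.25, 0.35, 0.05, 0.45, 0.50, 0.70]
frr = frr_at_far(genuine_scores, fake_scores)

print(len(train), len(test), frr)
```

Splitting by subject rather than by video keeps a detector from scoring well simply by memorizing the appearance of individual people.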

Key findings from the paper

  1. Face Recognition Vulnerability: VGG and FaceNet systems, which achieve 98.95% and 99.63% accuracy on the LFW dataset respectively, fail dramatically on Deepfakes:
     • VGG: 85.62% FAR on HQ Deepfakes
     • FaceNet: 95.00% FAR on HQ Deepfakes
     • Paradoxically, the more advanced system (FaceNet) is the more vulnerable one
  2. Detection Performance:
     • Audio-visual lip-sync approach: 41.8% EER on LQ, completely fails on HQ
     • Image quality metrics (IQM) + SVM: 3.33% EER on LQ, 8.97% EER on HQ
     • These results indicate that the generated lip movements are realistic enough to defeat lip-sync checks, and that detection becomes harder as Deepfake quality increases

Limitations

  • Scale: 620 videos from 32 subjects is relatively small by modern standards
  • Subject selection: Manually selected similar-looking pairs; doesn't represent arbitrary face-swap scenarios
  • Generation methods: Single GAN architecture; doesn't cover other manipulation approaches (e.g., facial reenactment methods such as Face2Face)
  • Compression: Videos not subjected to social media compression, which affects detection in real-world scenarios
  • Resolution: Maximum 128×128; lower than modern face-swap methods
  • Source data: VidTIMIT was recorded in a controlled environment (people facing the camera, reading predetermined phrases); doesn't cover diverse video production conditions

Historical significance

As the first public GAN-based Deepfake dataset, VidTIMIT-Deepfake was instrumental in establishing a benchmark for the field. It demonstrated that:

  • State-of-the-art face recognition systems are vulnerable to deepfakes
  • Visual quality is a critical factor in detectability
  • Audio-visual synchronization alone is insufficient for detection
  • An arms race exists between generation quality and detection capability

Newer, larger datasets have since superseded it in scale and diversity:

  • FaceForensics++ (2019): 1,000 original videos manipulated with four methods, including deepfakes; much larger, more comprehensive
  • Celeb-DF (2019): 5,639 fake videos of celebrities; higher visual quality than VidTIMIT-Deepfake
  • DFDC (2020): 128,154 videos from 3,426 actors; largest and most diverse deepfake detection benchmark

Access and usage

The dataset is freely available from the Idiap Research Institute website. A complete Python implementation and benchmark scores are provided as open source.

Notes

VidTIMIT-Deepfake occupies a unique historical position: it was the first dataset to publicly release GAN-generated deepfakes and comprehensively evaluate their impact on face recognition systems. While newer, larger datasets have superseded it in scale and diversity, the insights from this paper about the arms race between generation quality and detection difficulty remain foundational.