MesoNet: A Compact Facial Video Forgery Detection Network¶

Authors: Darius Afchar, Vincent Nozick, Junichi Yamagishi, Isao Echizen Venue: arXiv, 2018 — arXiv:1809.00888

TL;DR¶

This paper proposes two lightweight CNN architectures (Meso-4 and MesoInception-4) to detect facial video forgeries created by Deepfake and Face2Face techniques. Operating at the mesoscopic level of image analysis, the networks achieve 98% detection accuracy for Deepfake and 95% for Face2Face videos under realistic compression conditions. The authors also introduce the first publicly available Deepfake detection dataset.

Contributions¶

Two efficient CNN architectures specifically designed for deepfake and Face2Face video detection at the mesoscopic level
First publicly available dataset of videos generated with the Deepfake technique
Demonstration that lightweight networks with few parameters (~28K) outperform more complex architectures for this task
Analysis of network robustness to video compression at different levels
Interpretability analysis showing that eyes and mouth features are critical for detecting Deepfake forgeries

Method¶

Detection Strategy: The authors position their method at the mesoscopic level of analysis—intermediate between microscopic signal-based analysis (which degrades with video compression) and semantic-level analysis (where humans struggle to distinguish forged faces). This avoids the compression artifacts that destroy traditional forensic signals while remaining below the semantic complexity where human perception fails.

Meso-4 Architecture: A simple four-layer convolutional network followed by a dense layer with one hidden unit. Uses ReLU activations and batch normalization for regularization. Total of 27,977 trainable parameters. Input size 256×256×3, outputs binary classification (forged/real).

MesoInception-4 Architecture: Replaces the first two convolutional layers of Meso-4 with modified inception modules using 3×3 dilated convolutions (rather than 5×5) to avoid high semantic information. Includes 1×1 convolutions for dimension reduction and skip connections. Total of 28,615 trainable parameters.

Training: Both networks trained with Adam optimizer (learning rate 10^-3 decaying to 10^-6 over 1000 iterations). Batch size 75, input images 256×256×3. Data augmentation via random zoom, rotation, horizontal flips, brightness and hue changes. Training completes in hours on consumer-grade hardware.

Results¶

Deepfake Detection: On their collected dataset of 175 forged videos and real faces: - Meso-4: 89.1% accuracy (per-frame), 96.9% with video aggregation - MesoInception-4: 91.7% accuracy (per-frame), 98.4% with video aggregation

Face2Face Detection: On the FaceForensics dataset with H.264 compression: - Compression level 0 (lossless): 96.8% (MesoInception-4) - Compression level 23 (light): 93.4% (MesoInception-4) - Compression level 40 (strong): 81.3% (MesoInception-4)

Video-level aggregation (averaging predictions across frames) significantly improves performance, achieving 95.3% on Face2Face at moderate compression.

Connections¶

Related to FaceForensics++ as a baseline detection method on their standardized dataset
Cited by Deepfake Detection Survey as foundational work in CNN-based deepfake detection
Relevant to Deepfake Detection techniques and Facial manipulation detection methods
Builds on Face Reenactment understanding by detecting Face2Face forgeries

Notes¶

Strengths: - Demonstrates that lightweight networks can match or exceed complex architectures for this task, making detection practical - First systematic analysis of Deepfake detection; the Deepfake tool had not been published academically at this time - Good interpretation of what networks learn: eyes and facial details are preserved in real faces but blurred in Deepfakes due to autoencoder compression - Addresses practical video compression effects, showing robustness even at strong compression levels

Limitations: - Deepfake dataset is relatively small (175 videos) compared to modern benchmarks - Performance degrades significantly at high compression levels (81% for Face2Face at level 40) - No comparison with other deepfake detection methods (as none existed at publication time) - Limited to detecting two specific forgery techniques; generalization to other deepfake tools unclear

Follow-ups: - Later work would explore ensemble methods and cross-compression generalization - The dataset approach influenced subsequent work on FaceForensics and other large-scale forgery benchmarks - MesoNet architectures became baseline detectors for many subsequent deepfake papers