Deepfake Detection: A Systematic Literature Review

Authors: MD Shohel Rana, Mohammad Nur Nobi, Beddhu Murali, Andrew H. Sung

Venue: IEEE Access, 2022 — DOI

TL;DR

Systematic review of 112 deepfake detection studies (2018–2020) that categorizes techniques into deep learning, machine learning, statistical, and blockchain-based methods. Deep learning approaches (77% of studies, primarily CNNs) outperform non-deep methods on average, reaching about 89.7% mean accuracy versus 85.0% for machine learning-based methods. FaceForensics++ is the most frequently used benchmark dataset.

Contributions

Comprehensive taxonomy of deepfake detection techniques organized into four categories: deep learning-based (77%), machine learning-based (18%), statistical (3%), and blockchain-based (2%)
Systematic analysis of 112 peer-reviewed papers from 2018 to 2020 across multiple publication venues
Evaluation and comparison of detection models, feature extraction methods, and measurement metrics
Synthesis of commonly used deepfake datasets and their characteristics
Evidence-based performance comparison showing deep learning superiority

Method

The authors conducted a systematic literature review (SLR) following the Budgen and Brereton SLR framework. They defined research questions on deepfake detection techniques (RQ-1), empirical testing procedures (RQ-2), classification frameworks (RQ-3), and comparative efficacy (RQ-4, RQ-5). The search strategy combined Boolean queries across 10 digital repositories (including IEEE Xplore, ACM Digital Library, SpringerLink, and Semantic Scholar) and was restricted to English-language studies published from 2018 to 2020. Quality assessment criteria were used to select empirical studies that report their methodology and results explicitly.
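The search-and-filter step described above can be sketched in a few lines. This is only an illustration: the search terms, the record fields (`year`, `language`), and both helper names are hypothetical, since this summary does not list the exact query strings the authors used.

```python
def build_boolean_query(term_groups):
    """Join synonyms with OR inside each group, then join groups with AND."""
    clauses = []
    for group in term_groups:
        quoted = " OR ".join(f'"{term}"' for term in group)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


def in_scope(record, start_year=2018, end_year=2020, language="English"):
    """Apply the inclusion window stated in the review: 2018-2020, English only."""
    return start_year <= record["year"] <= end_year and record["language"] == language


# Hypothetical term groups, chosen only to show the query shape.
query = build_boolean_query([["deepfake", "face swap"], ["detection"]])

# Hypothetical candidate records returned by a repository search.
candidates = [
    {"title": "A", "year": 2019, "language": "English"},
    {"title": "B", "year": 2017, "language": "English"},
    {"title": "C", "year": 2020, "language": "German"},
]
selected = [r["title"] for r in candidates if in_scope(r)]
```

Quality assessment would be a further manual screening step on `selected`; it is not modeled here.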

Results

Detection techniques: Deep learning dominates: 70 of 91 studies (77%) employ deep learning-based methods. CNNs are the most prevalent architecture (32% of studies), followed by RNNs (including LSTM) and hybrid approaches. Traditional machine learning (SVM, Random Forest, k-NN) accounts for 18% of studies, statistical methods for 3%, and blockchain-based approaches for 2%.
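As a quick arithmetic check on the shares quoted above, the sketch below confirms that 70 of 91 rounds to 77% and that the four category percentages sum to 100. The dictionary and helper names are illustrative only; the figures are copied from this summary.

```python
# Category shares as reported in this summary (percent of studies).
category_share_percent = {
    "deep learning": 77,
    "machine learning": 18,
    "statistical": 3,
    "blockchain": 2,
}


def share_percent(count, total):
    """Share of `count` in `total`, rounded to the nearest whole percent."""
    return round(100 * count / total)


deep_learning_share = share_percent(70, 91)  # 70 / 91 is about 0.769
total_share = sum(category_share_percent.values())
```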

Models and features: Most deep learning approaches extract spatio-temporal features from face and video regions. Common CNN architectures include VGG, ResNet, Inception, and specialized networks (FaceNet, MesoNet). Feature categories span visual artifacts, frequency domain analysis, special artifacts, texture/spatio-temporal consistency, and facial landmarks. Biological signals (eye blinking, heartbeat patterns) supplement these features.
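Of the biological cues mentioned above, eye blinking is the simplest to illustrate. The sketch below uses the eye aspect ratio (EAR), a commonly used blink measure computed from six eye landmarks. It is an assumption for illustration, not necessarily how the reviewed studies extract blink features. It also assumes the landmarks are already available from some face landmark detector, and the threshold of 0.2 and the two-frame minimum are hypothetical values.

```python
import math


def eye_aspect_ratio(landmarks):
    """EAR from six (x, y) eye landmarks ordered p1..p6.

    p1 and p4 are the eye corners; p2, p3 lie on the upper lid and
    p6, p5 on the lower lid directly below them.
    """
    p1, p2, p3, p4, p5, p6 = landmarks
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


def count_blinks(ear_series, threshold=0.2, min_frames=2):
    """Count runs of at least `min_frames` consecutive frames with EAR below `threshold`."""
    blinks = 0
    run = 0
    for ear in ear_series:
        if ear < threshold:
            run += 1
        else:
            if run >= min_frames:
                blinks += 1
            run = 0
    if run >= min_frames:  # a blink still in progress at the end of the clip
        blinks += 1
    return blinks


# Synthetic landmarks: an open eye and a nearly closed eye.
open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
closed_eye = [(0, 0), (1, 0.1), (2, 0.1), (3, 0), (2, -0.1), (1, -0.1)]
```

A detector built on this cue would compare the blink rate of a clip against a plausible human range; that decision rule is left out because this summary does not specify one.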

Datasets: FaceForensics++ (FF++) appears in 52 studies, DFDC in 23, and DeeperForensics in 11. Most datasets use volunteer actors and scripted videos; some generate deepfakes with popular software (DeepFaceLab, FaceSwap).

Performance: Deep learning methods achieve a mean accuracy of 89.73% and a mean AUC of 0.917; machine learning-based methods reach 85.00% accuracy and 0.900 AUC. Standard metrics include accuracy, AUC, recall, precision, and F1-score. Inconsistent evaluation practices across studies (different dataset splits, metrics, and sample sizes) make direct comparisons between methods unreliable.
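For reference, the standard metrics named above can be computed as in the following minimal sketch for a binary real/fake classifier. The labels, scores, and the 0.5 decision threshold are made-up illustration values, not data from the review. AUC is computed here with the pairwise-ranking definition: the fraction of (fake, real) pairs in which the fake sample receives the higher score, with ties counted as one half.

```python
def classification_metrics(y_true, scores, threshold=0.5):
    """Accuracy, precision, recall, and F1 for labels in {0, 1} (1 = fake)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: three fake (1) and three real (0) samples.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
metrics = classification_metrics(y_true, scores)
auc = roc_auc(y_true, scores)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold while AUC does not, which is one reason results reported with different metrics are hard to compare across studies.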

Connections

Related to FaceForensics++, the most widely used benchmark dataset
Extended by DFDC dataset, another canonical deepfake detection benchmark
Complements Vaccari & Chadwick (2020) on disinformation implications of synthetic media
Shares methodology with DetectGPT on AI-generated content detection

Notes

At the time of publication, this was among the most comprehensive systematic reviews of the deepfake detection literature. Strengths include a rigorous SLR methodology, a large sample (112 papers), a structured taxonomy, and clear evidence that deep learning methods perform best on average. The review identifies three critical gaps: no standardized evaluation framework across studies, small deepfake datasets that limit generalization, and the ongoing arms race between generation and detection techniques. Its finding that inconsistent dataset usage and metrics undermine reproducible comparisons remains a challenge for the field. The paper's analysis of threats to validity (construct, internal, and external) is thorough and underscores the need for unified frameworks. The review covers only research published through 2020, so it does not include later advances such as foundation models and transformer-based detection methods.