Two-Stream Neural Networks for Tampered Face Detection

Authors: Peng Zhou, Xintong Han, Vlad I. Morariu, Larry S. Davis
Institution: University of Maryland, College Park
arXiv: 1803.11276

TL;DR

The paper proposes a two-stream neural network for detecting face tampering (face swaps) in images. One stream uses GoogLeNet to classify faces as tampered or authentic from visual artifacts, while the second stream refines steganalysis features with a triplet network to capture low-level noise inconsistencies. The combined approach reaches 0.927 AUC on the SwapMe test set, and the paper introduces the SwapMe/FaceSwap dataset of 2010 high-quality tampered face images.

Contributions

Two-stream architecture combining high-level visual artifact detection with low-level noise residual detection for face tampering
Patch-based triplet network refinement of steganalysis features to better capture in-camera noise and CFA patterns
SwapMe and FaceSwap dataset: 2010 tampered images created using two different face-swapping algorithms, with diverse identities and realistic post-processing
Demonstration that complementary evidence streams improve robustness to post-processing techniques (resizing, blurring)

Method

The two-stream architecture captures different evidence of tampering:

Face Classification Stream: Uses GoogLeNet (Inception V3) fine-tuned to classify whether a face is tampered or authentic. The network learns high-level tampering artifacts such as stitching artifacts near boundaries, unnatural edges around lips, and blurring effects. Input faces are resized to 299×299.

Patch Triplet Stream: Extracts steganalysis features (CFA-aware features capturing local noise residuals) from image patches and refines them with a network trained under a triplet loss. The triplet loss pulls patches from the same image close together in embedding space and pushes patches from different images apart, which steers the network toward camera and noise characteristics rather than image content. For a test image, patches are extracted with a sliding window (128×128 patches, 64-pixel stride). An SVM is then trained on the fly for each test image to classify patches as tampered (from a different image) or authentic (from the same image).
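The patch sampling and the triplet objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `patch_origins` and `triplet_loss` are hypothetical helper names, the squared Euclidean distance and the margin of 1.0 are assumptions, and the embeddings are plain lists standing in for refined steganalysis features.

```python
def patch_origins(height, width, patch=128, stride=64):
    """Top-left (row, col) corners of every full patch that fits in the image."""
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form).

    anchor and positive come from the same image, negative from a different one.
    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Sliding window over a 256x320 face crop with the patch size and stride from the text.
origins = patch_origins(256, 320)

# Toy 2-D embeddings: a well-separated triplet and a violating one.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Under these assumptions a 256×320 crop yields 12 patches, while a crop smaller than 128 pixels on either side yields none, which is one way to see why very small faces are hard for this stream.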

Fusion: The final tampering score for a query face q combines the GoogLeNet classification output F(q) with the average of the SVM patch scores, weighted by a balance factor λ.
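As a rough sketch of the fusion step, the snippet below adds the λ-weighted mean patch score to the face-classification score. The additive form, the function name `fused_score`, the fallback for an empty patch list, and the example values are all assumptions made for illustration; the summary above only states that the two scores are combined with a balance factor λ.

```python
def fused_score(face_score, patch_scores, lam=1.0):
    """Combine the classifier output F(q) with the mean SVM patch score.

    Assumed form: F(q) + lam * mean(patch_scores). If no patches could be
    extracted, fall back to the face-classification score alone.
    """
    if not patch_scores:
        return face_score
    mean_patch = sum(patch_scores) / len(patch_scores)
    return face_score + lam * mean_patch


# Hypothetical scores: one classifier output and two per-patch SVM scores.
score = fused_score(0.8, [0.2, 0.4], lam=0.5)
no_patches = fused_score(0.8, [], lam=0.5)
```

Setting λ to 0 reduces the score to the face classification stream alone, so λ controls how much weight the low-level noise evidence receives.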

Results

Evaluated on the SwapMe test set using a cross-training protocol (train on FaceSwap, test on SwapMe, so the detector cannot rely on algorithm-specific artifacts):

Two-stream network: AUC 0.927
Face classification stream alone: AUC 0.854
Patch triplet stream alone: AUC 0.875
Steganalysis features + SVM baseline: AUC 0.794
CFA pattern method: AUC 0.618
IDC (DCT-based): AUC 0.543

The two-stream approach outperforms every baseline listed above (0.927 AUC versus 0.794 for the strongest one, steganalysis features + SVM) and also beats either stream used alone, which supports the value of combining complementary detection signals. The method detects tampering even when post-processing (resizing, boundary blurring, blending) is applied.
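For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen tampered image receives a higher score than a randomly chosen authentic one. The sketch below computes it by pairwise comparison; `roc_auc` is a hypothetical helper and the labels and scores are toy values, not data from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for tampered, 0 for authentic. Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one tampered and one authentic sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: one tampered image (score 0.4) is ranked below an authentic one (0.5).
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

An AUC of 0.5 corresponds to chance-level ranking, which puts the 0.543 reported for the DCT-based baseline in context.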

Connections

Related to FaceForensics++: Learning to Detect Manipulated Facial Images via shared focus on face manipulation detection
Related to DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection on broader deepfake/face tampering detection methods
Related to Deepfakes and Disinformation: Exploring the Impact of Synthetic Political Video on Deception, Uncertainty, and Trust in News on the connection between manipulated media and the spread of misinformation
Related to Exposing Deep Fakes Using Inconsistent Head Poses and other image forensics work through its reliance on hand-designed forensic cues (here, steganalysis features)
Contributes a dataset similar in purpose to The DeepFake Detection Challenge (DFDC) Dataset for face manipulation detection

Notes

Strengths: The two-stream approach is well motivated: combining visual artifacts (high-level) with noise residuals (low-level) provides robustness. The cross-algorithm training/testing protocol is a sound way to avoid overfitting to a specific face-swapping technique. Visualization of Class Activation Maps shows that the network attends to meaningful tampering artifacts. The SwapMe/FaceSwap dataset addresses limitations of prior datasets by focusing specifically on face regions with realistic post-processing.

Limitations: The method struggles with small faces (< 50×50 pixels) because of information lost to upsampling in the classification stream and the fixed patch size in the triplet stream. The triplet network requires steganalysis feature extraction, which adds computational cost; end-to-end learning could be more efficient. Evaluation is limited to the authors' SwapMe/FaceSwap dataset, so generalization to other tampering techniques or in-the-wild scenarios is unclear. The SVM is trained per image at test time, which is computationally expensive in deployment.

Relevance to fake news: Face tampering and deepfakes are important vectors for creating misleading content. While this paper focuses on technical detection rather than misinformation dynamics, it contributes methods for forensic analysis of the kind of manipulated media that could be used in disinformation campaigns.