Two-Stream Neural Networks for Tampered Face Detection

Authors: Peng Zhou, Xintong Han, Vlad I. Morariu, Larry S. Davis
Institution: University of Maryland, College Park
ArXiv: 1803.11276

TL;DR

The paper proposes a two-stream neural network for detecting face tampering (face swaps) in images. One stream uses GoogLeNet to classify faces as tampered or authentic from visual artifacts, while the second stream refines steganalysis features with a triplet network to capture low-level noise inconsistencies. The combined approach achieves 0.927 AUC, and the paper introduces the SwapMe/FaceSwap dataset of 2,010 high-quality tampered face images.

Contributions

  • Two-stream architecture combining high-level visual artifact detection with low-level noise residual detection for face tampering
  • Patch-based triplet network refinement of steganalysis features to better capture in-camera noise and CFA patterns
  • SwapMe and FaceSwap dataset: 2,010 tampered images created using two different face-swapping algorithms, with diverse identities and realistic post-processing
  • Demonstration that complementary evidence streams improve robustness to post-processing techniques (resizing, blurring)

Method

The two-stream architecture captures different evidence of tampering:

Face Classification Stream: Fine-tunes a GoogLeNet-family network (the Inception V3 variant) to classify whether a face is tampered or authentic. The network learns high-level tampering artifacts such as stitching artifacts near face boundaries, unnatural edges around the lips, and blurring effects. Input faces are resized to 299×299.

Patch Triplet Stream: Extracts steganalysis features (CFA-aware features capturing local noise residuals) from image patches and refines them with a network trained under a triplet loss. The loss encourages patches from the same image to cluster together in embedding space while pushing patches from different images apart, which forces the network to learn camera and noise characteristics. For a test image, patches are extracted with a sliding window (128×128 patches, 64-pixel stride). An SVM is trained on the fly for each test image to classify patches as tampered (from a different image) or authentic (from the same image).
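The two mechanical steps in this stream, sliding-window patch extraction and the triplet objective, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the squared Euclidean distance, and the margin value are assumptions, and the steganalysis feature extractor and embedding network are left out.

```python
import numpy as np

def extract_patches(image, patch=128, stride=64):
    """Collect patch x patch windows from `image`, stepping by `stride` pixels.

    Windows that would run past the image border are skipped.
    """
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            patches.append(image[y:y + patch, x:x + patch])
    return patches

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on embedding vectors.

    `anchor` and `positive` are embeddings of patches from the same image,
    `negative` is an embedding of a patch from a different image.
    The squared Euclidean distance and margin=1.0 are assumed choices.
    """
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    # Loss is zero once the negative is farther than the positive by `margin`.
    return max(0.0, d_pos - d_neg + margin)
```

With these settings, a 256×256 image yields a 3×3 grid of overlapping patches, and the loss vanishes only when the different-image patch is at least the margin farther from the anchor than the same-image patch.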

Fusion: The final tampering score combines the GoogLeNet classification output F(q) with the average of the SVM patch scores, weighted by a balance factor λ.
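One simple reading of this fusion rule is an additive combination, sketched below. The additive form, the function name `fused_score`, and the default value of `lam` are assumptions made for illustration; the paper's exact formula may differ, for example in how patch scores are clipped or normalized.

```python
def fused_score(f_q, patch_scores, lam=1.0):
    """Combine the face-classifier score with the mean SVM patch score.

    f_q          -- classification-stream output F(q) for the query face
    patch_scores -- per-patch SVM scores from the patch triplet stream
    lam          -- balance factor (lambda); the default is an assumed placeholder
    """
    if not patch_scores:
        # No usable patches: fall back to the classifier score alone.
        return f_q
    mean_patch = sum(patch_scores) / len(patch_scores)
    return f_q + lam * mean_patch
```

Setting `lam` to 0 recovers the face classification stream alone, and larger values shift weight toward the noise-residual evidence.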

Results

Evaluated on the SwapMe test set using a cross-algorithm protocol (train on FaceSwap, test on SwapMe) so that the model cannot rely on algorithm-specific artifacts:

  • Two-stream network: AUC 0.927
  • Face classification stream alone: AUC 0.854
  • Patch triplet stream alone: AUC 0.875
  • Steganalysis features + SVM baseline: AUC 0.794
  • CFA pattern method: AUC 0.618
  • IDC (DCT-based): AUC 0.543

The two-stream network outperforms both of its individual streams (0.927 AUC vs. 0.875 and 0.854) and every baseline (0.794 at best), which supports the value of combining complementary detection signals. The method still detects tampering when post-processing (resizing, boundary blurring, blending) is applied.
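For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen tampered image receives a higher score than a randomly chosen authentic one, with ties counted as half. The sketch below computes it by direct pairwise comparison; it is an illustration of the metric, not the authors' evaluation code, and the function name and the label convention (1 = tampered, 0 = authentic) are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores -- tampering scores, higher meaning more likely tampered
    labels -- 1 for tampered, 0 for authentic (assumed convention)
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one tampered and one authentic sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # tampered image ranked above authentic one
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

A detector that ranks every tampered image above every authentic one scores 1.0, and random scoring gives about 0.5, which puts the 0.543 reported for IDC close to chance.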

Connections

Notes

Strengths: The two-stream approach is well motivated: combining visual artifacts (high-level) with noise residuals (low-level) provides robustness. The cross-algorithm training/testing protocol is a sound way to avoid overfitting to a specific face-swapping technique. Class Activation Map visualizations show that the network attends to meaningful tampering artifacts. The SwapMe/FaceSwap dataset addresses limitations of prior datasets by focusing specifically on face regions with realistic post-processing.

Limitations: The method struggles with small faces (< 50×50 pixels) because of upsampling loss in the classification stream and patch size constraints in the triplet stream. The triplet network requires steganalysis feature extraction, which adds computational cost; end-to-end learning could be more efficient. Evaluation is limited to the authors' SwapMe/FaceSwap dataset, so generalization to other tampering techniques or in-the-wild scenarios is unclear. The SVM is trained per image at test time, which is computationally expensive at deployment.

Relevance to fake news: Face tampering and deepfakes are important vectors for creating misleading content. While this paper focuses on technical detection rather than misinformation dynamics, it contributes forensic methods for detecting the kind of manipulated media that can be used in disinformation campaigns.