GAN-Generated Video

GAN-based video synthesis generates photorealistic video sequences with generative adversarial networks (GANs): a generator network produces frames, a discriminator network tries to tell them apart from real footage, and the generator becomes more realistic by learning from that feedback. This technology underpins deepfakes, face swapping, and other synthetic media.
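The adversarial feedback loop can be illustrated with the two losses involved. This is a minimal sketch, assuming the common binary cross-entropy formulation with a non-saturating generator loss; the function names and the discriminator outputs are hypothetical and chosen only for illustration.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    eps = 1e-12  # guard against log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def discriminator_loss(d_real, d_fake):
    """Average loss for labelling real frames 1 and generated frames 0."""
    losses = [bce(p, 1) for p in d_real] + [bce(p, 0) for p in d_fake]
    return sum(losses) / len(losses)

def generator_loss(d_fake):
    """Non-saturating loss: low when the discriminator scores fakes as real."""
    return sum(bce(p, 1) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probability that a frame is real).
early = {"d_real": [0.9, 0.8], "d_fake": [0.1, 0.2]}  # fakes are easily caught
late = {"d_real": [0.6, 0.5], "d_fake": [0.5, 0.4]}   # fakes fool D more often

print("generator loss early/late:",
      generator_loss(early["d_fake"]), generator_loss(late["d_fake"]))
print("discriminator loss early/late:",
      discriminator_loss(**early), discriminator_loss(**late))
```

As the generator improves, its loss falls and the discriminator's loss rises, which is the sense in which the discriminator "provides feedback".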

Technical mechanisms

Autoencoder-based face swapping: An encoder extracts facial identity features from a source face, and a decoder reconstructs that face onto the target video while preserving the target's facial expressions, lip movements, and head pose.

Frame-by-frame synthesis: GANs generate individual video frames, with temporal consistency enforced through one or more of:

  • Optical flow guidance: Using computed motion between frames to constrain generation
  • Recurrent architectures (LSTM, ConvLSTM): Maintaining hidden state across frames to ensure coherent temporal dynamics
  • Temporal discriminators: Discriminators that evaluate both frame realism and temporal smoothness
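As an illustration of the optical flow option, the sketch below warps the previous frame along a given flow field and penalizes differences from the current frame. It is a simplified sketch under stated assumptions: frames are small grayscale grids, the flow is integer-valued and already computed, and `warp_previous` and `temporal_loss` are hypothetical names rather than part of any library.

```python
def warp_previous(prev, flow):
    """Backward-warp `prev` so it lines up with the current frame.

    flow[y][x] = (dy, dx) points from a current-frame pixel to its location
    in the previous frame. Out-of-bounds lookups are returned as None.
    """
    h, w = len(prev), len(prev[0])
    warped = []
    for y in range(h):
        row = []
        for x in range(w):
            dy, dx = flow[y][x]
            sy, sx = y + dy, x + dx
            inside = 0 <= sy < h and 0 <= sx < w
            row.append(prev[sy][sx] if inside else None)
        warped.append(row)
    return warped

def temporal_loss(prev, curr, flow):
    """Mean absolute difference between `curr` and the flow-warped `prev`."""
    warped = warp_previous(prev, flow)
    diffs = [abs(c - p)
             for curr_row, warped_row in zip(curr, warped)
             for c, p in zip(curr_row, warped_row)
             if p is not None]  # skip pixels with no valid correspondence
    return sum(diffs) / len(diffs) if diffs else 0.0

# A bright pixel moves one step to the right between frames.
prev = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
curr = [[0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
flicker = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]  # the pixel vanishes
flow = [[(0, -1)] * 4 for _ in range(3)]  # every pixel came from its left

print(temporal_loss(prev, curr, flow))     # motion-consistent frame
print(temporal_loss(prev, flicker, flow))  # flickering frame is penalized
```

In training, a term like this would be added to the generator objective so that frames which contradict the estimated motion cost more than frames which follow it.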

Blending and composition: Generated faces are blended with target video backgrounds using:

  • Face segmentation masks: Detecting the facial region and blending in the generated face
  • Facial landmark alignment: Aligning generated and original faces via computed keypoints (eyes, nose, mouth corners)
  • Histogram normalization: Adjusting lighting and color to match the target video's environment
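The three blending steps above can be sketched in a few lines. This is a toy illustration, not a production pipeline: landmarks are 2D points, images are flat lists of grayscale intensities, mean and standard deviation matching stands in for full histogram normalization, and all function names are hypothetical.

```python
import statistics

def fit_similarity(src_pts, dst_pts):
    """Least-squares scale + rotation + translation mapping src onto dst.

    Points are treated as complex numbers, so the transform is z -> a*z + b.
    """
    src = [complex(x, y) for x, y in src_pts]
    dst = [complex(x, y) for x, y in dst_pts]
    src_mean = sum(src) / len(src)
    dst_mean = sum(dst) / len(dst)
    num = sum((s - src_mean).conjugate() * (d - dst_mean)
              for s, d in zip(src, dst))
    den = sum(abs(s - src_mean) ** 2 for s in src)
    a = num / den                # complex factor: scale and rotation
    b = dst_mean - a * src_mean  # translation
    return a, b

def match_mean_std(generated, reference):
    """Shift and scale intensities so mean and spread match the reference."""
    g_mean, g_std = statistics.fmean(generated), statistics.pstdev(generated)
    r_mean, r_std = statistics.fmean(reference), statistics.pstdev(reference)
    scale = r_std / g_std if g_std > 0 else 1.0
    return [(v - g_mean) * scale + r_mean for v in generated]

def blend(generated, target, mask):
    """Per-pixel alpha blend: mask 1 keeps the generated face, 0 the target."""
    return [m * g + (1.0 - m) * t for g, t, m in zip(generated, target, mask)]

# Landmarks (eyes, nose) in the generated face and in the target frame.
a, b = fit_similarity([(0, 0), (1, 0), (0, 1)], [(3, 4), (3, 6), (1, 4)])
print("scale:", abs(a), "translation:", (b.real, b.imag))

face = match_mean_std([0.2, 0.4, 0.6], reference=[0.5, 0.7, 0.9])
print(blend(face, target=[0.1, 0.1, 0.1], mask=[1.0, 0.5, 0.0]))
```

A soft mask (values between 0 and 1 near the face boundary) avoids the hard seam that a binary mask would leave.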

Quality factors

Training duration (number of iterations) and model size directly affect output realism. The quality tiers below are labeled by output resolution:

  • Low-quality (64×64): Noticeably synthetic; easier to detect than high-quality output, though detection remains challenging
  • High-quality (128×128+): Highly realistic; detection error rates exceed 8% even with specialized methods

Resolution is the primary factor driving detection difficulty: higher-resolution deepfakes look more realistic and leave fewer detectable artifacts.
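The text above does not say how the detection error rate is measured. One common choice for detectors is the equal error rate (EER), the operating point where the false alarm rate on real videos equals the miss rate on fakes. The sketch below computes it under that assumption; the scores are made-up illustrative values, not results from any detector.

```python
def error_rates(real_scores, fake_scores, threshold):
    """Videos with a 'fakeness' score >= threshold are flagged as fake."""
    false_alarm = sum(s >= threshold for s in real_scores) / len(real_scores)
    missed = sum(s < threshold for s in fake_scores) / len(fake_scores)
    return false_alarm, missed

def equal_error_rate(real_scores, fake_scores):
    """Scan candidate thresholds and return (EER, threshold) where the
    false alarm rate and miss rate are closest to each other."""
    best = None
    for t in sorted(set(real_scores) | set(fake_scores)):
        false_alarm, missed = error_rates(real_scores, fake_scores, t)
        gap = abs(false_alarm - missed)
        if best is None or gap < best[0]:
            best = (gap, (false_alarm + missed) / 2, t)
    return best[1], best[2]

# Hypothetical detector scores: one real video looks fake, one fake looks real.
real_scores = [0.1, 0.2, 0.3, 0.6]
fake_scores = [0.4, 0.7, 0.8, 0.9]

eer, threshold = equal_error_rate(real_scores, fake_scores)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

The more the score distributions of real and fake videos overlap, the higher the EER, which is how harder-to-detect deepfakes show up in this metric.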

Key papers in this wiki