Multimodal fake news

Detection and generation of disinformation that spans multiple modalities: text combined with images, audio, or video, including real-time video deepfakes. Adversaries exploit inconsistencies or misalignments across modalities to evade detection.

Key observations

Multimodal generation is harder to detect: Machine-generated articles with both text and images are significantly harder for humans to identify than text-only articles. Naive users perform at near-random accuracy (46.2%) when articles contain images and captions.

Visual-semantic inconsistency as an attack vector: Generators struggle to maintain semantic consistency across text and visual modalities. Named entities in captions may mismatch the article body, or images may be contextually unrelated despite appearing realistic.
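The caption-versus-body entity mismatch described above can be illustrated with a small check. This is a minimal sketch and not the DIDAN model: it assumes a crude, hypothetical stand-in for named entity recognition (runs of capitalized words) and a hypothetical scoring function, `caption_entity_overlap`, that reports the fraction of caption entities also found in the article body. A real system would use a trained NER model and learned features.

```python
from __future__ import annotations

import re

# Hypothetical heuristic: treat runs of capitalized words as named entities.
# This is an assumption for illustration; it is not a real NER model.
_ENTITY_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def extract_entities(text: str) -> set[str]:
    """Return the lowercased capitalized-word runs found in text."""
    return {m.group(0).lower() for m in _ENTITY_RE.finditer(text)}


def caption_entity_overlap(body: str, caption: str) -> float:
    """Fraction of caption entities that also occur in the article body.

    Uses plain substring matching, so short entities can match inside
    longer words; this is a known limitation of the sketch.
    """
    caption_entities = extract_entities(caption)
    if not caption_entities:
        return 1.0  # no entities in the caption, so nothing can mismatch
    body_lower = body.lower()
    hits = sum(1 for entity in caption_entities if entity in body_lower)
    return hits / len(caption_entities)


body = "Chancellor Angela Merkel addressed reporters in Berlin on Tuesday."
consistent = caption_entity_overlap(body, "Angela Merkel speaks in Berlin")
mismatched = caption_entity_overlap(body, "Emmanuel Macron speaks in Paris")
```

A low score flags a caption whose named entities never appear in the body, which is one possible cue for closer inspection. A high score does not show that an article is genuine.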

Type C articles are most deceptive: Generated article text paired with real images (Type C) presents the largest challenge for both human and automated detection, as real images provide surface credibility even when text is machine-generated.

Key papers

  • Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News: Introduces the first defense against multimodal neural fake news. Proposes DIDAN, which detects visual-semantic inconsistencies via named entity co-occurrence analysis. The NeuralNews dataset contains 128K articles across four types. The paper also shows that humans require explicit visual-semantic cues to detect generated articles.
  • Defending Against Neural Fake News: Introduces GROVER, a model that generates news articles with high fidelity. Foundational for understanding neural text generation attacks that could be extended to multimodal settings.