Multimodal fake news

Detection and generation of disinformation that spans multiple modalities: text combined with images, video, or audio, including real-time video deepfakes. Adversaries exploit inconsistencies or misalignments across modalities to evade detection.

Key observations

Multimodal generation is harder to detect: Machine-generated articles with both text and images are significantly harder for humans to identify than text-only articles. Naive users perform at near-random accuracy (46.2%) when articles contain images and captions.

Visual-semantic inconsistency as an attack vector: Generators struggle to maintain semantic consistency across text and visual modalities. Named entities in captions may mismatch the article body, or images may be contextually unrelated despite appearing realistic.

Type C articles are most deceptive: Generated article text paired with real images (Type C) presents the largest challenge for both human and automated detection, as real images provide surface credibility even when text is machine-generated.
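
The caption/body entity mismatch described above can be illustrated with a rough sketch. This is not DIDAN, which is a learned model; it is a toy heuristic that assumes runs of capitalized words approximate named entities. The names `candidate_entities` and `caption_entity_overlap` are hypothetical, and a real system would use a proper NER model instead of the regular expression.

```python
from __future__ import annotations

import re

# Assumption: runs of capitalized words stand in for named entities.
# This is a crude placeholder for a real NER model.
_ENTITY_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def candidate_entities(text: str) -> set[str]:
    """Return lowercased capitalized-word runs found in the text."""
    return {m.group(0).lower() for m in _ENTITY_RE.finditer(text)}


def caption_entity_overlap(caption: str, body: str) -> float:
    """Fraction of caption entities that also occur in the article body.

    Returns 1.0 when the caption has no candidate entities, since there
    is nothing that could mismatch.
    """
    entities = candidate_entities(caption)
    if not entities:
        return 1.0
    body_lower = body.lower()
    hits = sum(1 for entity in entities if entity in body_lower)
    return hits / len(entities)


if __name__ == "__main__":
    body = "Chancellor Angela Merkel addressed reporters in Berlin on Monday."
    print(caption_entity_overlap("Angela Merkel speaks in Berlin", body))
    print(caption_entity_overlap("Emmanuel Macron arrives in Paris", body))
```

A low overlap score only flags a caption for closer inspection. Sentence-initial words and unnamed subjects will confuse the heuristic, so it should not be read as a detector on its own.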

Key papers

Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News: Introduces the first defense against multimodal neural fake news. Proposes DIDAN, which detects visual-semantic inconsistencies via named entity co-occurrence analysis. The accompanying NeuralNews dataset contains 128K articles across four types. The paper also shows that humans require explicit visual-semantic cues to detect generated articles.
Defending Against Neural Fake News: Introduces GROVER, which generates news articles with high fidelity. Foundational for understanding neural text generation attacks that could be extended to multimodal settings.

Related topics

Fake news detection: general fake news detection methods
Visual-semantic inconsistency: detecting mismatches between images and text
Image Caption Generation: generating captions for images
Neural text generation: generating and detecting machine-generated text
Disinformation: understanding and countering false information