Multimodal fake news
Detection and generation of disinformation that spans multiple modalities (text combined with images, video, audio, or real-time video deepfakes), where adversaries exploit inconsistencies or misalignments across modalities to evade detection.
Key observations
Multimodal generation is harder to detect: Machine-generated articles with both text and images are significantly harder for humans to identify than text-only articles. Naive users perform at near-random accuracy (46.2%) when articles contain images and captions.
Visual-semantic inconsistency as an attack vector: Generators struggle to maintain semantic consistency across text and visual modalities. Named entities in captions may mismatch the article body, or images may be contextually unrelated despite appearing realistic.
Type C articles are most deceptive: Generated article text paired with real images (Type C) presents the largest challenge for both human and automated detection, as real images provide surface credibility even when text is machine-generated.
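The caption/body entity mismatch noted above can be illustrated with a small check. This is a minimal sketch, not DIDAN itself (which is a learned model): it assumes that runs of capitalized words are an acceptable stand-in for a real named-entity recognizer, and the function names `extract_entities` and `caption_entity_overlap` are hypothetical, introduced only for illustration.

```python
import re

# ASSUMPTION: runs of capitalized words approximate named entities.
# A real system would use a trained NER model instead of this regex.
_ENTITY_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def extract_entities(text: str) -> set[str]:
    """Return lowercase candidate entities with whitespace normalized."""
    return {" ".join(m.group(0).lower().split()) for m in _ENTITY_RE.finditer(text)}


def caption_entity_overlap(body: str, caption: str) -> float:
    """Fraction of caption entities that also occur in the article body.

    A low score hints that the caption names people or places the
    article never mentions, i.e. a possible cross-modal inconsistency.
    """
    caption_entities = extract_entities(caption)
    if not caption_entities:
        return 1.0  # no entities in the caption, so nothing can mismatch

    body_lower = " ".join(body.lower().split())
    matched = {
        entity
        for entity in caption_entities
        # Whole-word match so "paris" does not match inside "comparison".
        if re.search(r"\b" + re.escape(entity) + r"\b", body_lower)
    }
    return len(matched) / len(caption_entities)


if __name__ == "__main__":
    body = "Chancellor Angela Merkel addressed parliament in Berlin on Tuesday."
    for caption in ("Angela Merkel speaks in Berlin", "Barack Obama visits Paris"):
        print(caption, "->", caption_entity_overlap(body, caption))
```

A score near 0 flags a caption whose entities are absent from the body. The heuristic is deliberately crude: sentence-initial words such as "The" are also treated as entities, and aliases or partial names are not resolved, so it should be read as an illustration of the signal rather than a usable detector.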
Key papers
- Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News: Introduces the first defense against multimodal neural fake news. Proposes DIDAN, which detects visual-semantic inconsistencies via named entity co-occurrence analysis. The accompanying NeuralNews dataset contains 128K articles across four types, and the study shows that humans require explicit visual-semantic cues to detect generated articles.
- Defending Against Neural Fake News: GROVER generates news articles with high fidelity. The paper is foundational for understanding neural text generation attacks that could be extended to multimodal settings.
Related topics
- Fake news detection — general fake news detection methods
- Visual-semantic inconsistency — detecting mismatches between images and text
- Image Caption Generation — generating captions for images
- Neural text generation — detecting machine-generated text
- Disinformation — understanding and countering false information