Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News

Authors: Reuben Tan, Bryan A. Plummer, Kate Saenko
Venue: arXiv, 2020 (arXiv:2009.07698)

TL;DR

Machine-generated news articles paired with images and captions often contain visual-semantic inconsistencies that humans struggle to detect. This paper introduces DIDAN, a named-entity-based approach to identifying cross-modal semantic mismatches, and NeuralNews, a dataset of 128K articles for benchmarking multimodal fake news detection. In user studies, naive participants identify generated articles with only about 46% accuracy, while participants trained to look for visual-semantic cues reach about 68%.

Contributions

First defense against multimodal neural fake news: Prior work addressed text-only generated news (GROVER) or manipulated images/deepfakes separately. This is the first paper to defend against machine-generated articles that include both text and matched images/captions.
NeuralNews dataset: 128K news articles from GoodNews (real) and GROVER-generated (fake), organized into four article types: (A) real articles + real captions, (B) real articles + generated captions, (C) generated articles + real captions, (D) generated articles + generated captions.
DIDAN model: Named entity-based detection of visual-semantic inconsistencies. Computes authenticity scores by measuring named entity co-occurrence between article body and image captions.
Human susceptibility analysis: Comprehensive user studies measuring human ability to detect generated articles with varying levels of training and cue-provision.
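
The four NeuralNews article types form a 2×2 grid over whether the body and the captions are generated. A minimal sketch of that mapping follows; the names `ArticleType` and `article_type` are hypothetical and chosen for illustration, not taken from the dataset's own tooling.

```python
from enum import Enum


class ArticleType(Enum):
    """The four article types described above (labels are illustrative)."""
    A = "real article + real captions"
    B = "real article + generated captions"
    C = "generated article + real captions"
    D = "generated article + generated captions"


def article_type(body_generated: bool, captions_generated: bool) -> ArticleType:
    """Map the two binary provenance flags to the A-D type label."""
    if not body_generated:
        return ArticleType.B if captions_generated else ArticleType.A
    return ArticleType.D if captions_generated else ArticleType.C


if __name__ == "__main__":
    # A generated body paired with real captions is Type C.
    print(article_type(body_generated=True, captions_generated=False).name)
```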

Method

DIDAN exploits mismatches among the text, images, and captions of generated articles. The core insight is that machine generators often fail to maintain semantic consistency across modalities: they may generate plausible article text and realistic images but pair them incorrectly.

Article representation: Articles are represented as sets of sentences and image-caption pairs. Each sentence is encoded to a vector space. Named entity recognition (via SpaCy) extracts entities from both the article body and captions.

Authenticity score: Computed as the probability that an article is human-generated based on the co-occurrence of named entities in the article text and image captions. The model is trained on real vs. generated article pairs using a simple learning paradigm: articles paired with non-matching images/captions are negative samples, matching pairs are positive.
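
The positive/negative pairing described above can be sketched as follows. This is only an illustration of building mismatched pairs, not the paper's data pipeline: the function name `build_training_pairs`, the `(body, caption)` tuple layout, and the rotation-based mismatch are all assumptions made for this example.

```python
import random
from typing import List, Tuple

# (article body, caption, label) where label 1 = matching, 0 = mismatched.
Pair = Tuple[str, str, int]


def build_training_pairs(articles: List[Tuple[str, str]], seed: int = 0) -> List[Pair]:
    """Pair each body with its own caption (positive) and with a caption
    taken from a different article (negative).

    Assumption: a single random rotation of the caption list is used so that
    no body is ever paired with its own caption in the negative set.
    """
    n = len(articles)
    if n < 2:
        raise ValueError("need at least two articles to form mismatched pairs")

    rng = random.Random(seed)
    offset = rng.randrange(1, n)  # 1..n-1, so (i + offset) % n != i

    pairs: List[Pair] = []
    for i, (body, caption) in enumerate(articles):
        pairs.append((body, caption, 1))
        _, other_caption = articles[(i + offset) % n]
        pairs.append((body, other_caption, 0))
    return pairs


if __name__ == "__main__":
    toy = [
        ("Body about an election.", "Voters queue at a polling station."),
        ("Body about a storm.", "Waves hit the harbour wall."),
        ("Body about a match.", "Players celebrate a goal."),
    ]
    for body, caption, label in build_training_pairs(toy):
        print(label, "|", body, "|", caption)
```

Rotating the caption list is a simple way to guarantee a mismatch; sampling negatives independently per article would also work but needs an explicit check against self-pairing.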

Visual-semantic inconsistency detection: The model identifies mismatches by comparing entity mentions in article text vs. captions. For example, if an article mentions "Prime Minister Theresa May" but the caption refers to "the United Kingdom flag," this inconsistency signals potential generation.
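
To make the entity comparison concrete, here is a rough heuristic sketch. It is not DIDAN's learned authenticity score: it assumes entities have already been extracted (the source uses SpaCy; here they are supplied by hand), and the names `caption_entity_overlap`, `looks_inconsistent`, and the 0.5 threshold are hypothetical.

```python
from typing import Iterable, Set


def normalize(entities: Iterable[str]) -> Set[str]:
    """Lower-case and strip entity strings so matching ignores case."""
    return {e.strip().lower() for e in entities if e.strip()}


def caption_entity_overlap(article_entities: Iterable[str],
                           caption_entities: Iterable[str]) -> float:
    """Fraction of caption entities that also appear in the article body.

    Assumption: a caption with no entities scores 0.0 (no evidence of
    agreement); a real system would likely treat that case separately.
    """
    article = normalize(article_entities)
    caption = normalize(caption_entities)
    if not caption:
        return 0.0
    return len(caption & article) / len(caption)


def looks_inconsistent(article_entities: Iterable[str],
                       caption_entities: Iterable[str],
                       threshold: float = 0.5) -> bool:
    """Flag the pair when overlap falls below an illustrative threshold."""
    return caption_entity_overlap(article_entities, caption_entities) < threshold


if __name__ == "__main__":
    # Hand-written entity sets mirroring the example in the text above.
    article = {"Theresa May", "London"}
    caption = {"United Kingdom"}
    print(caption_entity_overlap(article, caption))
    print(looks_inconsistent(article, caption))
```

Exact string matching misses aliases and partial mentions (for example "May" versus "Theresa May"), which is one reason a learned model is preferable to a fixed overlap rule.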

Results

User study results:

- Naive users: 46.2% accuracy (below random guessing) at identifying generated articles
- Trained users given visual-semantic cues: 67.8% accuracy
- Type C articles (generated text + real images) are most deceptive (only 42.7% naive user accuracy)

DIDAN performance:

- Effectively detects Type C articles (generated bodies with real images)
- Achieved strong performance on benchmark datasets
- Metadata (authors, date, domain) further improves detection via pretrained generators like GROVER

Key finding: Even when metadata is provided to pretrained models like GROVER, visual-semantic inconsistency remains a critical signal. Humans focus on images/captions when given explicit cues, improving detection accuracy by 21.6 percentage points over naive settings.

Connections

Related to GROVER as a defense mechanism against generated text.
Extends work on detecting generated text to multimodal settings.
Builds on Image Caption Generation and visual-semantic matching literature.
Part of broader research on Multimodal Misinformation and cross-modal consistency.

Notes

Strengths: First comprehensive treatment of multimodal neural fake news; robust dataset spanning four article types; extensive human studies quantifying the deceptiveness of different article combinations.

Limitations: The approach relies on named entity recognition, which may fail for entities mentioned implicitly or in domains with sparse named entity use. User study sample sizes are modest (MTurk workers). The paper assumes adversaries don't know about visual-semantic checking; an adaptive generator could learn to match entities even when images are unrelated.

Future work: Extending detection to videos (with audio); studying adversarially-robust detectors; developing real-time detection systems; evaluating on real-world disinformation campaigns.