Visual-semantic inconsistency

Detection of semantic mismatches and contradictions between visual content (images) and linguistic content (captions, article text). Machine-generated multimodal content often exhibits such inconsistencies, which can serve as a detection signal.

Key concepts

Cross-modal semantic alignment: In an authentic news article, the images, captions, and body text should be semantically coherent. The named entities, objects, and concepts mentioned in the body should align with those visible in or described by the images.

Named entity alignment: A key indicator of visual-semantic consistency is whether named entities (people, places, organizations) mentioned in the article text are also present in image captions. Machine generators often fail to maintain this alignment.
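The alignment check described above can be sketched roughly as follows. This is a minimal illustration, not the method of any particular paper: `extract_entities` is a hypothetical stand-in for a real NER model (it simply treats capitalized words as entity mentions), and normalizing by the caption's entities is an assumption made here for simplicity.

```python
import re

# Assumption: a tiny stopword list to drop capitalized sentence starters.
_STOPWORDS = {"the", "a", "an", "in", "on", "at", "this", "that", "it"}


def extract_entities(text):
    """Hypothetical NER stand-in: capitalized words, lowercased."""
    tokens = re.findall(r"\b[A-Z][a-zA-Z]+\b", text)
    return {t.lower() for t in tokens if t.lower() not in _STOPWORDS}


def caption_entity_coverage(article, caption):
    """Fraction of caption entities that also appear in the article body."""
    caption_entities = extract_entities(caption)
    if not caption_entities:
        return 0.0  # no entities in the caption, so nothing to align
    shared = caption_entities & extract_entities(article)
    return len(shared) / len(caption_entities)


article = "Angela Merkel met Emmanuel Macron in Paris on Monday."
matching = "Angela Merkel and Emmanuel Macron in Paris."
mismatched = "Barack Obama speaks in Chicago."

print(caption_entity_coverage(article, matching))
print(caption_entity_coverage(article, mismatched))
```

A caption whose entities all appear in the article yields a coverage of 1.0, while a caption about unrelated people and places yields 0.0. A real system would replace the capitalization heuristic with a trained entity recognizer and handle aliases and coreference.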

Authenticity scores: Quantitative metrics can estimate the probability that an article is human-generated from named-entity co-occurrence patterns between its text and captions, forming the basis for automated detection.
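One way such a score could be computed is to pass an entity-overlap measure through a logistic function. The sketch below is purely illustrative: the function name `authenticity_score`, the use of Jaccard overlap, and the `weight` and `bias` values are all assumptions chosen for the example, not parameters from any published model. In practice they would be learned from labeled articles.

```python
import math


def authenticity_score(article_entities, caption_entities, weight=4.0, bias=-2.0):
    """Map text-caption entity overlap to a pseudo-probability in (0, 1).

    weight and bias are hypothetical, hand-picked values for illustration.
    """
    article_entities = set(article_entities)
    caption_entities = set(caption_entities)
    union = article_entities | caption_entities
    # Jaccard overlap between the two entity sets (0.0 if both are empty).
    overlap = len(article_entities & caption_entities) / len(union) if union else 0.0
    # Logistic squashing: higher overlap gives a higher score.
    return 1.0 / (1.0 + math.exp(-(weight * overlap + bias)))


consistent = authenticity_score({"merkel", "paris"}, {"merkel", "paris"})
inconsistent = authenticity_score({"merkel", "paris"}, {"obama", "chicago"})
print(consistent > inconsistent)
```

With these illustrative parameters, identical entity sets score above 0.5 and disjoint sets score below it, so a threshold on the score could flag articles for closer inspection.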

Key papers

  • Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News: Proposes DIDAN, a named-entity-based approach to detecting visual-semantic inconsistencies in news articles. Shows that human detection accuracy improves from 46.2% (naive) to 67.8% (trained) when readers focus on visual-semantic cues, demonstrating the importance of cross-modal consistency signals.