Vision-language models

Vision-language models are neural architectures that jointly process and reason over visual (image) and linguistic (text) information. They support downstream tasks such as visual question answering, image-text retrieval, visual commonsense reasoning, and detection of cross-modal mismatches, all of which are relevant to detecting multimodal misinformation.

Key architectures

ViLBERT (Vision-and-Language BERT): A dual-stream transformer with separate stacks for image regions and text, connected by co-attentive layers that let the two modalities interact. In each co-attentive layer, multi-head attention lets visual tokens attend to textual tokens and vice versa. Pre-trained on Conceptual Captions with two objectives: masked multimodal modeling and multimodal alignment prediction. Transfers well to multiple visiolinguistic benchmarks, typically after task-specific fine-tuning.
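The co-attentive exchange described for ViLBERT can be illustrated with a minimal single-head sketch in plain Python. This is not the ViLBERT implementation: it assumes both streams already share one embedding dimension, and it omits the learned query/key/value projections, multiple heads, residual connections, and layer normalization. The function names (`cross_attend`, `co_attention`) and the toy token vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        # Weighted average of the value vectors, one output per query.
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def co_attention(visual, textual):
    """One simplified co-attention step (assumption: no learned projections).

    Visual tokens query the textual stream, and textual tokens query the
    visual stream, so each modality is re-expressed in terms of the other.
    """
    visual_out = cross_attend(visual, textual, textual)
    textual_out = cross_attend(textual, visual, visual)
    return visual_out, textual_out

# Toy inputs: 3 image-region tokens and 2 word tokens, all of dimension 4.
visual = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
textual = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
visual_out, textual_out = co_attention(visual, textual)
```

Each output row is a weighted average of the other modality's tokens, which is the sense in which the two streams "interact" while keeping separate parameters elsewhere in the stack.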

CLIP: Contrastive pre-training on image-text pairs collected from the internet. It learns a shared embedding space in which an image and a caption that describe the same content lie close together. This enables zero-shot classification: a test image is compared against text descriptions of the target classes, and the closest description determines the label.
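The two ideas in the CLIP description, a contrastive objective over matched pairs and zero-shot classification by comparing an image to class descriptions, can be sketched with toy vectors. This is an illustration under assumptions, not CLIP's code: the embeddings are made up rather than produced by real encoders, the temperature of 0.07 is used as a fixed illustrative value, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, class_text_embs, temperature=0.07):
    """Pick the class whose text embedding is most similar to the image."""
    labels = list(class_text_embs)
    logits = [cosine(image_emb, class_text_embs[l]) / temperature for l in labels]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the similarity matrix.

    Assumes image i and text i form the matched pair; every other
    combination in the batch acts as a negative.
    """
    n = len(image_embs)
    sims = [[cosine(img, txt) / temperature for txt in text_embs]
            for img in image_embs]
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -sum(math.log(softmax(sims[i])[i]) for i in range(n)) / n
    # Text-to-image: same, but over columns.
    loss_t2i = -sum(math.log(softmax([sims[j][i] for j in range(n)])[i])
                    for i in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Zero-shot classification with made-up embeddings.
class_text_embs = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
}
label, probs = zero_shot_classify([0.9, 0.1, 0.0], class_text_embs)

# The loss is low when matched pairs align and high when they are shuffled.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In practice the embeddings would come from a pre-trained image encoder and text encoder; only the comparison step is shown here.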

BLIP, LLaVA, and other vision-language foundation models: Later models that couple visual encoders with language models, including large language models in the case of LLaVA, for richer reasoning.

Applications to misinformation

Satire detection: Li et al. (2020) fine-tuned ViLBERT to detect satirical news from headline-image pairs and reported 93.8% accuracy, showing that early fusion and multimodal pre-training outperform simple feature concatenation.
Cross-modal mismatch detection: Identifying text-image pairs whose content is incongruent (e.g., a serious political claim paired with an absurd image), which can signal manipulation or satire.
Fact-checking: Joint visual and textual analysis to verify claims made in news articles or social media posts.
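One simple way to operationalize cross-modal mismatch detection is to score each image-text pair by embedding similarity and flag low-scoring pairs. The sketch below assumes the embeddings come from some shared image-text space (for example a CLIP-style model); the threshold of 0.3, the function name `flag_mismatches`, and the toy vectors are hypothetical and would need tuning on labeled data. It is not the method used in the cited papers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def flag_mismatches(pairs, threshold=0.3):
    """Return ids of pairs whose image-text similarity falls below threshold.

    `pairs` is a list of (pair_id, image_embedding, text_embedding) tuples.
    The threshold is an illustrative assumption, not a recommended value.
    """
    flagged = []
    for pair_id, image_emb, text_emb in pairs:
        if cosine(image_emb, text_emb) < threshold:
            flagged.append(pair_id)
    return flagged

# Made-up embeddings: the first pair points the same way, the second does not.
pairs = [
    ("congruent", [0.8, 0.2, 0.1], [0.7, 0.3, 0.0]),
    ("incongruent", [0.9, 0.0, 0.1], [0.0, 1.0, 0.1]),
]
flagged = flag_mismatches(pairs)
```

A low score alone does not distinguish satire from manipulation or from an innocently loose caption, so a flag like this is better treated as one input feature than as a verdict.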

Key papers

Survey of Hallucination in Natural Language Generation: Section 12 covers object hallucination in image captioning and other vision-language model failures in multimodal generation.
A Multi-Modal Method for Satire Detection using Textual and Visual Cues: Fine-tunes ViLBERT for satire detection; demonstrates superiority of early fusion over uni-modal and late-fusion baselines.
Lu et al. (2019): The paper introducing ViLBERT.
Radford et al. (2021): The CLIP paper (Contrastive Language-Image Pre-training).

Connections

Multimodal fake news detection: vision-language models are a leading approach to this task.
Deep learning: transformers, BERT, and pre-training.
Satire detection: a specific application domain.

Notes

Vision-language models benefit from large-scale pre-training on diverse image-text pairs (e.g., Conceptual Captions), which enables transfer to downstream tasks with limited labeled data. However, pre-trained models reflect the biases and knowledge in their training data; they may miss context-dependent satire (e.g., satirical articles that require political knowledge to recognize as fake). Fine-tuning on task-specific data is typically necessary for good performance on misinformation detection tasks.
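As a lightweight stand-in for task-specific adaptation, the sketch below trains a linear probe (logistic regression) on frozen joint features rather than fine-tuning the whole model. This is a deliberate simplification: full fine-tuning also updates the encoder weights. The feature vectors, labels, learning rate, and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression on frozen features with batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            # Gradient of the mean log loss with respect to the logit.
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for d in range(dim):
                grad_w[d] += err * x[d] / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

def predict(weights, bias, x):
    """Return 1 (satire) if the predicted probability is at least 0.5."""
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0

# Hypothetical frozen joint features for four headline-image pairs.
features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = [0, 0, 1, 1]  # 1 = satire, 0 = not satire
weights, bias = train_linear_probe(features, labels)
preds = [predict(weights, bias, x) for x in features]
```

A probe like this only reweights what the pre-trained features already encode, so it cannot supply context, such as political background knowledge, that the features lack.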