Vision-language models
Vision-language models are neural architectures designed to jointly process and reason over visual (image) and linguistic (text) information. They enable downstream tasks like visual question answering, image-text retrieval, visual commonsense reasoning, and detection of cross-modal mismatches—all relevant to detecting multimodal misinformation.
Key architectures
ViLBERT (Vision & Language BERT): Dual-stream transformer with separate stacks for image regions and text tokens, connected by co-attentive layers that let the two modalities interact. In each co-attentive layer, multi-head attention takes its queries from one stream and its keys and values from the other, so visual tokens attend to textual tokens and vice versa. Pre-trained on Conceptual Captions with two objectives: masked multimodal modeling and multimodal alignment prediction. Transfers to multiple visiolinguistic benchmarks (visual question answering, visual commonsense reasoning, referring expressions, caption-based image retrieval) with only minor task-specific additions and fine-tuning.
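The co-attentive exchange described above can be sketched in a few lines. This is a minimal single-head illustration in plain Python, not the ViLBERT implementation: it assumes toy token vectors and omits the learned query/key/value projections, multiple heads, residual connections, and layer normalization that a real transformer layer has. The names `attend` and `co_attention` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Scaled dot-product attention where the other modality's tokens
    serve as both keys and values (no learned projections in this sketch)."""
    d = len(keys_values[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        # Each output token is a weighted average of the other stream's tokens.
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out

def co_attention(visual_tokens, text_tokens):
    """One co-attentive exchange: queries from one stream, keys/values from the other."""
    visual_out = attend(visual_tokens, text_tokens)  # vision attends to text
    text_out = attend(text_tokens, visual_tokens)    # text attends to vision
    return visual_out, text_out

# Toy inputs (assumed for illustration): 3 image-region vectors, 2 word vectors, dimension 4.
visual = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
text = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
visual_out, text_out = co_attention(visual, text)
```

Each stream keeps its own sequence length (three visual tokens in, three out), but every output vector is now built from the other modality's tokens, which is what lets later layers reason across image and text.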
CLIP: Contrastive pre-training on image-text pairs collected from the internet; learns a shared embedding space in which semantically matching images and captions lie close together. Enables zero-shot classification by embedding a text description of each target class and assigning a test image to the class whose description is most similar to it.
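As a rough illustration of the zero-shot step, the sketch below scores one image embedding against a set of class-description embeddings using cosine similarity followed by a softmax. It assumes the embeddings have already been produced by some image and text encoder; the toy vectors, the function name `zero_shot_classify`, and the temperature value are assumptions for illustration, not CLIP's actual API.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_embedding, prompt_embeddings, temperature=0.07):
    """Return (best_label, probabilities) for one image.

    prompt_embeddings maps a class description to its text embedding.
    The temperature is an assumed illustrative value that sharpens the softmax.
    """
    img = normalize(image_embedding)
    labels = list(prompt_embeddings)
    logits = []
    for label in labels:
        txt = normalize(prompt_embeddings[label])
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)
    # Softmax over the similarity logits (shifted by the max for stability).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs

# Toy 3-dimensional embeddings (hypothetical; real encoders output hundreds of dimensions).
prompt_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]
label, probs = zero_shot_classify(image_embedding, prompt_embeddings)
```

No classifier is trained here: changing the set of classes only requires writing new text descriptions and embedding them.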
BLIP, LLaVA, and other vision-language foundation models: Later models that pair visual encoders with language models (a large language model in the case of LLaVA) for richer reasoning.
Applications to misinformation
- Satire detection: Li et al. (2020) fine-tuned ViLBERT to detect satirical news by analyzing headline-image pairs; achieved 93.8% accuracy, demonstrating that early fusion and multi-modal pre-training outperform simple feature concatenation.
- Cross-modal mismatch detection: Identifying text-image pairs where the content is incongruent (e.g., a serious political claim paired with an absurd image), a signal of manipulation or satire.
- Fact-checking: Joint visual and textual analysis to verify claims made in news articles or social media posts.
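One simple way to operationalize the cross-modal mismatch idea from the list above is to embed the image and the text in a shared space and flag pairs whose similarity falls below a threshold. The sketch below is an assumption-laden baseline, not the method of any paper cited here: the embeddings are toy vectors, and `flag_mismatches` and its `threshold` default are hypothetical choices that would need tuning on labeled data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def flag_mismatches(pairs, threshold=0.3):
    """Return (pair_id, score) for every image-text pair scoring below the threshold.

    pairs: list of (pair_id, image_embedding, text_embedding) tuples.
    threshold: assumed cutoff; a low score suggests incongruent content.
    """
    flagged = []
    for pair_id, image_embedding, text_embedding in pairs:
        score = cosine_similarity(image_embedding, text_embedding)
        if score < threshold:
            flagged.append((pair_id, score))
    return flagged

# Toy embeddings (hypothetical): one congruent pair and one incongruent pair.
pairs = [
    ("congruent", [0.9, 0.1, 0.2], [0.8, 0.2, 0.1]),
    ("incongruent", [0.9, 0.1, 0.0], [0.0, 0.1, 0.9]),
]
flagged = flag_mismatches(pairs)
```

A low similarity score is only a signal, since satire and manipulation can both produce it; a fine-tuned classifier over the pair would normally make the final decision.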
Key papers
- Survey of Hallucination in Natural Language Generation: Section 12 of the survey covers object hallucination in image captioning and vision-language model failures in multimodal generation.
- A Multi-Modal Method for Satire Detection using Textual and Visual Cues: Fine-tunes ViLBERT for satire detection; demonstrates superiority of early fusion over uni-modal and late-fusion baselines.
- Lu et al. (2019): Introduces ViLBERT and its dual-stream, co-attentive architecture.
- Radford et al. (2021): CLIP paper (contrastive vision-language pre-training).
Connections
- Multimodal fake news detection — vision-language models are state-of-the-art for multimodal fake news detection.
- Deep learning — transformers, BERT, and pre-training.
- Satire detection — specific application domain.
Notes
Vision-language models benefit from large-scale pre-training on diverse image-text pairs (e.g., Conceptual Captions), which enables transfer to downstream tasks with limited labeled data. However, pre-trained models reflect the biases and knowledge in their training data; they may miss context-dependent satire (e.g., satirical articles that require political knowledge to recognize as fake). Fine-tuning on task-specific data is typically necessary for good performance on misinformation detection tasks.