Vision-language models

Vision-language models are neural architectures designed to jointly process and reason over visual (image) and linguistic (text) information. They enable downstream tasks such as visual question answering, image-text retrieval, visual commonsense reasoning, and detection of cross-modal mismatches, all of which are relevant to detecting multimodal misinformation.

Key architectures

ViLBERT (Vision-and-Language BERT): A dual-stream transformer with separate transformer stacks for images and text, connected by co-attentional layers that let the two modalities interact. In each co-attentional layer, multi-head attention is computed so that visual tokens attend to textual tokens and vice versa. The model is pre-trained on Conceptual Captions with two objectives: masked multimodal modeling and multimodal alignment prediction. The pre-trained model transfers well to several visiolinguistic benchmarks, typically after task-specific fine-tuning.
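The co-attention exchange described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not ViLBERT's actual implementation: it uses a single attention head, omits the learned query/key/value projections, residual connections, and layer normalization, and the token vectors are made-up toy values. The function names `cross_attend` and `co_attention` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Single-head scaled dot-product attention.

    Each query token (from one modality) attends over all tokens of the
    other modality, which serve as both keys and values here because the
    learned projections are omitted (an assumption of this sketch).
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        outputs.append([sum(w * kv[i] for w, kv in zip(weights, keys_values))
                        for i in range(dim)])
    return outputs

def co_attention(visual_tokens, text_tokens):
    """Visual tokens attend to text tokens, and text tokens attend to visual tokens."""
    visual_out = cross_attend(visual_tokens, text_tokens)
    text_out = cross_attend(text_tokens, visual_tokens)
    return visual_out, text_out

# Toy 2-dimensional token vectors (illustrative values only).
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
v_out, t_out = co_attention(visual, text)
```

Each stream keeps its own sequence length: `v_out` has one vector per visual token and `t_out` one per text token, but every output vector is now a weighted mixture of the other modality's tokens.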

CLIP: Contrastive pre-training on image-text pairs collected from the internet. It learns a shared embedding space in which semantically matching images and captions lie close together. This enables zero-shot classification by comparing a test image against text descriptions of the target classes.
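A rough sketch of zero-shot classification in a shared embedding space follows. It assumes the image and the class descriptions have already been embedded by some encoder pair; the vectors below are made-up toy values, the function `zero_shot_classify` is hypothetical, and the temperature of 0.07 is an illustrative choice rather than a value taken from this document.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def zero_shot_classify(image_emb, class_embs, temperature=0.07):
    """Pick the class description whose embedding is closest to the image.

    class_embs maps a text description to its (precomputed) embedding.
    Returns the best description and a softmax distribution over all of them.
    """
    scaled = {label: cosine(image_emb, emb) / temperature
              for label, emb in class_embs.items()}
    m = max(scaled.values())
    exps = {label: math.exp(s - m) for label, s in scaled.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs

# Toy embeddings standing in for encoder outputs (illustrative values only).
image_embedding = [0.9, 0.1, 0.2]
class_embeddings = {
    "a photo of a dog": [1.0, 0.0, 0.1],
    "a photo of a cat": [0.0, 1.0, 0.1],
}
label, probs = zero_shot_classify(image_embedding, class_embeddings)
```

No classifier is trained here: adding a new class only requires embedding one more text description.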

BLIP, LLaVA, and other vision-language foundation models: Later models that integrate visual encoders with large language models for richer reasoning.

Applications to misinformation

  • Satire detection: Li et al. (2020) fine-tuned ViLBERT to detect satirical news from headline-image pairs and reported 93.8% accuracy, suggesting that early fusion and multimodal pre-training outperform simple feature concatenation.
  • Cross-modal mismatch detection: Identifying text-image pairs whose content is incongruent (e.g., a serious political claim paired with an absurd image), which can signal manipulation or satire.
  • Fact-checking: Joint visual and textual analysis to verify claims made in news articles or social media posts.
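As one possible illustration of cross-modal mismatch detection, the sketch below flags text-image pairs whose embeddings have low cosine similarity. This is a simple heuristic under stated assumptions, not the method of any cited paper: the embeddings are assumed to come from a shared embedding space, the vectors and post identifiers are made-up toy values, and the threshold of 0.3 and the function `flag_mismatches` are hypothetical. In practice a threshold would need to be tuned on labeled data.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def flag_mismatches(pairs, threshold=0.3):
    """Return (pair_id, score) for pairs whose similarity falls below threshold.

    pairs is an iterable of (pair_id, text_embedding, image_embedding).
    A low score only suggests incongruence; it is not proof of manipulation.
    """
    flagged = []
    for pair_id, text_emb, image_emb in pairs:
        score = cosine_similarity(text_emb, image_emb)
        if score < threshold:
            flagged.append((pair_id, score))
    return flagged

# Toy embeddings (illustrative values only).
posts = [
    ("post-1", [1.0, 0.0, 0.2], [0.9, 0.1, 0.1]),  # text and image agree
    ("post-2", [1.0, 0.0, 0.0], [0.0, 1.0, 0.1]),  # text and image diverge
]
flagged = flag_mismatches(posts)
```

Flagged pairs are candidates for further review rather than final verdicts, since satire and legitimate illustrative imagery can both produce low similarity scores.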

Key papers

Connections

Notes

Vision-language models benefit from large-scale pre-training on diverse image-text pairs (e.g., Conceptual Captions), which enables transfer to downstream tasks with limited labeled data. However, pre-trained models reflect the biases and knowledge in their training data; they may miss context-dependent satire (e.g., satirical articles that require political knowledge to recognize as fake). Fine-tuning on task-specific data is typically necessary for good performance on misinformation detection tasks.