MultiFC

MultiFC is presented as the largest publicly available dataset of naturally occurring factual claims for automatic claim verification. It consists of 34,918 English-language claims collected from 26 fact-checking websites, paired with rich metadata and evidence pages retrieved via Google Search.

Key Features

  • 34,918 naturally occurring claims from fact-checking websites (not artificially constructed)
  • 26 fact-checking domains spanning organizational diversity (news outlets, dedicated fact-checking sites, government agencies)
  • Rich metadata: claim text, labels, speakers, fact-checkers, claim dates, publication dates, categories
  • Evidence retrieval: Claims are matched with evidence pages via the Google Search API; Wikipedia, Snopes, and major news outlets are the most frequent sources
  • Entity linking: 25,763 unique entities linked to Wikipedia; 42% of claims contain linkable entities
  • Multi-domain labels: Domain-specific veracity label schemas (2–27 distinct labels per domain)

Dataset Statistics

  • Total claims: 34,918 (after deduplication and removal of claims whose label occurs fewer than 5 times)
  • Domains: 26 fact-checking websites with varying claim counts (20–2,943 per domain)
  • Entities: 25,763 unique entities; average 2.9 entities per claim (range 1–35)
  • Most frequent evidence sources: Wikipedia (4.43% of evidence pages), Snopes (3.99%), followed by news outlets (Washington Post, NYT, Guardian)
  • Label distribution: Varies dramatically by domain; global label heterogeneity necessitates multi-task learning approaches
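Because each domain uses its own label schema, a per-domain label count is a natural first check before modeling. The following is a minimal sketch on toy records; the `domain` and `label` keys and the site names are hypothetical and are not taken from the dataset's actual schema.

```python
from collections import Counter, defaultdict

# Toy records for illustration only. The "domain" and "label" keys are
# hypothetical field names, not the dataset's documented schema.
claims = [
    {"domain": "site_a", "label": "true"},
    {"domain": "site_a", "label": "false"},
    {"domain": "site_a", "label": "false"},
    {"domain": "site_b", "label": "pants on fire"},
    {"domain": "site_b", "label": "half-true"},
]


def label_distribution_by_domain(records):
    """Return a mapping {domain: Counter(label -> number of claims)}."""
    dist = defaultdict(Counter)
    for rec in records:
        dist[rec["domain"]][rec["label"]] += 1
    return dict(dist)


distribution = label_distribution_by_domain(claims)
for domain, counts in sorted(distribution.items()):
    # Each domain reports its own, possibly disjoint, set of labels.
    print(domain, dict(counts))
```

Keeping the counts separate per domain, rather than pooling all labels, mirrors the reason the labels are treated as domain-specific: a label such as "half-true" on one site need not mean the same thing as a similarly named label elsewhere.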

Format

Claims are provided as JSON with:

  • Claim ID and text
  • Veracity label (domain-specific)
  • Metadata (speaker, fact-checker, publication date, claim date, label reason, category)
  • Evidence pages (URLs, titles, snippets, full text)
  • Entity annotations (linked to Wikipedia)
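As an illustration of how such a record might be read, here is a small sketch using Python's standard `json` module. Every key name below (`claim_id`, `claim`, `label`, `metadata`, `evidence`, `entities`) and every value is an assumption made up for this example; check the downloaded files for the actual field names and layout.

```python
import json

# A made-up record. All key names and values are hypothetical and only
# mirror the kinds of fields listed above.
raw = """
{
  "claim_id": "example-00001",
  "claim": "An example claim text.",
  "label": "false",
  "metadata": {
    "speaker": "Jane Doe",
    "checker": "Example Checker",
    "publish_date": "2018-01-15",
    "claim_date": "2018-01-10",
    "reason": "Example justification.",
    "category": "politics"
  },
  "evidence": [
    {
      "url": "https://example.org/page",
      "title": "Example page",
      "snippet": "Short search-result snippet.",
      "text": "Full page text."
    }
  ],
  "entities": [
    {"mention": "Example", "wikipedia": "https://en.wikipedia.org/wiki/Example"}
  ]
}
"""


def summarize(record):
    """Extract a few fields, tolerating missing optional parts."""
    return {
        "id": record["claim_id"],
        "label": record["label"],
        "speaker": record.get("metadata", {}).get("speaker"),
        "n_evidence": len(record.get("evidence", [])),
        "n_entities": len(record.get("entities", [])),
    }


record = json.loads(raw)
summary = summarize(record)
print(summary)
```

Using `.get` with defaults for metadata, evidence, and entities reflects that not every claim necessarily has every field (for example, only a fraction of claims contain linkable entities).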

Benchmark Results

The best multi-task learning model achieves a Macro F1 of 49.2% across domains, 7.4 percentage points (about 17.7% relative) above the evidence-agnostic baseline of 41.8%. The remaining headroom indicates that the dataset is a challenging testbed for real-world claim verification.
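For readers unfamiliar with the metric, Macro F1 is the unweighted mean of per-label F1 scores, so rare labels count as much as frequent ones. The sketch below is a plain-Python illustration on toy labels; taking the label set as the union of gold and predicted labels is an assumption here, and the original evaluation script may differ in such details.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            # A label that is never predicted correctly contributes 0.
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with three labels.
gold = ["true", "false", "false", "mixed"]
pred = ["true", "false", "true", "false"]
score = macro_f1(gold, pred)
print(round(score, 4))
```

In this toy case the per-label F1 scores are 0.5 ("false"), 0.0 ("mixed"), and about 0.667 ("true"), so the never-predicted "mixed" label pulls the average down noticeably. This sensitivity to rare labels is part of why Macro F1 is a demanding metric when label distributions are skewed.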

Download

Available at https://github.com/copenlu/multifc

Introduced in: MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims