Skip to content
MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims

MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims

Authors: Isabelle Augenstein, Christina Lioma, Dongsheng Wang, Lucas Chaves Lima, Casper Hansen, Christian Hansen, Jakob Grue Simonsen

Affiliation: University of Copenhagen, Department of Computer Science

Venue: arXiv preprint, 2019 — arXiv:1909.03242

TL;DR

MultiFC is the largest publicly available real-world dataset of naturally occurring factual claims (34,918 claims) collected from 26 fact-checking websites, with rich metadata, evidence pages retrieved via Google Search, and entity linking to Wikipedia. The paper demonstrates that evidence ranking combined with veracity prediction significantly improves over evidence-agnostic baselines, achieving a best Macro F1 of 49.2%, and shows that multi-domain heterogeneity presents a challenging testbed for fact verification.

Contributions

  • Largest real-world fact-checking dataset: 34,918 naturally occurring claims collected from 26 fact-checking websites in English, substantially larger than prior datasets (prior largest was ~10,564)
  • Rich metadata and annotations: Each claim includes categorical metadata (reason for label, speaker, fact-checker, publication date, claim date), entity linking to Wikipedia (25,763 unique entities, 15,351 claims contain Wikipedia-linkable entities), and verbatim text extracted from fact-checking websites
  • Evidence retrieval pipeline: Evidence pages retrieved via Google Search API using claim text as query; resulting corpus includes Wikipedia, news outlets, government sites, and other authoritative sources
  • Multi-domain veracity labels: Claims span 26 fact-checking domains with heterogeneous label schemas (ranging from 2 to 27 distinct labels per domain), necessitating multi-task learning approaches
  • Joint evidence and veracity prediction: Novel multi-task learning approach jointly learns to rank evidence pages and predict claim veracity across domains with disparate label spaces, achieving Macro F1 of 49.2%
  • Comprehensive analysis: Dataset characterization identifying entity distributions, evidence source prevalence, domain-level label heterogeneity, and performance analysis with baseline models

Method

Data collection involves three stages:

Stage 1: Website Crawling. All active English-language fact-checking websites listed on Duke Reporters' Lab and the Fact Checking Wikipedia page were targeted for crawling. Ten websites could not be crawled due to protection (SSL/TLS, timeouts, paywalls, or crawling restrictions); crawling succeeded for 38 websites spanning 26 domains. Each domain is a specific fact-checking organization (e.g., PolitiFact, FactCheck.org), with some organizations managing multiple subdomains.

Stage 2: Claim and Evidence Extraction. From each crawled website, claims and associated metadata are extracted: claim text, claim ID, verification label, fact-checker identity, speaker, claim date, publication date, and claim category/reason for label. When available, hyperlinked text and full article text are captured. Claim text is submitted verbatim to Google Search API without quotes; the 10 most highly ranked results are retrieved as evidence pages, storing title, URL, timestamp, and full web page content.

Stage 3: Entity Linking. Named entities (persons, organizations, locations, other entities) within claims are recognized and linked to Wikipedia pages using the state-of-the-art neural entity linking model (Kolitsas et al., 2018). For claims with multiple references to the same entity, disambiguation via Wikipedia link context is applied.

Veracity prediction modeling employs multi-task learning (MTL) to handle domain heterogeneity:

  1. Base architecture: Deep neural network with shared hidden layer (h ∈ ℝ^d) and task-specific softmax output layers. For each domain's veracity prediction task, a task-specific weight matrix W^T_i ∈ ℝ^{L_i × h} and bias b^T_i ∈ ℝ^{L_i} define per-domain label distributions
  2. Input encoding: Claims encoded as concatenation of claim text representation and optional metadata (speaker, fact-checker, claim category, publication metadata)
  3. Evidence encoding: Evidence pages ranked using claim-evidence similarity; top-ranked evidence incorporated into claim representation via weighted average of evidence embeddings
  4. Multi-task training: Joint optimization across all domain-specific veracity tasks; label embedding layers project disparate label spaces into fixed-length embedding space, enabling knowledge transfer across domains

Results

Dataset statistics:

  • Total claims: 34,918 (after removing duplicates and claims occurring <5 times per label)
  • Domain coverage: 26 fact-checking websites (ranging from 20 to 2,943 claims per domain)
  • Entity statistics: 25,763 unique entities detected; 15,351 claims (42.0%) contain entities linkable to Wikipedia; average 2.9 entities per claim; maximum 35 entities in single claim
  • Evidence retrieval: Successfully retrieved evidence pages for majority of claims; corpus includes Wikipedia (4.425%), Snopes (3.992%), Washington Post (3.025%), New York Times (2.478%), and 30+ other sources
  • Label heterogeneity: Domains employ 2–27 distinct veracity labels (e.g., "correct/incorrect" vs. "true/mostly true/half true/misleading/mostly false/false/unsubstantiated")

Baseline model performance:

  • Evidence-agnostic baseline: Macro F1 = 41.8% across domains
  • Evidence ranking + veracity prediction: Macro F1 = 49.2% (17.4% relative improvement)
  • Performance variation: Domain-level F1 ranges from 30–75%, reflecting label heterogeneity and claim difficulty

Cross-domain analysis:

  • Multi-task learning with shared hidden representation outperforms domain-specific models, demonstrating benefit of learning across domains
  • Encoding metadata (speaker, fact-checker, reason for label) contributes to improved performance
  • Evidence pages assist veracity prediction; domains with longer evidence documents and more available evidence achieve higher accuracy

Connections

Notes

Strengths:

  • Largest real-world fact-checking dataset at publication (34,918 claims); naturally occurring claims from 26 diverse fact-checking sources provide broad coverage of misinformation phenomena
  • Rich metadata (entities, speakers, categories, temporal information) and evidence pages enable complex feature engineering and evidence-based approaches
  • Multi-domain nature with heterogeneous labels reflects real-world fact-checking practice and motivates transfer learning approaches
  • Comprehensive dataset analysis identifies key challenges: label heterogeneity, domain variation, and entity distribution
  • Entity linking to Wikipedia connects claims to structured knowledge, enabling knowledge-graph-based approaches

Weaknesses:

  • Label heterogeneity (2–27 labels per domain) complicates unified evaluation and model comparison; label remapping necessary for learning, potentially introducing error
  • Evidence retrieval via Google Search may miss relevant sources or introduce temporal bias; changes in search rankings over time affect reproducibility
  • Not all fact-checking websites provide uniform coverage or metadata; some domains under-represented
  • English-only coverage limits applicability to multilingual misinformation
  • Best model achieves Macro F1 of 49.2%, indicating substantial remaining difficulty; performance varies widely across domains
  • Evidence pages sometimes unavailable or off-topic; correlation between evidence availability and accuracy not fully characterized

Future directions:

  • Cross-lingual fact verification using multilingual sources
  • Temporal fact verification incorporating claim publication and fact-checking dates
  • Fine-grained veracity labels capturing nuance beyond binary true/false
  • Integration of knowledge graphs and structured knowledge bases
  • Real-time claim verification on emerging claims
  • Investigation of domain transfer and few-shot learning for low-resource fact-checking sites