Fake news detection datasets and benchmarks

A comprehensive fake news detection dataset must provide: (1) news content (text and/or images), (2) ground truth labels (verified fact checks or manual annotation), and (3) ideally, social context (user engagement, network information). Early datasets focused on news content alone; recent datasets integrate social context and spatiotemporal information, enabling research into how information spreads rather than just what makes content deceptive.

Major datasets

MultiFC (Augenstein et al. 2019)
- 34,918 naturally occurring claims from 26 fact-checking websites across diverse domains
- Rich metadata: entities, speakers, fact-checkers, temporal information
- Entity linking to Wikipedia (25,763 unique entities)
- Evidence pages retrieved via Google Search
- Multi-domain heterogeneous labels (2–27 distinct labels per domain)
- Demonstrates challenge of real-world multi-domain fact verification; best model achieves Macro F1 of 49.2%
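
The Macro F1 figure above averages per-class F1 without weighting by class frequency, so rare labels count as much as common ones. The sketch below shows that calculation on toy labels. It is a minimal illustration, not the MultiFC evaluation script, and the function name `macro_f1` and the example labels are assumptions made here.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three hypothetical labels
y_true = ["true", "true", "false", "mixed"]
y_pred = ["true", "false", "false", "false"]
print(round(macro_f1(y_true, y_pred), 3))
```

Because each class contributes equally, a model that ignores a rare label (here `mixed`) is penalized heavily, which is one reason heterogeneous multi-domain label sets yield low macro scores.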

Fakeddit (Nakamura, Levy, Wang 2019)
- 1,063,106 Reddit submissions from 22 subreddits (64% multimodal with text + image)
- Diverse content types: true, satire/parody, misleading content, imposter content, false connection, manipulated content
- Multiple label granularities: 2-way (fake/true), 3-way (true/misleading-true/false), 6-way (fine-grained)
- Includes metadata and comment data for social context research
- Distant supervision via subreddit labels; Cohen's Kappa = 0.54 on manual validation
- Demonstrates multimodal advantage (~10 pp improvement over text-only) and identifies satire/imposter as challenging categories
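
Cohen's Kappa, as reported above, corrects raw agreement between two label sequences for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two hypothetical annotation lists. Which pair of annotations the Fakeddit figure was computed over is not stated here, so the inputs are purely illustrative.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long label sequences over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: distant-supervision labels vs. a manual check
distant = ["fake", "fake", "true", "true"]
manual = ["fake", "true", "true", "true"]
print(cohens_kappa(distant, manual))  # prints 0.5
```

In this toy case the raters agree on 3 of 4 items (0.75), but half of that agreement is expected by chance, which gives a kappa of 0.5.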

FakeNewsNet (Shu et al. 2018)
- Two fact-checking sources: PolitiFact (12,911 articles) and GossipCop (22,140 articles)
- Includes news content, user engagement data, network structure, and spatiotemporal information
- Ground truth from journalist fact-checkers
- Multi-dimensional: linguistic, visual, social context, and temporal
- Enables research into detection, evolution, and mitigation

NELA-GT-2018 (Nørregaard, Horne & Adalı 2019)
- 713,534 articles from 194 sources collected Feb–Nov 2018
- Engagement-independent collection: scraped directly from source RSS feeds, not social media
- Source-level ground truth labels from 8 independent assessment sites (NewsGuard, Pew Research, Wikipedia, OpenSources, Media Bias/Fact Check, AllSides, BuzzFeed News, PolitiFact)
- Multi-dimensional: reliability, bias, transparency, journalistic standards, consumer trust
- Addresses gaps in prior datasets: large scale, diverse sources, non-engagement-driven, multi-dimensional labels
- Designed for distant-supervised learning (source labels as article-level proxies), semi-supervised learning, and longitudinal tactics analysis
- Predecessor to NELA-GT-2019 which extends to 1.12M articles from 260 sources in 2019
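
Distant supervision with source-level labels, as described above, amounts to copying each source's label onto every article it published. The sketch below shows that propagation step. The field names (`source`, `label`), the example domains, and the function name are assumptions for illustration and do not reflect the actual NELA-GT file layout.

```python
def propagate_source_labels(articles, source_labels):
    """Attach each article's source-level label as a noisy article-level proxy.

    articles:      list of dicts, each with a "source" key (hypothetical schema)
    source_labels: dict mapping source name -> label
    Returns (labeled_articles, number_skipped_for_missing_source_label).
    """
    labeled, skipped = [], 0
    for article in articles:
        label = source_labels.get(article["source"])
        if label is None:
            skipped += 1  # no assessment available for this source
            continue
        labeled.append({**article, "label": label})
    return labeled, skipped


# Hypothetical inputs
articles = [
    {"id": 1, "source": "a.example"},
    {"id": 2, "source": "unrated.example"},
]
source_labels = {"a.example": "reliable"}
labeled, skipped = propagate_source_labels(articles, source_labels)
print(labeled, skipped)
```

The resulting labels are noisy by construction: a generally reliable source can publish an inaccurate article and vice versa, so they serve as weak training signal and not as article-level ground truth.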

Other notable datasets:

LIAR (Wang 2017)
- 12,836 labeled short statements from PolitiFact spanning 2007–2016
- 6-way fine-grained labels: pants-fire, false, barely-true, half-true, mostly-true, true
- Rich metadata: speaker name, party affiliation, state, job, credit history (prior statement accuracy)
- Detailed fact-check justifications and links to supporting documents from PolitiFact
- Focus on statement-level fact-checking (not full articles); political domain emphasis
- Benchmark results show hybrid CNN integrating text + metadata outperforms text-only approaches

Other notable datasets (referenced in FakeNewsNet paper):
- BuzzFeedNews: news articles with visual content and engagement (small scale)
- FEVER (Thorne et al. 2018): 185,445 human-generated claims verified against Wikipedia; classification into SUPPORTED / REFUTED / NOT ENOUGH INFO with annotated evidence sentences; focus on evidence retrieval and textual entailment rather than social context. Also see the shared task paper reporting competition results.
- BS Detector: browser extension output
- CREDBANK: user-generated claims with credibility annotations
- BuzzFace: Facebook news with engagement but limited article coverage
- FacebookHoax: Facebook posts from 32 pages; fact-checked by human annotators

Dataset design considerations

Ground truth sources:
- Journalist/expert fact-checkers: reliable but limited in scale and currency
- Crowdsourced manual annotation: scalable but prone to disagreement and quality variance
- Platform labels (e.g., community notes, fact-check labels): real-world but platform-dependent
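
One common way to handle disagreement in crowdsourced annotation is majority voting with an agreement threshold, discarding items where annotators split too evenly. The sketch below is one such scheme under assumed settings. The 0.6 threshold and the function name are illustrative choices, not a procedure taken from any dataset listed here.

```python
from collections import Counter


def aggregate_votes(votes, min_agreement=0.6):
    """Majority-vote one item's crowd labels.

    Returns (label, agreement). The label is None when the share of
    annotators backing the top label falls below min_agreement.
    """
    if not votes:
        raise ValueError("votes must be non-empty")
    top_label, top_count = Counter(votes).most_common(1)[0]
    agreement = top_count / len(votes)
    if agreement < min_agreement:
        return None, agreement  # too contested to use as ground truth
    return top_label, agreement


print(aggregate_votes(["fake", "fake", "true"]))  # 2 of 3 annotators agree
print(aggregate_votes(["fake", "true"]))          # even split is rejected
```

With a threshold above 0.5, ties never produce a label, so the arbitrary tie-breaking order of `most_common` does not affect the result.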

Feature coverage:
- News content: text, images, metadata (headline, author, publish date, source URL)
- Social context: user engagement (likes, shares, replies), user profiles, network topology
- Spatiotemporal: location data, timestamps, temporal engagement patterns
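
The three feature groups above can be pictured as one record type with optional parts, which makes it easy to see what a given dataset does and does not provide. The sketch below is a hypothetical schema (Python 3.9+). All class and field names are assumptions for illustration and do not match any specific dataset's format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NewsContent:
    text: str
    headline: Optional[str] = None
    author: Optional[str] = None
    publish_date: Optional[str] = None  # ISO 8601 date string
    source_url: Optional[str] = None
    image_urls: list[str] = field(default_factory=list)


@dataclass
class SocialContext:
    likes: int = 0
    shares: int = 0
    replies: int = 0
    user_ids: list[str] = field(default_factory=list)
    # (follower, followee) pairs describing network topology
    follower_edges: list[tuple[str, str]] = field(default_factory=list)


@dataclass
class Spatiotemporal:
    location: Optional[str] = None
    engagement_timestamps: list[str] = field(default_factory=list)


@dataclass
class NewsRecord:
    content: NewsContent
    label: str
    social: Optional[SocialContext] = None
    spatiotemporal: Optional[Spatiotemporal] = None

    def coverage(self) -> set[str]:
        """Names of the feature groups this record actually carries."""
        groups = {"content"}
        if self.social is not None:
            groups.add("social")
        if self.spatiotemporal is not None:
            groups.add("spatiotemporal")
        return groups


# A content-only record, as in early datasets
record = NewsRecord(content=NewsContent(text="Example claim."), label="fake")
print(record.coverage())
```

Keeping social and spatiotemporal parts optional mirrors the split between content-only datasets and those that also carry engagement and propagation data.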

Scope and coverage:
- Domain: political news, entertainment, health, general
- Language: English-dominated; limited multilingual datasets
- Temporal coverage: archival datasets vs. ongoing collection
- Platform: Twitter-centric; limited Facebook, Instagram, or TikTok data

Connections

Content-based detection — content features from these datasets enable text/image analysis approaches.
Social-context detection — datasets with rich social signals enable propagation and network-based methods.
Fake news detection — benchmark datasets drive reproducibility and comparison of methods.