Datasets and benchmarks
Curated datasets of labeled news articles, social media posts, claims, and propagation graphs are foundational to misinformation research. They enable reproducible evaluation of detection algorithms, comparison across methods, and identification of dataset-specific biases and limitations.
This topic covers structured collections with explicit ground truth labels (real/fake, credible/non-credible, or stance judgments), often accompanied by metadata such as source information, publication date, user engagement metrics, and multimedia content. The quality and scope of these datasets determine which claims can credibly be made about the detection systems trained and evaluated on them.
Key papers
- Horne et al. (2018) — NELA2017: Large-scale news source characterization dataset (over 136K articles from 92 diverse sources, including mainstream, hyper-partisan, satire, and known misinformation sources). Introduces 130 content-based features (linguistic, sentiment, engagement, bias, morality) that enable comparative analysis of source behavior. Foundation for the subsequent NELA-GT releases (GT-2018, GT-2019, GT-2022).
- Wang (2017) — LIAR: The foundational claim-level misinformation dataset, with about 12,800 fact-checked political statements from PolitiFact labeled on a 6-point veracity scale (pants-fire, false, barely-true, half-true, mostly-true, true). Established the paradigm for supervised detection and introduced fine-grained veracity assessment.
- Shu et al. (2018) — FakeNewsNet: Largest news-level fake news dataset at the time (27,528 articles from 50 fake news outlets and 27,500 from reliable outlets), with social propagation graphs from Twitter (over 500K tweet engagements). Pioneered graph-based detection and revealed power-law retweet distributions in fake news spread.
- Nakamura et al. (2019) — Fakeddit: Multimodal fake news dataset (663K posts) from Reddit with image and text, enabling research on visual misinformation. Posts labeled as real/fake/satire; shows that image-based signals can outperform text.
- Cui & Lee (2020) — CoAID: COVID-19 healthcare misinformation dataset combining news articles, claims, tweets, and platform posts (4,251 articles, 28 false/454 true claims, 926 posts, 296K engagements). Bridges diverse information types and social context in a single resource.
- Patwa et al. (2020) — Fighting an Infodemic: COVID-19 Fake News Dataset: 10,700 annotated social media posts and articles (5,600 real from verified Twitter sources, 5,100 fake from fact-checking websites), with classes balanced across splits. Benchmarks four ML baselines, the best of which (SVM) reaches a 93.32% F1-score; code and data are publicly available.
- Yang et al. (2020) — CHECKED: First Chinese-language COVID-19 fake news dataset (2,104 Weibo microblogs with per-item expert labels, multimedia, full propagation graphs). Demonstrates dataset design for non-English, social-platform-native content.
- Li et al. (2020) — MM-COVID: Multilingual COVID-19 fake news dataset (3,981 fake and 7,192 trustworthy news items in 6 languages) with associated social context, enabling cross-lingual transfer learning research.
- Thorne et al. (2018) — FEVER: Fact verification dataset (185K claims checked against a corpus of 5.4M Wikipedia pages) establishing the task of evidence retrieval and entailment for claim verification. Shifted the paradigm from binary classification to multi-step reasoning.
- Thorne et al. (2018) — FEVER shared task: Annual competition and benchmark that drove rapid progress in evidence-based fact verification, spawning FEVER 2.0 with adversarial claims designed to fool systems.
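Fine-grained veracity scales such as LIAR's are often collapsed to binary labels so that results can be compared with real/fake datasets. The following is a minimal sketch of that mapping, assuming the six label strings used in the public LIAR release; the function names and the choice of "half-true" as the cut point are illustrative assumptions, not something the dataset prescribes.

```python
# Six LIAR veracity labels, ordered from least to most truthful.
LIAR_LABELS = [
    "pants-fire",
    "false",
    "barely-true",
    "half-true",
    "mostly-true",
    "true",
]


def to_ordinal(label: str) -> int:
    """Return the position of a label on the 0..5 veracity scale."""
    return LIAR_LABELS.index(label)


def to_binary(label: str, threshold: str = "half-true") -> str:
    """Collapse a six-way label to real/fake.

    The threshold is an assumption: labels at or above it count as "real".
    Different studies draw this line in different places.
    """
    return "real" if to_ordinal(label) >= to_ordinal(threshold) else "fake"


if __name__ == "__main__":
    for label in LIAR_LABELS:
        print(label, to_ordinal(label), to_binary(label))
```

Because the cut point changes the class balance, it is worth reporting which threshold was used whenever binary results on a multi-point dataset are compared across papers.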
Related topics
- COVID-19 misinformation has spawned multiple datasets (CoAID, ReCOVery, CHECKED, MM-COVID) due to the urgent need for domain-specific evaluation.
- Fake news detection and misinformation detection rely heavily on these datasets for both training and evaluation.
- Social-context detection uses propagation graphs and user engagement data from datasets like FakeNewsNet and CoAID.
- Multimodal detection exploits datasets with image/video content (Fakeddit, CHECKED, MM-COVID).
- Credibility assessment is operationalized via these datasets with various labeling schemes (binary real/fake, multi-point veracity scales, publisher-level labels).
Notes
Dataset design tensions:
- Granularity: Claim-level labels (LIAR, FEVER) enable fine-grained evaluation but require manual annotation; article-level labels (FakeNewsNet, CoAID) scale better but obscure nuance.
- Source selection: Labeling articles by the outlet that published them (FakeNewsNet) risks teaching a model to recognize outlets rather than content-level falsity; CoAID mitigates this by mixing sources and engagement types.
- Temporal drift: Datasets released in 2017 (LIAR) may not reflect the evolving tactics, narrative framing, and platform affordances of 2024. COVID-19 datasets reveal rapid narrative shifts (e.g., "5G spreads COVID" peaked and vanished within weeks).
- Language and geography: Most large benchmarks are English and US-centric; CHECKED and MM-COVID begin to address this gap.
- Class imbalance: Real news vastly outnumbers fake news in natural distributions (>10:1 in CoAID), which makes detection a skewed classification problem. Whether to correct for this at the dataset level (resampling) or at the method level (imbalance-robust training) remains an open tension.
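As one illustration of a method-level response to class imbalance, the sketch below computes inverse-frequency class weights. It uses the CoAID claim counts quoted above (28 false, 454 true) purely as example input; the function name is hypothetical, and the formula is the common "balanced" heuristic n_samples / (n_classes * class_count), not something taken from the CoAID paper.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1, so that
    every class contributes equally to a weighted loss in total.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Example input: claim counts quoted for CoAID (28 false, 454 true).
labels = ["false"] * 28 + ["true"] * 454
weights = inverse_frequency_weights(labels)

if __name__ == "__main__":
    for cls, w in sorted(weights.items()):
        print(f"{cls}: {w:.3f}")
```

Weighting leaves the data untouched, which keeps evaluation on the natural distribution straightforward; resampling changes what the model sees and should be applied to the training split only.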
Dataset reuse: The most useful datasets (LIAR, FakeNewsNet, CoAID) have enabled hundreds of downstream studies and become de facto standards for comparison. However, this reuse can create a "dataset bias" problem where methods overfit to specific label artifacts or source biases present in one dataset and fail to generalize.
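One way to check whether a method has latched onto source-specific artifacts is to evaluate on outlets it never saw during training. The sketch below shows a source-disjoint train/test split. The record layout (dicts with a "source" key), the function name, and the toy data are assumptions for illustration and do not correspond to the schema of any dataset named above.

```python
import random
from collections import defaultdict


def source_disjoint_split(records, test_fraction=0.2, seed=0):
    """Split records so that no source appears in both train and test.

    Whole sources are assigned to one side, so the test fraction applies
    to the number of sources rather than the number of articles.
    """
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)

    sources = sorted(by_source)            # sort first for reproducibility
    random.Random(seed).shuffle(sources)   # seeded, independent of global state
    n_test = max(1, round(len(sources) * test_fraction))
    test_sources = set(sources[:n_test])

    train = [r for s in sources if s not in test_sources for r in by_source[s]]
    test = [r for s in sources if s in test_sources for r in by_source[s]]
    return train, test


# Hypothetical toy data: 5 outlets with 3 articles each.
toy_records = [
    {"source": f"outlet_{i}", "text": f"article {j}", "label": "fake" if i % 2 else "real"}
    for i in range(5)
    for j in range(3)
]

if __name__ == "__main__":
    train, test = source_disjoint_split(toy_records)
    print(len(train), "train /", len(test), "test")
    print("held-out sources:", sorted({r["source"] for r in test}))
```

A large gap between accuracy on a random split and accuracy on a source-disjoint split suggests the model is relying on outlet identity rather than content.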