Benchmark datasets and evaluations
Benchmark datasets are curated, labeled corpora released to enable reproducible and comparable evaluation of computational methods. In misinformation research, benchmarks serve three functions: they (1) provide the ground-truth labels needed for supervised learning, (2) fix standard train/test splits so that results can be compared directly across papers, and (3) establish shared evaluation metrics so that findings can be aggregated across studies.
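As a concrete illustration of functions (2) and (3), here is a minimal Python sketch of two metrics commonly reported on a fixed test split: accuracy and macro-averaged F1. Choosing these two metrics is an assumption made for illustration. Each benchmark defines its own protocol; FEVER, for example, also reports a FEVER score that requires retrieving correct evidence. The usage labels come from LIAR's label set, but the predictions are invented toy values, not reported results.

```python
from collections import Counter
from typing import Sequence


def _check(gold: Sequence[str], pred: Sequence[str]) -> None:
    # Shared metrics only make sense when both lists cover the same test items.
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("empty evaluation set")


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of test items whose predicted label equals the gold label."""
    _check(gold, pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    _check(gold, pred)
    labels = sorted(set(gold) | set(pred))
    true_pos = Counter(g for g, p in zip(gold, pred) if g == p)
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    scores = []
    for label in labels:
        precision = true_pos[label] / pred_counts[label] if pred_counts[label] else 0.0
        recall = true_pos[label] / gold_counts[label] if gold_counts[label] else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy gold labels and predictions for a four-item "test split".
    gold = ["true", "false", "half-true", "false"]
    pred = ["true", "half-true", "half-true", "false"]
    print(accuracy(gold, pred))             # 0.75
    print(f"{macro_f1(gold, pred):.3f}")    # 0.778
```

Macro-F1 weights every class equally, so on a multi-class benchmark it penalizes a model that ignores rare labels more than accuracy does; reporting both on the same fixed split is what makes numbers from different papers comparable.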
Effective benchmarks for fake news detection must balance scale (enough data to train deep learning models), label quality (reliable human or expert annotation), and relevance (authentic content from realistic contexts rather than crowdsourced simulations). The move from crowdsourced deceptive reviews to authentic political statements, exemplified by the LIAR dataset, marked a shift toward ecological validity.
Key papers and datasets
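- Wang (2017), "Liar, Liar Pants on Fire": 12,836 short political statements from the fact-checking site PolitiFact, each rated on a six-level truthfulness scale from "pants-fire" to "true"; foundational benchmark for statement-level fact-checking
- Shu et al. (2018), FakeNewsNet: multi-source benchmark pairing news articles labeled by PolitiFact and GossipCop with social context (user engagement on Twitter) and spatiotemporal information
- Thorne et al. (2018), FEVER (Fact Extraction and VERification): 185,445 claims derived from Wikipedia sentences and verified by human annotators as Supported, Refuted, or NotEnoughInfo, with Wikipedia evidence sentences annotated for the first two classes; focuses on evidence retrieval and entailment rather than credibility assessment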
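The LIAR and FEVER entries differ in what a label attaches to: a LIAR rating judges a statement on its own, while a FEVER verdict is tied to specific evidence sentences. The minimal Python sketch below makes that contrast explicit. The label strings are intended to mirror those used in the released data files, but the class and field names (`LiarExample`, `FeverExample`, `evidence`) are hypothetical simplifications, not the official schemas. In particular, the real LIAR files carry extra speaker and context metadata, and FEVER stores evidence in a richer nested format.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class LiarLabel(Enum):
    # LIAR's six truthfulness ratings, ordered from least to most truthful.
    PANTS_FIRE = "pants-fire"
    FALSE = "false"
    BARELY_TRUE = "barely-true"
    HALF_TRUE = "half-true"
    MOSTLY_TRUE = "mostly-true"
    TRUE = "true"


class FeverLabel(Enum):
    # FEVER's three verdicts for a claim checked against Wikipedia.
    SUPPORTS = "SUPPORTS"
    REFUTES = "REFUTES"
    NOT_ENOUGH_INFO = "NOT ENOUGH INFO"


@dataclass(frozen=True)
class LiarExample:
    # Hypothetical, simplified record: the statement is judged on its own.
    statement: str
    label: LiarLabel


@dataclass(frozen=True)
class FeverExample:
    # Hypothetical, simplified record: the verdict must be grounded in evidence.
    claim: str
    label: FeverLabel
    # (Wikipedia page title, sentence index) pairs; left empty for NOT ENOUGH INFO.
    evidence: tuple[tuple[str, int], ...] = ()

    def __post_init__(self) -> None:
        # Sketch-level check: SUPPORTS/REFUTES verdicts should cite evidence.
        if self.label is not FeverLabel.NOT_ENOUGH_INFO and not self.evidence:
            raise ValueError("SUPPORTS/REFUTES claims must cite evidence")


if __name__ == "__main__":
    # Illustrative records only; not drawn from either dataset.
    liar = LiarExample("An example political statement.", LiarLabel.HALF_TRUE)
    fever = FeverExample("An example claim.", FeverLabel.SUPPORTS, (("Example_page", 0),))
    print(liar.label.value, fever.label.value, len(fever.evidence))  # half-true SUPPORTS 1
```

Encoding the evidence requirement in the FEVER record, and not in the LIAR one, reflects why FEVER is framed as evidence retrieval plus entailment rather than standalone credibility assessment.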
Related topics
- Fake news detection: the detection methods that are evaluated on these benchmarks
- Fake news detection datasets