NELA-GT-2022: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles¶
Authors: Maurício Gruppi, Benjamin D. Horne, Sibel Adalı
Venue: arXiv, 2203.05659 (March 2023)
TL;DR¶
Presents the fifth NELA-GT dataset release: 1.78 million news articles from 361 outlets spanning all of 2022, with source-level veracity labels from Media Bias/Fact Check and embedded tweet data. Focuses on stabilizing collection infrastructure to achieve near-complete temporal coverage, enabling robust benchmarking of fake news detection methods and longitudinal analysis of media coverage across high- and low-credibility sources.
Contributions¶
- Large-scale longitudinal dataset — 1,778,361 articles from 361 sources with complete daily coverage throughout 2022 (estimated minimal missing data)
- Multi-dimensional ground truth — source-level factuality scores (0–5) and conspiracy/pseudoscience classification from Media Bias/Fact Check, with 337 of 361 outlets labeled
- Embedded social context — 346,283 distinct tweets embedded in articles, enabling hybrid-media analysis
- Event-based analysis subsets — pre-filtered collections for Russo-Ukrainian War and Roe v. Wade overturning with keyword lists
- Public-domain release — SQLite and JSON formats with Python tools for data access and extraction
Method¶
RSS feeds from 361 news outlets were scraped twice daily throughout 2022 using feedparser and Goose3 libraries. Outlet list inherited from prior releases. Collected metadata: publication date, content, author, URL, normalized source name. Embedded tweets extracted from article HTML via Goose3 and stored with article ID and tweet text. Media Bias/Fact Check labels applied at source level (no per-article labels, unlike prior gold standards).
Text transformation applied to prevent journalistic re-use while preserving analytical utility: articles >200 tokens have 7 tokens replaced with '@' per 100 tokens; articles ≤200 tokens have 5 consecutive tokens replaced per 20 tokens. Retains approximately 93% of content for analysis.
Results¶
Coverage: Weekly article distribution shows consistent collection (35–40K articles/week) with peaks during major events (Russo-Ukrainian War, Roe v. Wade coverage). Reliable sources contribute ~6–8K articles/week consistently; mixed ~6–8K/week; unreliable sources ~12–15K/week (higher volume of lower-credibility outlets).
Source distribution:
- Reliable: ~90 outlets
- Mixed: ~70 outlets
- Unreliable: ~90 outlets
- Unlabeled: 24 outlets
Embedded tweets: 346,283 distinct tweets with similar temporal patterns to articles (peaks aligned with major events).
Connections¶
- Extends prior NELA-GT-2018, NELA-GT-2019, and NELA-GT-2020 releases, providing 5.5+ years of longitudinal news coverage combined with all prior versions
- Complements FakeNewsNet (article-level labels) and ReCOVery (multimodal COVID-19 data) for benchmarking
- Enables longitudinal study of media dynamics across polarization, election interference, and coordinated campaigns
- Useful for content-based and propagation-based detection research via embedded social signals
Notes¶
The dataset's strength lies in temporal consistency and scale (1.78M articles across the full calendar year) rather than novel label types. Prior NELA releases introduced embedded tweets and MBFC labels; this release stabilizes collection to minimize gaps. The intentional text obfuscation is a significant limitation for content-based NLP approaches—researchers should test robustness to this preprocessing. Event-based subsets (keyword-filtered for Ukraine/Roe v. Wade) are useful for case studies but may introduce keyword-bias artifacts. The absence of per-article ground truth labels (only source-level) limits fine-grained fact-checking or article-level veracity research compared to LIAR or FEVER. Suitable for propagation analysis, event-driven comparative analysis, and robust ML evaluation over time.