Skip to content
NELA-GT-2022: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles

NELA-GT-2022: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles

Authors: Maurício Gruppi, Benjamin D. Horne, Sibel Adalı

Venue: arXiv, 2203.05659 (March 2023)

TL;DR

Presents the fifth NELA-GT dataset release: 1.78 million news articles from 361 outlets spanning all of 2022, with source-level veracity labels from Media Bias/Fact Check and embedded tweet data. Focuses on stabilizing collection infrastructure to achieve near-complete temporal coverage, enabling robust benchmarking of fake news detection methods and longitudinal analysis of media coverage across high- and low-credibility sources.

Contributions

  • Large-scale longitudinal dataset — 1,778,361 articles from 361 sources with complete daily coverage throughout 2022 (estimated minimal missing data)
  • Multi-dimensional ground truth — source-level factuality scores (0–5) and conspiracy/pseudoscience classification from Media Bias/Fact Check, with 337 of 361 outlets labeled
  • Embedded social context — 346,283 distinct tweets embedded in articles, enabling hybrid-media analysis
  • Event-based analysis subsets — pre-filtered collections for Russo-Ukrainian War and Roe v. Wade overturning with keyword lists
  • Public-domain release — SQLite and JSON formats with Python tools for data access and extraction

Method

RSS feeds from 361 news outlets were scraped twice daily throughout 2022 using feedparser and Goose3 libraries. Outlet list inherited from prior releases. Collected metadata: publication date, content, author, URL, normalized source name. Embedded tweets extracted from article HTML via Goose3 and stored with article ID and tweet text. Media Bias/Fact Check labels applied at source level (no per-article labels, unlike prior gold standards).

Text transformation applied to prevent journalistic re-use while preserving analytical utility: articles >200 tokens have 7 tokens replaced with '@' per 100 tokens; articles ≤200 tokens have 5 consecutive tokens replaced per 20 tokens. Retains approximately 93% of content for analysis.

Results

Coverage: Weekly article distribution shows consistent collection (35–40K articles/week) with peaks during major events (Russo-Ukrainian War, Roe v. Wade coverage). Reliable sources contribute ~6–8K articles/week consistently; mixed ~6–8K/week; unreliable sources ~12–15K/week (higher volume of lower-credibility outlets).

Source distribution: - Reliable: ~90 outlets - Mixed: ~70 outlets
- Unreliable: ~90 outlets - Unlabeled: 24 outlets

Embedded tweets: 346,283 distinct tweets with similar temporal patterns to articles (peaks aligned with major events).

Connections

Notes

The dataset's strength lies in temporal consistency and scale (1.78M articles across the full calendar year) rather than novel label types. Prior NELA releases introduced embedded tweets and MBFC labels; this release stabilizes collection to minimize gaps. The intentional text obfuscation is a significant limitation for content-based NLP approaches—researchers should test robustness to this preprocessing. Event-based subsets (keyword-filtered for Ukraine/Roe v. Wade) are useful for case studies but may introduce keyword-bias artifacts. The absence of per-article ground truth labels (only source-level) limits fine-grained fact-checking or article-level veracity research compared to LIAR or FEVER. Suitable for propagation analysis, event-driven comparative analysis, and robust ML evaluation over time.