NELA-GT-2022
Overview: 1,778,361 news articles from 361 outlets collected throughout 2022, with outlet-level veracity labels from Media Bias/Fact Check and metadata on 346,283 embedded tweets.
Collection & Methodology
RSS feeds from 361 news outlets were scraped twice daily throughout 2022 using feedparser and Goose3. Articles include standard fields (title, body, author, publication metadata, URL) plus Media Bias/Fact Check source-level credibility labels. This is the fifth release in the NELA-GT series (following 2018, 2019, 2020, and 2021). It focuses on stabilizing the collection infrastructure to achieve near-complete coverage across the full year rather than on adding new feature types.
Ground Truth Labels
Source-level veracity labels from Media Bias/Fact Check include:
- Factuality score — 0–5 scale (low to high credibility), with 337 of 361 outlets labeled
- Conspiracy/Pseudoscience classification — sources in these categories are treated as low credibility
- Aggregated reliability — three-class breakdown (reliable, mixed, unreliable)
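To illustrate how the three kinds of label relate, here is a minimal sketch that derives a three-class label from a factuality score and a conspiracy/pseudoscience flag. The function name `aggregate_label`, the cut-offs `low_max` and `mixed_max`, and the string label names are assumptions made for illustration. They are not the mapping documented by Media Bias/Fact Check or by the dataset authors, so check the released label files before relying on any particular threshold.

```python
from typing import Optional

# Illustrative label names; the encoding in the released label files may differ.
RELIABLE, MIXED, UNRELIABLE = "reliable", "mixed", "unreliable"


def aggregate_label(
    factuality: Optional[int],
    conspiracy_pseudoscience: bool,
    low_max: int = 1,    # assumed cut-off: scores <= low_max count as unreliable
    mixed_max: int = 2,  # assumed cut-off: scores <= mixed_max count as mixed
) -> Optional[str]:
    """Collapse source-level labels into one of three classes (hypothetical rule)."""
    # Conspiracy/pseudoscience sources are treated as low credibility regardless of score.
    if conspiracy_pseudoscience:
        return UNRELIABLE
    # Some outlets have no factuality score; leave those unlabeled.
    if factuality is None:
        return None
    if not 0 <= factuality <= 5:
        raise ValueError("factuality must be on the 0-5 scale")
    if factuality <= low_max:
        return UNRELIABLE
    if factuality <= mixed_max:
        return MIXED
    return RELIABLE
```

Keeping the cut-offs as parameters makes it easy to test how sensitive a downstream classifier is to where the class boundaries are drawn.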
Embedded Tweets
Continuing the practice introduced in the 2020 release, the dataset includes 346,283 distinct tweets embedded in news articles, stored in a structured format. This enables research into hybrid media systems and the role of low-veracity sources in amplifying social-media content.
Data Format
Released in two formats:
- SQLite database — tables for newsdata (article metadata) and tweet (embedded tweet data) with primary keys and foreign key relationships
- JSON — one JSON file per source containing all articles from that outlet
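As a rough sketch of how the SQLite release might be queried, the example below counts articles and embedded tweets per outlet with Python's built-in `sqlite3` module. Only the table names `newsdata` and `tweet` come from the description above. The column names (`id`, `source`, `title`, `content`, `article_id`, `embedded_tweet`) and the helper `build_demo_db` are assumptions used to create a tiny in-memory stand-in, so inspect the real schema before adapting the queries.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with an ASSUMED schema for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE newsdata (id TEXT PRIMARY KEY, source TEXT, title TEXT, content TEXT)"
    )
    conn.execute(
        "CREATE TABLE tweet (id TEXT PRIMARY KEY, "
        "article_id TEXT REFERENCES newsdata(id), embedded_tweet TEXT)"
    )
    conn.executemany(
        "INSERT INTO newsdata VALUES (?, ?, ?, ?)",
        [
            ("a1", "outlet_a", "Title 1", "Body 1"),
            ("a2", "outlet_a", "Title 2", "Body 2"),
            ("b1", "outlet_b", "Title 3", "Body 3"),
        ],
    )
    conn.executemany(
        "INSERT INTO tweet VALUES (?, ?, ?)",
        [("t1", "a1", "tweet text 1"), ("t2", "a1", "tweet text 2"), ("t3", "b1", "tweet text 3")],
    )
    return conn


def articles_per_source(conn: sqlite3.Connection) -> list:
    """Return (source, article_count) pairs, largest outlets first."""
    query = (
        "SELECT source, COUNT(*) AS n FROM newsdata "
        "GROUP BY source ORDER BY n DESC, source ASC"
    )
    return conn.execute(query).fetchall()


def tweets_per_source(conn: sqlite3.Connection) -> list:
    """Join embedded tweets to their articles and count tweets per outlet."""
    query = (
        "SELECT n.source, COUNT(t.id) AS n_tweets "
        "FROM newsdata AS n JOIN tweet AS t ON t.article_id = n.id "
        "GROUP BY n.source ORDER BY n_tweets DESC, n.source ASC"
    )
    return conn.execute(query).fetchall()
```

With the real release, `build_demo_db()` would be replaced by `sqlite3.connect(path_to_database)`, where the path points at the downloaded database file.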
Use Cases
- Event-driven analysis — provided subsets for two major 2022 events: the Russo-Ukrainian War and the overturning of Roe v. Wade
- Robust ML evaluation — enables robustness checks over time, across events, and with mixed veracity labels
- Media manipulation research — study how unreliable outlets amplify disinformation and engage in coordinated behavior
- Long-term trends — combined with prior NELA releases (2018–2021), supports about five years (2018–2022) of consistent cross-outlet news coverage
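One way to set up the robustness checks over time mentioned above is a temporal split: train on articles published before a cutoff date and evaluate on those published on or after it. The sketch below assumes each article is a dictionary with an ISO-formatted `date` field. That field name and the helper `temporal_split` are illustrative, not part of the dataset's documented interface.

```python
from datetime import date


def temporal_split(articles: list, cutoff: date) -> tuple:
    """Split articles into (before_cutoff, on_or_after_cutoff) by publication date."""
    train, test = [], []
    for article in articles:
        # ASSUMPTION: "date" holds an ISO string such as "2022-06-24".
        published = date.fromisoformat(article["date"])
        (train if published < cutoff else test).append(article)
    return train, test


# Toy records standing in for real articles.
sample = [
    {"id": "a1", "date": "2022-02-20"},
    {"id": "a2", "date": "2022-06-24"},
    {"id": "a3", "date": "2022-11-01"},
]
train, test = temporal_split(sample, cutoff=date(2022, 6, 24))
```

Choosing the cutoff at the start of an event window is a simple way to test whether a model trained on earlier coverage holds up on later, event-specific reporting.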
Limitations
Article text is intentionally altered so that the dataset cannot be used as a substitute for reading the news: 7 of every 100 tokens are replaced with '@' in large articles, and 5 of every 20 in smaller ones. The ≈93% retention figure corresponds to the 7-per-100 rate. This limits copyright misuse while still allowing text analysis for misinformation research.
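The effect of the masking on a token sequence can be sketched as follows. This is an illustration under stated assumptions, not the authors' released procedure: the helper names are hypothetical, and placing the masked run at the start of each block is a guess, since the description above gives only the rates.

```python
def mask_tokens(tokens: list, block_size: int = 100, n_masked: int = 7, mask: str = "@") -> list:
    """Replace n_masked tokens in every block of block_size tokens with the mask symbol."""
    masked = list(tokens)
    for start in range(0, len(masked), block_size):
        # ASSUMPTION: the masked run sits at the start of each block.
        for i in range(start, min(start + n_masked, len(masked))):
            masked[i] = mask
    return masked


def retained_fraction(tokens: list, mask: str = "@") -> float:
    """Share of tokens that are not the mask symbol."""
    if not tokens:
        return 1.0
    return sum(token != mask for token in tokens) / len(tokens)


# A 300-token toy article masked at the 7-per-100 rate keeps 279 of 300 tokens.
article = [f"w{i}" for i in range(300)]
masked_article = mask_tokens(article)
```

When analyzing the released text, it is usually worth dropping or specially handling the '@' tokens so they do not distort vocabulary statistics or n-gram features.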
Links
- Dataset: https://doi.org/10.7910/DVN/AMCV2H
- Code repository: https://github.com/MELALab/nela-gt
- Related papers: Gruppi, Horne & Adalı (2022)