NELA-GT-2019
Creators: Maurício Gruppi, Benjamin D. Horne, Sibel Adalı — Rensselaer Polytechnic Institute
Citation: [[2020-gruppi-nela-gt-2019|Gruppi, Horne & Adalı (2020)]]
DOI: https://doi.org/10.7910/DVN/O7FWPO
Overview
NELA-GT-2019 is a large multi-labelled news dataset comprising 1.12M news articles from 260 sources, collected between January 1 and December 31, 2019. It succeeds NELA-GT-2018, adding 66 sources and providing source-level ground truth labels from seven assessment sites.
Data
Size: 1.12M articles from 260 news sources
Collection period: 2019 (full calendar year)
Language: English
Collection method: RSS feeds scraped twice daily using the feedparser and goose libraries
Source categories:
- Mainstream news outlets (left, center, right bias)
- Alternative news sources
- Conspiracy/pseudoscience sites (66 newly added)
Temporal coverage: Weekly article counts are relatively stable across the year, with some seasonal variation
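The weekly aggregation mentioned above can be reproduced roughly as follows. This is a minimal sketch, not the authors' code: it assumes `published_utc` holds Unix timestamps in seconds, and `weekly_counts` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime, timezone


def weekly_counts(published_utc_values):
    """Count articles per ISO (year, week).

    Assumption: each value is a Unix timestamp in seconds (UTC).
    """
    counts = Counter()
    for ts in published_utc_values:
        dt = datetime.fromtimestamp(int(ts), tz=timezone.utc)
        iso = dt.isocalendar()  # (ISO year, ISO week, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    # Return a plain dict ordered by (year, week) for easy plotting.
    return dict(sorted(counts.items()))
```

Passing the `published_utc` column of the article table to this function yields one count per ISO week, which can then be plotted to inspect collection stability.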
Labels
Source-level ground truth labels from seven assessment sites:
- Media Bias/Fact Check (MBFC) — political leaning and factual reporting score
- Pew Research Center
- Wikipedia
- OpenSources (discontinued; labels carried forward)
- AllSides — media bias assessments
- BuzzFeed News (no longer updated; labels carried forward)
- PolitiFact
Aggregated label
A 3-class source-level label computed from MBFC data:
- Unreliable: sources flagged as conspiracy/pseudoscience OR with low/very-low factual reporting
- Mixed: sources with a mixed factual reporting score
- Reliable: sources with high/very-high factual reporting
Label coverage: 79% of sources have at least one label; 76% have MBFC labels.
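The aggregation rule and the coverage figure can be illustrated with a short sketch. The function names, argument names, and label strings below are hypothetical and chosen for illustration; the dataset's actual label files may encode MBFC fields differently.

```python
def aggregate_label(factual_reporting=None, is_conspiracy_pseudoscience=False):
    """Map MBFC-style fields to the 3-class label described above.

    Assumptions: `factual_reporting` is a string such as "very-low" or "High",
    and the conspiracy/pseudoscience flag is a boolean. Returns None when the
    source has no usable MBFC information.
    """
    if is_conspiracy_pseudoscience:
        return "unreliable"
    if factual_reporting is None:
        return None
    level = factual_reporting.strip().lower().replace("_", " ").replace("-", " ")
    if level in {"low", "very low"}:
        return "unreliable"
    if level == "mixed":
        return "mixed"
    if level in {"high", "very high"}:
        return "reliable"
    return None  # any other score is not covered by the rule as stated


def coverage(labels):
    """Fraction of sources whose label is not None."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label is not None) / len(labels)
```

Note that the conspiracy/pseudoscience flag takes precedence over the factual reporting score, matching the "OR" in the Unreliable definition.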
Format
SQLite database: Single table (newsdata) with one row per article
JSON format: One JSON file per source containing list of articles
Columns: id, date, source, title, content, author, published, published_utc, collection_utc
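A minimal sketch of reading both formats with the Python standard library. The table name `newsdata` and the column names come from the description above; the column types in the demo schema, the demo rows, and the helper names are assumptions made for illustration.

```python
import json
import sqlite3


def articles_per_source(conn, limit=10):
    """Return (source, article_count) pairs, largest sources first."""
    query = (
        "SELECT source, COUNT(*) AS n FROM newsdata "
        "GROUP BY source ORDER BY n DESC, source ASC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()


def load_source_json(path):
    """Load one per-source JSON file, expected to hold a list of articles."""
    with open(path, encoding="utf-8") as fh:
        articles = json.load(fh)
    if not isinstance(articles, list):
        raise ValueError("expected a JSON list of article objects")
    return articles


def make_demo_db():
    """Build a tiny in-memory stand-in for the real database (types assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE newsdata (id TEXT PRIMARY KEY, date TEXT, source TEXT, "
        "title TEXT, content TEXT, author TEXT, published TEXT, "
        "published_utc INTEGER, collection_utc INTEGER)"
    )
    rows = [
        ("a1", "2019-01-01", "examplenews", "T1", "body", "A",
         "2019-01-01 00:00:00", 1546300800, 1546304400),
        ("a2", "2019-01-02", "examplenews", "T2", "body", "A",
         "2019-01-02 00:00:00", 1546387200, 1546390800),
        ("b1", "2019-01-02", "otherdaily", "T3", "body", "B",
         "2019-01-02 00:00:00", 1546387200, 1546390800),
    ]
    conn.executemany("INSERT INTO newsdata VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn
```

Against the real data, `sqlite3.connect` would be pointed at the downloaded database file instead of `:memory:`; the query itself should carry over unchanged if the schema matches the column list above.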
Availability
Dataset: https://doi.org/10.7910/DVN/O7FWPO
Python extraction code: https://github.com/MELALab/nela-gt-2019
Related datasets
- NELA-GT-2018 — predecessor dataset (2018 data)
- FakeNewsNet — social-media-based dataset with engagement features
- ReCOVery — COVID-19-specific news credibility dataset
- CHECKED — Chinese-language COVID-19 misinformation dataset
Use cases
The dataset supports several research directions:
- Concept drift in veracity detection — understanding how fake news detection models degrade over time with automatically extracted features
- Semi-supervised learning — leveraging unlabeled and mixed-veracity sources
- Disinformation producer tactics over time — analyzing how false news creation tactics evolve
- Political narratives across events — tracking how news narratives change across major political events and across sources
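For the concept-drift use case listed above, one common setup is to train on earlier articles and evaluate on later ones. The sketch below shows such a chronological split; it is an illustration rather than a method prescribed by the dataset authors, and it assumes each article is a dict with a `published_utc` Unix timestamp in seconds. `chronological_split` is a hypothetical helper name.

```python
def chronological_split(articles, cutoff_utc):
    """Split articles into (before, on_or_after) a cutoff timestamp.

    Assumption: each article is a dict with an integer `published_utc`
    field holding a Unix timestamp in seconds.
    """
    earlier = [a for a in articles if a["published_utc"] < cutoff_utc]
    later = [a for a in articles if a["published_utc"] >= cutoff_utc]
    return earlier, later
```

Sliding the cutoff forward month by month and re-evaluating a fixed model on each later slice gives a simple picture of how performance changes over time.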
Notes
NELA-GT-2019 differs from NELA-GT-2018 in five ways:
- 66 additional sources, approximately 400K more articles, and full 12-month data (vs. 10 months in 2018)
- Updated labels with MBFC as primary assessment
- NewsGuard labels removed due to paywall/ToS changes
- New 3-class aggregated reliability label
- JSON format added alongside SQLite
The dataset can be combined with NELA2017 and NELA-GT-2018 for longitudinal studies spanning 2.5+ years.