NELA-GT-2019¶

Creators: Maurício Gruppi, Benjamin D. Horne, Sibel Adalı — Rensselaer Polytechnic Institute

Citation: Gruppi, Horne & Adalı (2020) — arXiv:2003.08444

DOI: https://doi.org/10.7910/DVN/O7FWPO

Overview¶

NELA-GT-2019 is a large multi-labelled news dataset comprising 1.12M news articles from 260 sources, collected between January 1 and December 31, 2019. It is the successor to NELA-GT-2018: it adds 66 sources and provides source-level ground truth labels from seven different assessment sites.

Data¶

Size: 1.12M articles from 260 news sources

Collection period: 2019 (full calendar year)

Language: English

Collection method: RSS feed scraping twice daily via feedparser and goose libraries

Source categories:
- Mainstream news outlets (left, center, right bias)
- Alternative news sources
- Conspiracy/pseudoscience sites (66 newly added)

Temporal coverage: Weekly article counts are relatively stable across the year, with some seasonal variation
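
A weekly aggregation of this kind can be sketched in a few lines. This is an illustration only, not the authors' code: it assumes each article's `date` value starts with an ISO `YYYY-MM-DD` string, and `weekly_counts` is a hypothetical helper name.

```python
from collections import Counter
from datetime import date


def weekly_counts(dates):
    """Count articles per ISO (year, week) pair from ISO-formatted date strings."""
    counts = Counter()
    for d in dates:
        # Assumption: the first 10 characters are "YYYY-MM-DD".
        iso = date.fromisoformat(d[:10]).isocalendar()
        counts[(iso[0], iso[1])] += 1
    # Return the weeks in chronological order.
    return dict(sorted(counts.items()))


sample = ["2019-01-01", "2019-01-02", "2019-01-08", "2019-12-30"]
print(weekly_counts(sample))
```

Note that ISO weeks do not align with calendar years: December 30, 2019 falls in ISO week 1 of 2020, so a per-week plot of a calendar-year collection may show a partial week at either end.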

Labels¶

Source-level ground truth labels from seven assessment sites:

Media Bias/Fact Check (MBFC) — political leaning and factual reporting score
Pew Research Center
Wikipedia
OpenSources (discontinued; labels carried forward)
AllSides — media bias assessments
BuzzFeed News (no longer updated; labels carried forward)
PolitiFact

Aggregated label¶

A 3-class source-level label computed from MBFC data:
- Unreliable: sources flagged as conspiracy/pseudoscience OR with low/very-low factual reporting
- Mixed: sources with a mixed factual reporting score
- Reliable: sources with high/very-high factual reporting
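
The mapping above can be written as a small function. The sketch below is one possible reading of the rule, not the dataset's own code: the argument names, the string spellings of the factuality levels, and the choice to return `None` for any other rating are all assumptions.

```python
from typing import Optional

UNRELIABLE, MIXED, RELIABLE = "unreliable", "mixed", "reliable"


def aggregate_label(factuality: Optional[str],
                    conspiracy_pseudoscience: bool = False) -> Optional[str]:
    """Map MBFC-style fields to the 3-class label (hypothetical helper)."""
    # The conspiracy/pseudoscience flag alone is enough for "unreliable".
    if conspiracy_pseudoscience:
        return UNRELIABLE
    if factuality is None:
        return None  # no MBFC factuality rating available
    # Normalise spellings such as "Very-Low" or " HIGH ".
    level = factuality.strip().lower().replace("-", " ")
    if level in {"low", "very low"}:
        return UNRELIABLE
    if level == "mixed":
        return MIXED
    if level in {"high", "very high"}:
        return RELIABLE
    return None  # assumption: any other rating stays unlabelled


print(aggregate_label("Very-Low"))
print(aggregate_label("high", conspiracy_pseudoscience=True))
```

Checking the flag first means a source marked as conspiracy/pseudoscience is treated as unreliable regardless of its factuality rating, which matches the "OR" in the definition.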

Label coverage: 79% of sources have at least one label; 76% have MBFC labels.

Format¶

SQLite database: Single table (newsdata) with one row per article

JSON format: One JSON file per source containing list of articles

Columns: id, date, source, title, content, author, published, published_utc, collection_utc
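
Both formats can be read with the Python standard library. The following is a minimal sketch under stated assumptions: the table name `newsdata` and the column names come from the description above, while the helper names, the untyped demo schema, and the idea that each JSON article is a dict keyed by those column names are illustrative guesses. The demo runs against an in-memory database rather than the real file.

```python
import json
import sqlite3

COLUMNS = ["id", "date", "source", "title", "content", "author",
           "published", "published_utc", "collection_utc"]


def articles_from_sqlite(conn, source):
    """Return all rows for one source from the newsdata table as dicts."""
    cur = conn.cursor()
    cur.row_factory = sqlite3.Row  # allows access by column name
    cur.execute("SELECT * FROM newsdata WHERE source = ? ORDER BY date",
                (source,))
    return [dict(row) for row in cur.fetchall()]


def articles_from_json(path):
    """Load one per-source JSON file (assumed to hold a list of articles)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Demo: build a tiny in-memory table with the documented column names.
# Column types are omitted because the real schema is not given here.
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE newsdata ({', '.join(COLUMNS)})")
conn.execute(
    "INSERT INTO newsdata (id, date, source, title) VALUES (?, ?, ?, ?)",
    ("a1", "2019-01-01", "examplesource", "Example title"),
)
rows = articles_from_sqlite(conn, "examplesource")
print(rows[0]["title"])
```

To query the released database instead, replace `":memory:"` with the path to the downloaded SQLite file and skip the `CREATE TABLE` and `INSERT` statements.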

Availability¶

Dataset: https://doi.org/10.7910/DVN/O7FWPO

Python extraction code: https://github.com/MELALab/nela-gt-2019

Related datasets:

NELA-GT-2018 — predecessor dataset (2018 data)
FakeNewsNet — social-media-based dataset with engagement features
ReCOVery — COVID-19-specific news credibility dataset
CHECKED — Chinese-language COVID-19 misinformation dataset

Use cases¶

The dataset supports several research directions:

Concept drift in veracity detection — understanding how fake news detection models degrade over time with automatically extracted features
Semi-supervised learning — leveraging unlabeled and mixed-veracity sources
Disinformation producer tactics over time — analyzing how false news creation tactics evolve
Political narratives across events — tracking how news narratives change across major political events and across sources
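
For the concept drift use case, a common setup is to train on earlier articles and evaluate on later ones. The sketch below shows such a chronological split; it is an illustration rather than a method prescribed by the dataset authors, `temporal_split` is a hypothetical helper, and it assumes each article is a dict whose `date` value starts with `YYYY-MM-DD`.

```python
def temporal_split(articles, cutoff):
    """Split articles into (before cutoff, on or after cutoff).

    `cutoff` is an ISO date string such as "2019-07-01".
    """
    # ISO dates sort chronologically as plain strings, so no parsing is needed.
    earlier = [a for a in articles if a["date"][:10] < cutoff]
    later = [a for a in articles if a["date"][:10] >= cutoff]
    return earlier, later


articles = [
    {"id": "a1", "date": "2019-02-10"},
    {"id": "a2", "date": "2019-07-01"},
    {"id": "a3", "date": "2019-11-23"},
]
train, test = temporal_split(articles, "2019-07-01")
print([a["id"] for a in train], [a["id"] for a in test])
```

Moving the cutoff later in the year and re-evaluating on the remaining months gives a simple way to observe how performance changes as the gap between training and test data grows.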

Notes¶

NELA-GT-2019 differs from NELA-GT-2018 in: (1) 66 additional sources, approximately 400K more articles, and full 12-month data (vs. 10 months in 2018); (2) updated labels with MBFC as primary assessment; (3) NewsGuard labels removed due to paywall/ToS changes; (4) new 3-class aggregated reliability label; (5) JSON format added alongside SQLite. The dataset can be combined with NELA2017 and NELA-GT-2018 for longitudinal studies spanning 2.5+ years.