NELA-GT-2019

Creators: Maurício Gruppi, Benjamin D. Horne, Sibel Adalı — Rensselaer Polytechnic Institute

Citation: [[2020-gruppi-nela-gt-2019|Gruppi, Horne & Adalı (2020)]]

DOI: https://doi.org/10.7910/DVN/O7FWPO

Overview

NELA-GT-2019 is a large multi-labelled news dataset comprising 1.12M news articles from 260 sources, collected between January 1 and December 31, 2019. It succeeds NELA-GT-2018, adding 66 sources and providing source-level ground-truth labels from seven assessment sites.

Data

Size: 1.12M articles from 260 news sources

Collection period: 2019 (full calendar year)

Language: English

Collection method: RSS feeds scraped twice daily using the feedparser and goose Python libraries

Source categories:

  • Mainstream news outlets (left, center, right bias)
  • Alternative news sources
  • Conspiracy/pseudoscience sites (66 newly added)

Temporal coverage: Weekly article counts are relatively stable across the year, with some seasonal variation

Labels

Source-level ground truth labels from seven assessment sites:

  1. Media Bias/Fact Check (MBFC) — political leaning and factual reporting score
  2. Pew Research Center
  3. Wikipedia
  4. OpenSources (discontinued; labels carried forward)
  5. AllSides — media bias assessments
  6. BuzzFeed News (no longer updated; labels carried forward)
  7. PolitiFact

Aggregated label

A 3-class source-level label computed from MBFC data:

  • Unreliable: sources flagged as conspiracy/pseudoscience OR with low/very-low factual reporting
  • Mixed: sources with a mixed factual reporting score
  • Reliable: sources with high/very-high factual reporting
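The mapping above can be sketched as a small function. This is only an illustration of the stated rule, not the authors' code: the function name, the argument names, the lowercase string labels, and the choice to return `None` for ratings the rule does not mention are all assumptions.

```python
from typing import Optional

# Rating strings are assumptions; MBFC's exact spelling may differ.
UNRELIABLE_RATINGS = {"low", "very low"}
RELIABLE_RATINGS = {"high", "very high"}


def aggregate_label(factual_reporting: Optional[str],
                    conspiracy_pseudoscience: bool = False) -> Optional[str]:
    """Map an MBFC-style assessment to 'unreliable', 'mixed', or 'reliable'.

    Returns None when the source has no MBFC rating or the rating is not
    covered by the rule described in the text.
    """
    # A conspiracy/pseudoscience flag overrides any factual reporting score.
    if conspiracy_pseudoscience:
        return "unreliable"
    if factual_reporting is None:
        return None
    # Normalise case and hyphenation so "Very-Low" matches "very low".
    rating = factual_reporting.strip().lower().replace("-", " ")
    if rating in UNRELIABLE_RATINGS:
        return "unreliable"
    if rating == "mixed":
        return "mixed"
    if rating in RELIABLE_RATINGS:
        return "reliable"
    return None
```

Checking the conspiracy/pseudoscience flag first reflects the "OR" in the Unreliable definition: a flagged source is unreliable regardless of its factual reporting score.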

Label coverage: 79% of sources have at least one label; 76% have MBFC labels.

Format

SQLite database: Single table (newsdata) with one row per article

JSON format: One JSON file per source containing list of articles

Columns: id, date, source, title, content, author, published, published_utc, collection_utc
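A minimal sketch of reading either format with the Python standard library follows. The table name `newsdata` and the `source` column come from the description above; the helper names and the file names in the usage comment are hypothetical, and the sketch assumes each JSON file holds a top-level list of article objects.

```python
import json
import sqlite3


def articles_per_source(conn):
    """Count articles per source in the newsdata table."""
    cursor = conn.execute(
        "SELECT source, COUNT(*) FROM newsdata GROUP BY source ORDER BY source"
    )
    return {source: count for source, count in cursor}


def load_source_json(path):
    """Load one per-source JSON file and return its list of articles."""
    with open(path, encoding="utf-8") as handle:
        articles = json.load(handle)
    # Assumption: the file's top-level value is a list of article objects.
    if not isinstance(articles, list):
        raise ValueError(f"expected a list of articles in {path}")
    return articles


# Usage (file names are hypothetical):
#   conn = sqlite3.connect("nela-gt-2019.db")
#   print(articles_per_source(conn))
#   articles = load_source_json("some-source.json")
```

Aggregating in SQL rather than loading every row keeps memory use low, which matters for a table with over a million articles.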

Availability

Dataset: https://doi.org/10.7910/DVN/O7FWPO

Python extraction code: https://github.com/MELALab/nela-gt-2019

Related datasets

  • NELA-GT-2018 — predecessor dataset (2018 data)
  • FakeNewsNet — social-media-based dataset with engagement features
  • ReCOVery — COVID-19-specific news credibility dataset
  • CHECKED — Chinese-language COVID-19 misinformation dataset

Use cases

The dataset supports several research directions:

  • Concept drift in veracity detection — understanding how fake news detection models degrade over time with automatically extracted features
  • Semi-supervised learning — leveraging unlabeled and mixed-veracity sources
  • Disinformation producer tactics over time — analyzing how false news creation tactics evolve
  • Political narratives across events — tracking how news narratives change across major political events and across sources

Notes

NELA-GT-2019 differs from NELA-GT-2018 in five ways:

  1. 66 additional sources, approximately 400K more articles, and a full 12 months of data (vs. 10 months in 2018)
  2. Updated labels, with MBFC as the primary assessment
  3. NewsGuard labels removed due to paywall/ToS changes
  4. A new 3-class aggregated reliability label
  5. JSON format added alongside SQLite

The dataset can be combined with NELA2017 and NELA-GT-2018 for longitudinal studies spanning 2.5+ years.