NELA-GT-2020

Creators: Maurício Gruppi, Benjamin D. Horne, Sibel Adalı — Rensselaer Polytechnic Institute

Citation: Gruppi, Horne & Adalı (2021) — arXiv:2102.04567

DOI: https://doi.org/10.7910/DVN/CHMUYZ

Overview

NELA-GT-2020 is a large multi-labelled news dataset comprising 1.78M news articles from 519 sources, collected throughout 2020 (January 1 to December 31). It succeeds NELA-GT-2019, nearly doubling the number of sources and introducing a new feature: tweets embedded in news articles. The dataset covers two major 2020 events, the COVID-19 pandemic and the U.S. Presidential Election, and extends beyond the primarily political coverage of previous versions into health-related news.

Data

Size: 1.78M articles from 519 news sources

Collection period: 2020 (full calendar year, with a minor outage in weeks 13–15)

Language: English

Collection method: RSS feeds scraped twice daily using the feedparser and goose libraries

Source categories:
- Mainstream news outlets (left, center, right bias)
- Alternative news sources
- Conspiracy-driven and pseudoscience media (258 additional sources vs. 2019)

Temporal coverage: Weekly aggregation shows collection activity across 52 weeks. Technical outages in weeks 13–15 (March 25–April 8) left an estimated ~15K articles missing (~0.8% of the dataset).
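A gap like this shows up as a dip in per-week article counts. The following is a minimal sketch of that weekly aggregation, assuming the `date` values are ISO `YYYY-MM-DD` strings and that weeks follow ISO numbering (both are assumptions, not confirmed by the dataset description). `weekly_counts` is a hypothetical helper and the sample dates are invented.

```python
from collections import Counter
from datetime import date

def weekly_counts(dates):
    """Count articles per (ISO year, ISO week), given 'YYYY-MM-DD' strings."""
    counts = Counter()
    for d in dates:
        iso_year, iso_week, _ = date.fromisoformat(d[:10]).isocalendar()
        counts[(iso_year, iso_week)] += 1
    return counts

# Invented sample dates, not taken from the dataset.
sample = ["2020-03-16", "2020-03-17", "2020-03-25", "2020-04-13"]
counts = weekly_counts(sample)

# Print counts for weeks 12-16; weeks with no articles print 0.
for week in range(12, 17):
    print(week, counts.get((2020, week), 0))
```

Run over the real `date` column, unusually low counts in a week would flag a likely collection gap.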

Embedded Tweets

Novel feature: 410,432 embedded tweets extracted from the news articles

Each embedded tweet includes its URL, publication date, and author information
A single article may contain multiple tweets, and a single tweet may appear in multiple articles
Raw HTML is stored in the tweet table with a foreign key to the parent article
This structure enables network analysis of article-tweet relationships
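The many-to-many relationship between articles and tweets can be recovered from the raw HTML. The sketch below assumes each stored fragment is a standard Twitter embed containing a permalink of the form `https://twitter.com/<user>/status/<id>`; that format is an assumption about the stored HTML, not something the dataset description specifies. `TweetLinkParser`, `tweet_id`, and `articles_by_tweet` are hypothetical helpers, and the rows are invented.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumed permalink shape; adjust if the stored HTML differs.
STATUS_URL = re.compile(r"https?://twitter\.com/(\w+)/status/(\d+)")

class TweetLinkParser(HTMLParser):
    """Collects href values of <a> tags in an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def tweet_id(raw_html):
    """Return the status ID of the first tweet permalink found, or None."""
    parser = TweetLinkParser()
    parser.feed(raw_html)
    for href in parser.hrefs:
        match = STATUS_URL.match(href)
        if match:
            return match.group(2)
    return None

def articles_by_tweet(rows):
    """Map tweet ID -> set of article IDs, given (article_id, raw_html) pairs."""
    index = defaultdict(set)
    for article_id, raw_html in rows:
        tid = tweet_id(raw_html)
        if tid is not None:
            index[tid].add(article_id)
    return dict(index)

# Invented rows for illustration only.
rows = [
    ("a1", '<blockquote class="twitter-tweet"><p>hi</p>'
           '<a href="https://twitter.com/someuser/status/111?ref_src=x">Jan 1, 2020</a></blockquote>'),
    ("a2", '<blockquote class="twitter-tweet">'
           '<a href="https://twitter.com/someuser/status/111">Jan 1, 2020</a></blockquote>'),
    ("a2", '<blockquote><a href="https://twitter.com/other/status/222">Jan 2, 2020</a></blockquote>'),
]
index = articles_by_tweet(rows)

# Tweets embedded in more than one article.
shared = {tid for tid, arts in index.items() if len(arts) > 1}
print(shared)
```

The resulting mapping is effectively the edge list of a bipartite article-tweet graph, which can be passed to a graph library for further analysis.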

Labels

Source-level ground truth labels from Media Bias/Fact Check (MBFC) covering two dimensions of veracity:

MBFC Factuality Score — 0–5 scale (low to high credibility)
Conspiracy/Pseudoscience designation — binary flag for low-credibility categorization

Aggregated label

A 3-class source-level label:
- Unreliable: sources with low or very low factual reporting, or flagged as conspiracy/pseudoscience
- Mixed: sources with a mixed factual reporting score
- Reliable: sources with high or very high factual reporting
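The aggregation rule above can be sketched as a small function. The mapping from the 0–5 factuality score to named levels (0 = very low, 1 = low, 2 = mixed, 4 = high, 5 = very high) is an assumption made for illustration, and the rule as stated does not say how a score of 3 is handled, so this sketch returns `None` for it. `aggregate_label` is a hypothetical name, not part of the dataset's tooling.

```python
def aggregate_label(factuality, conspiracy_pseudoscience):
    """Return 'unreliable', 'mixed', 'reliable', or None for one source.

    factuality: int on the 0-5 scale, or None if the source has no score.
    conspiracy_pseudoscience: True if the source carries that flag.
    """
    if conspiracy_pseudoscience:
        return "unreliable"
    if factuality is None:
        return None
    if factuality <= 1:   # assumed: 0 = very low, 1 = low
        return "unreliable"
    if factuality == 2:   # assumed: 2 = mixed
        return "mixed"
    if factuality >= 4:   # assumed: 4 = high, 5 = very high
        return "reliable"
    return None           # a score of 3 is not covered by the rule as stated

print(aggregate_label(5, False))
print(aggregate_label(5, True))
```

Note that the conspiracy/pseudoscience flag takes precedence here: a flagged source is treated as unreliable regardless of its factuality score, which follows the "OR" in the rule.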

Format

SQLite database: two tables, newsdata (articles) and tweet (embedded tweets)

JSON format: one JSON file per source, containing a list of articles
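Reading one per-source file might look like the sketch below. It assumes each file holds a top-level JSON array of article objects whose keys mirror the article columns; the file name, the sample records, and the `load_source` helper are invented for illustration, and the real files may be organized differently.

```python
import json
import tempfile
from pathlib import Path

def load_source(path):
    """Load one per-source JSON file and return its list of article dicts."""
    with open(path, encoding="utf-8") as f:
        articles = json.load(f)
    if not isinstance(articles, list):
        raise ValueError(f"expected a list of articles in {path}")
    return articles

# Write a tiny invented file so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "examplesource.json"
    sample_path.write_text(json.dumps([
        {"id": "x1", "source": "examplesource", "title": "First", "date": "2020-01-01"},
        {"id": "x2", "source": "examplesource", "title": "Second", "date": "2020-01-02"},
    ]), encoding="utf-8")
    articles = load_source(sample_path)

titles = [a["title"] for a in articles]
print(len(articles), titles)
```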

Article columns: id, date, source, title, content, author, published, published_utc, collection_utc, url

Tweet columns: id, article_id, embedded_tweet (raw HTML)
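The two tables can be joined on the tweet's `article_id`. The sketch below builds an in-memory stand-in for the database so that it runs without the real file. The column names come from the lists above, but the column types, key constraints, and sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the real database file. Column types and
# constraints are guesses; only the column names come from the description.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE newsdata (
        id TEXT PRIMARY KEY, date TEXT, source TEXT, title TEXT,
        content TEXT, author TEXT, published TEXT,
        published_utc INTEGER, collection_utc INTEGER, url TEXT
    );
    CREATE TABLE tweet (
        id TEXT PRIMARY KEY,
        article_id TEXT REFERENCES newsdata(id),
        embedded_tweet TEXT
    );
""")

# Invented rows for illustration only.
conn.executemany(
    "INSERT INTO newsdata (id, date, source, title) VALUES (?, ?, ?, ?)",
    [("a1", "2020-03-01", "source_a", "T1"),
     ("a2", "2020-03-02", "source_b", "T2")],
)
conn.executemany(
    "INSERT INTO tweet (id, article_id, embedded_tweet) VALUES (?, ?, ?)",
    [("t1", "a1", "<blockquote>tweet one</blockquote>"),
     ("t2", "a1", "<blockquote>tweet two</blockquote>"),
     ("t3", "a2", "<blockquote>tweet three</blockquote>")],
)

# Count embedded tweets per source by joining tweets to their parent article.
tweets_per_source = conn.execute("""
    SELECT n.source, COUNT(t.id) AS n_tweets
    FROM newsdata AS n
    JOIN tweet AS t ON t.article_id = n.id
    GROUP BY n.source
    ORDER BY n.source
""").fetchall()
print(tweets_per_source)
conn.close()
```

Against the real database, the same query would run after replacing `":memory:"` with the path to the downloaded SQLite file and dropping the table creation and inserts.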

Availability

Dataset: https://doi.org/10.7910/DVN/CHMUYZ

Python extraction code: https://github.com/MELALab/nela-gt

Related datasets:

NELA-GT-2019 — predecessor dataset (2019 data, 1.12M articles)
NELA-GT-2018 — foundational dataset (2018 data, 713K articles)
FakeNewsNet — social-media-based dataset with engagement features
ReCOVery — COVID-19-specific news credibility dataset
CHECKED — Chinese-language COVID-19 misinformation dataset

Use cases

The dataset supports several research directions:

Event-driven narrative analysis — tracking coverage of COVID-19 and the 2020 U.S. Presidential Election across reliable, mixed, and unreliable sources
Robust veracity detection over time — longitudinal evaluation of fake news detection models using three consecutive years of data (2018–2020)
News-social media dynamics — leveraging 410K embedded tweets to understand how social media content is incorporated into news articles and how this varies by source reliability
Health misinformation — dedicated collection of health-related news enabling analysis of COVID-19 misinformation and disinformation tactics
Media manipulation tactics — examining how unreliable outlets changed coverage strategies during major 2020 events
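For the longitudinal evaluation use case, one simple setup is to train on earlier years and evaluate on a later one. The following is a minimal sketch of such a split, assuming article records are dicts with an ISO `YYYY-MM-DD` `date` field (an assumption about the format). `split_by_year` is a hypothetical helper and the records are invented.

```python
def split_by_year(articles, train_years, test_years):
    """Partition article dicts into train/test lists by the year in 'date'."""
    train, test = [], []
    for article in articles:
        year = int(article["date"][:4])
        if year in train_years:
            train.append(article)
        elif year in test_years:
            test.append(article)
    return train, test

# Invented records for illustration only.
articles = [
    {"id": "a", "date": "2018-05-01"},
    {"id": "b", "date": "2019-07-15"},
    {"id": "c", "date": "2020-11-03"},
]
train, test = split_by_year(articles, train_years={2018, 2019}, test_years={2020})
print([a["id"] for a in train], [a["id"] for a in test])
```

Splitting by time rather than at random keeps later articles out of training, so the evaluation reflects how a model copes with drift in topics and sources.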

Notes

NELA-GT-2020 differs from NELA-GT-2019 in five ways: (1) 258 additional sources (from 261 to 519), mostly fringe or unreliable outlets; (2) a topic scope expanded from primarily political news to include health-related articles; (3) embedded tweets (410K), which link news articles to social media; (4) simplified ground truth that relies solely on MBFC labels (vs. seven assessment sites in 2019), chosen for MBFC's comprehensiveness and availability; (5) larger scale (1.78M vs. 1.12M articles). Combined, the three NELA-GT datasets provide three consecutive years of longitudinal news data (2018–2020) for studying concept drift, robustness, and long-term misinformation dynamics.