NELA-GT-2020
Creators: Maurício Gruppi, Benjamin D. Horne, Sibel Adalı — Rensselaer Polytechnic Institute
Citation: [[2021-gruppi-nela-gt-2020|Gruppi, Horne & Adalı (2021)]]
DOI: https://doi.org/10.7910/DVN/CHMUYZ
Overview
NELA-GT-2020 is a large multi-labelled news dataset comprising 1.78M news articles from 519 sources collected throughout 2020 (January 1–December 31). It is an updated successor to NELA-GT-2019, nearly doubling the number of sources and introducing a novel feature: embedded tweets found within news articles. The dataset captures two major 2020 events—the COVID-19 pandemic and the U.S. Presidential Election—with expansions into health-related news beyond the primarily political coverage of previous versions.
Data
Size: 1.78M articles from 519 news sources
Collection period: 2020 (full calendar year, with a minor outage in weeks 13–15)
Language: English
Collection method: RSS feed scraping twice daily via feedparser and goose libraries
Source categories:
- Mainstream news outlets (left, center, right bias)
- Alternative news sources
- Conspiracy-driven and pseudoscience media (258 additional sources vs. 2019)
Temporal coverage: Weekly aggregation shows collection activity across all 52 weeks. Technical outages in weeks 13–15 (March 25–April 8) caused an estimated ~15K missing articles (~0.8% of the dataset).
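The weekly view and the missing-share figure can be reproduced roughly as follows. This is a minimal sketch, assuming the date column holds ISO `YYYY-MM-DD` strings and that weeks follow ISO calendar numbering; the authors' own week index may differ (ISO numbering gives 2020 a week 53). The function name `articles_per_week` is hypothetical and not part of the released code.

```python
from collections import Counter
from datetime import date


def articles_per_week(dates):
    """Count articles per (ISO year, ISO week) from 'YYYY-MM-DD' strings."""
    counts = Counter()
    for d in dates:
        iso = date.fromisoformat(d).isocalendar()
        counts[(iso[0], iso[1])] += 1
    return counts


# Toy input (not real dataset rows): two articles in week 13, one in week 15.
counts = articles_per_week(["2020-03-25", "2020-03-26", "2020-04-08"])

# Share of the dataset estimated as missing, using the figures quoted above.
missing_share = 15_000 / 1_780_000
print(f"{missing_share:.1%}")
```

Weeks with an unusually low count relative to their neighbours are the ones to check against the reported outage window.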
Embedded Tweets
Novel feature: 410,432 embedded tweets extracted from the news articles
- Tweets include URLs, publication dates, and author information
- A single article may contain multiple tweets, and a single tweet may appear in multiple articles
- Raw HTML is stored in the tweet table with a foreign key to the parent article
- Enables network analysis of article-tweet relationships
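Because the tweets are stored as raw HTML, some parsing is needed before they can be linked across articles. The sketch below assumes the stored markup resembles Twitter's standard blockquote embed, with an anchor pointing at a `twitter.com/<user>/status/<id>` URL; that layout is an assumption and should be checked against real rows. `LinkCollector`, `tweet_status_links`, and the sample HTML are hypothetical.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every href found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Assumed URL shape for a tweet permalink.
STATUS_RE = re.compile(r"twitter\.com/([^/]+)/status/(\d+)")


def tweet_status_links(raw_html):
    """Return (author_handle, tweet_id) pairs found in embedded-tweet HTML."""
    parser = LinkCollector()
    parser.feed(raw_html)
    found = []
    for link in parser.links:
        match = STATUS_RE.search(link)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


# Made-up embed for illustration only.
sample = (
    '<blockquote class="twitter-tweet"><p>Example text</p>'
    'Example User (@example_user) '
    '<a href="https://twitter.com/example_user/status/1234567890?ref_src=twsrc">'
    'March 1, 2020</a></blockquote>'
)
links = tweet_status_links(sample)
```

Keying on the extracted tweet ID rather than the raw HTML string makes it possible to detect the same tweet embedded in several articles.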
Labels
Source-level ground truth labels from Media Bias/Fact Check (MBFC) covering two dimensions of veracity:
- MBFC Factuality Score — 0–5 scale (low to high credibility)
- Conspiracy/Pseudoscience designation — binary flag for low-credibility categorization
Aggregated label
A 3-class source-level label:
- Unreliable: sources with low/very-low factual reporting OR flagged as conspiracy/pseudoscience
- Mixed: sources with mixed factual reporting score
- Reliable: sources with high/very-high factual reporting
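The aggregation rule above can be written as a small function. This is a sketch of the rule as described here, not the authors' code. It takes the factuality level as a name (for example "low" or "very high") rather than a number, since the mapping of the 0–5 score to level names is not spelled out in this note. Returning `None` for any level not listed above is also an assumption, and `aggregate_label` is a hypothetical name.

```python
def aggregate_label(factuality, conspiracy_pseudoscience):
    """Map MBFC source-level labels to the 3-class aggregated label.

    factuality: level name such as "low" or "very high", or None if unknown.
    conspiracy_pseudoscience: True if the source carries that MBFC flag.
    """
    level = factuality.strip().lower() if factuality else None

    # The conspiracy/pseudoscience flag overrides the factuality level.
    if conspiracy_pseudoscience or level in {"very low", "low"}:
        return "unreliable"
    if level == "mixed":
        return "mixed"
    if level in {"high", "very high"}:
        return "reliable"
    # Assumption: levels not covered by the rule stay unlabeled.
    return None
```

Checking the flag first reflects the "OR" in the Unreliable definition: a source flagged as conspiracy/pseudoscience is unreliable regardless of its factuality level.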
Format
SQLite database: Two tables: newsdata (articles) and tweet (embedded tweets)
JSON format: One JSON file per source, each containing a list of articles
Article columns: id, date, source, title, content, author, published, published_utc, collection_utc, url
Tweet columns: id, article_id, embedded_tweet (raw HTML)
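A quick way to get a feel for the two-table layout is to join tweets to their parent articles. The sketch below builds a toy in-memory database using the table and column names listed above; the column types, the toy rows, and the helper name `tweets_per_source` are assumptions for illustration. Against the real data, one would connect to the downloaded SQLite file instead.

```python
import sqlite3

# Toy in-memory stand-in for the released database (types are assumed).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE newsdata (
        id TEXT PRIMARY KEY, date TEXT, source TEXT, title TEXT,
        content TEXT, author TEXT, published TEXT,
        published_utc INTEGER, collection_utc INTEGER, url TEXT
    );
    CREATE TABLE tweet (
        id TEXT,
        article_id TEXT REFERENCES newsdata(id),
        embedded_tweet TEXT
    );
    """
)

# Made-up rows: only the columns needed for the join are filled in.
conn.executemany(
    "INSERT INTO newsdata (id, date, source, title) VALUES (?, ?, ?, ?)",
    [
        ("a1", "2020-03-01", "examplenews", "First"),
        ("a2", "2020-03-02", "examplenews", "Second"),
        ("a3", "2020-03-02", "othernews", "Third"),
    ],
)
conn.executemany(
    "INSERT INTO tweet (id, article_id, embedded_tweet) VALUES (?, ?, ?)",
    [
        ("t1", "a1", "<blockquote>...</blockquote>"),
        ("t2", "a1", "<blockquote>...</blockquote>"),
        ("t3", "a3", "<blockquote>...</blockquote>"),
    ],
)


def tweets_per_source(connection):
    """Count embedded tweets per news source via the article foreign key."""
    rows = connection.execute(
        """
        SELECT n.source, COUNT(t.id)
        FROM newsdata AS n
        LEFT JOIN tweet AS t ON t.article_id = n.id
        GROUP BY n.source
        ORDER BY n.source
        """
    ).fetchall()
    return dict(rows)


per_source = tweets_per_source(conn)
```

The `LEFT JOIN` keeps sources whose articles embed no tweets, so they appear with a count of zero rather than dropping out of the result.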
Availability
Dataset: https://doi.org/10.7910/DVN/CHMUYZ
Python extraction code: https://github.com/MELALab/nela-gt
Related datasets
- NELA-GT-2019 — predecessor dataset (2019 data, 1.12M articles)
- NELA-GT-2018 — foundational dataset (2018 data, 713K articles)
- FakeNewsNet — social-media-based dataset with engagement features
- ReCOVery — COVID-19-specific news credibility dataset
- CHECKED — Chinese-language COVID-19 misinformation dataset
Use cases
The dataset supports several research directions:
- Event-driven narrative analysis — tracking coverage of COVID-19 and the 2020 U.S. Presidential Election across reliable, mixed, and unreliable sources
- Robust veracity detection over time — longitudinal evaluation of fake news detection models using three consecutive years of data (2018–2020)
- News-social media dynamics — leveraging 410K embedded tweets to understand how social media content is incorporated into news articles and how this varies by source reliability
- Health misinformation — dedicated collection of health-related news enabling analysis of COVID-19 misinformation and disinformation tactics
- Media manipulation tactics — examining how unreliable outlets changed coverage strategies during major 2020 events
Notes
NELA-GT-2020 differs from NELA-GT-2019 in: (1) 258 additional sources (from 261 to 519), mostly fringe/unreliable outlets; (2) expanded topic scope from primarily political news to include health-related articles; (3) embedded tweets (410K) providing novel news-social media linking; (4) simplified ground truth relying solely on MBFC labels (vs. seven assessment sites in 2019) due to MBFC's comprehensiveness and availability; (5) larger scale (1.78M vs. 1.12M articles). The three NELA datasets combined provide three consecutive years of longitudinal news data (2018–2020) for studying concept drift, robustness, and long-term misinformation dynamics.