NELA-GT-2018¶
Creators: Jeppe Nørregaard, Benjamin D. Horne, Sibel Adalı — Technical University of Denmark, Rensselaer Polytechnic Institute
Citation: Nørregaard, Horne & Adalı (2019)
DOI: https://doi.org/10.7910/DVN/ULHLCB
Overview¶
NELA-GT-2018 is a large, engagement-independent news dataset comprising 713,534 news articles from 194 sources, collected over 10 months (February 1 – November 30, 2018). It pairs these articles with source-level ground truth labels from eight independent assessment sites, covering multiple dimensions of credibility: reliability, bias, transparency, adherence to journalistic standards, and consumer trust. The dataset was designed to address gaps in prior misinformation research by providing large-scale, diverse coverage that does not depend on social media engagement signals.
Data¶
Size: 713,534 articles from 194 news and media outlets
Collection period: February 1 – November 30, 2018 (10 months)
Language: English
Collection method: RSS feeds scraped twice daily, directly from source websites, using the Python libraries feedparser and goose
Source categories:
- Mainstream news outlets (broadcast, print, wire services)
- Alternative news sources
- Hyper-partisan sources
- Conspiracy and pseudoscience sites
- Satire outlets
Temporal coverage: Near-complete daily data, with the exception of two periods with scraping issues; roughly 2,350 articles per day on average (713,534 articles over the 303-day collection window)
Labels¶
Source-level ground truth from eight independent assessment platforms:
- NewsGuard — credibility assessment from trained journalists; 9 granular binary labels summing to 100-point credibility score
- Pew Research Center — trust ratings from five political groups (consistently-liberal to consistently-conservative)
- Wikipedia — curated list of fake news sites
- OpenSources — expert labels with 1–3 tags per source (reliable, fake, conspiracy, etc.)
- Media Bias/Fact Check (MBFC) — factual reporting score and bias classification
- AllSides — bias assessment (left, center, right) with community feedback
- BuzzFeed News — political leaning (left, right)
- PolitiFact — truthfulness statistics from fact-checked statements
Coverage: 154 of 194 sources have labels from at least one assessment site; 40 sources remain unlabeled
Label dimensions: Reliability (good/poor credibility), bias (left/center/right), factual reporting quality, transparency, journalistic standards, consumer trust
Format¶
Database: SQLite with one row per article
Fields per article: date, source, title, cleaned text content
Label format: CSV with rows = sources, columns = labels from all assessment sites
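A minimal sketch of how the two files might be read together, using only the Python standard library. The table name `articles`, the column names, and the label column `hypothetical_site_label` are assumptions made for illustration; check the released files for the actual schema. An in-memory database and an inline CSV stand in for the real files so the sketch runs as written.

```python
import csv
import io
import sqlite3

# Stand-in for the released SQLite file; table and column names are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (date TEXT, source TEXT, title TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?, ?)",
    [
        ("2018-02-01", "source_a", "Title 1", "Body text 1"),
        ("2018-02-01", "source_b", "Title 2", "Body text 2"),
        ("2018-02-02", "source_a", "Title 3", "Body text 3"),
    ],
)

# Stand-in for the label CSV: one row per source, one column per label.
# An empty cell means the site did not assess that source.
labels_csv = io.StringIO(
    "source,hypothetical_site_label\n"
    "source_a,reliable\n"
    "source_b,\n"
)
labels_by_source = {row["source"]: row for row in csv.DictReader(labels_csv)}

# Count articles per source, then look up that source's label row.
articles_per_source = dict(
    conn.execute(
        "SELECT source, COUNT(*) FROM articles GROUP BY source ORDER BY source"
    ).fetchall()
)
for source, n_articles in articles_per_source.items():
    row = labels_by_source.get(source, {})
    label = row.get("hypothetical_site_label") or None  # "" becomes None
    print(source, n_articles, label)

conn.close()
```

With the real files, `sqlite3.connect` would take the path to the downloaded database and `csv.DictReader` would wrap an open file handle instead of the inline string.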
Availability¶
Dataset: https://doi.org/10.7910/DVN/ULHLCB
Related datasets¶
- NELA2017 — predecessor (136K articles, 92 sources, 2017 data); includes natural language features and Facebook engagement metrics
- NELA-GT-2019 — successor (1.12M articles, 260 sources, 2019 data); adds 66 sources and updated assessment labels
- FakeNewsNet — social-media-based dataset with engagement features; smaller but includes retweet cascades and user profiles
- FakeNewsCorpus — 10M articles labeled via opensources.co; much larger but with lower-quality source labeling
Use cases¶
Distantly supervised learning — source-level labels enable large-scale machine learning without expensive per-article fact-checking; model parameters can be updated in real time as new articles arrive from known sources
Semi-supervised learning — the 40 unlabeled sources, together with mixed-veracity sources, can be leveraged to improve robustness
Multi-method studies — the collection spans multiple major events (the 2018 U.S. midterm elections, Cambridge Analytica, the Kavanaugh hearings), enabling analysis of false-news producer tactics over time and across events
Concept drift — the 10-month temporal span enables analysis of how false-news detection models built on automatically extracted features degrade over time
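A concept-drift study typically trains on an early window and scores the model on later windows. The sketch below shows one way to set that up, assuming each article is a dict with a `date` field; the helper names `chronological_split` and `monthly_buckets` are hypothetical and not part of the dataset release.

```python
from datetime import date


def chronological_split(articles, cutoff):
    """Split articles into (train, test) by publication date.

    Articles dated before `cutoff` go to train; the rest go to test.
    """
    train = [a for a in articles if a["date"] < cutoff]
    test = [a for a in articles if a["date"] >= cutoff]
    return train, test


def monthly_buckets(articles):
    """Group articles by (year, month) so a model can be scored per month."""
    buckets = {}
    for a in articles:
        key = (a["date"].year, a["date"].month)
        buckets.setdefault(key, []).append(a)
    return dict(sorted(buckets.items()))


# Toy records; real rows would come from the article database.
articles = [
    {"date": date(2018, 2, 3), "source": "source_a"},
    {"date": date(2018, 5, 17), "source": "source_b"},
    {"date": date(2018, 9, 1), "source": "source_a"},
    {"date": date(2018, 11, 30), "source": "source_b"},
]

train, test = chronological_split(articles, cutoff=date(2018, 6, 1))
test_by_month = monthly_buckets(test)
print(len(train), len(test), list(test_by_month))
```

Scoring a fixed model on each bucket of `test_by_month` in order gives a per-month performance curve, which is the usual way to make degradation over time visible.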
Unique strengths¶
- Engagement-independent: collected directly from source feeds, not social media; captures low-visibility misinformation tactics.
- Multi-dimensional labels: eight assessment perspectives rather than a single binary label; captures reliability, bias, transparency, trust, and journalistic standards.
- Long temporal window: the 10-month collection provides a longitudinal perspective that event-specific datasets lack.
- Diverse sources: 194 sources spanning mainstream, alternative, hyper-partisan, and conspiracy categories.
Design notes¶
Source-level labeling: All articles from a source receive the same label, which sidesteps the article-level labeling bottleneck. Tradeoff: this ignores within-source variation (reliable sources may occasionally publish false stories). The approach is justified for large-scale machine learning, where per-article annotation is infeasible.
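The propagation step can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function name `propagate_source_labels`, the record layout, and the label values `reliable`/`unreliable` are all hypothetical.

```python
def propagate_source_labels(articles, source_labels):
    """Give each article its source's label; set aside unlabeled ones.

    Returns (labeled, unlabeled). Articles whose source has no label are
    kept separately, for example for semi-supervised use.
    """
    labeled, unlabeled = [], []
    for article in articles:
        label = source_labels.get(article["source"])
        if label is None:
            unlabeled.append(article)
        else:
            # Copy the record so the input list is left unchanged.
            labeled.append({**article, "label": label})
    return labeled, unlabeled


# Hypothetical source-level labels and toy article records.
source_labels = {"source_a": "reliable", "source_b": "unreliable"}
articles = [
    {"source": "source_a", "title": "t1"},
    {"source": "source_b", "title": "t2"},
    {"source": "source_c", "title": "t3"},  # source without any label
]

labeled, unlabeled = propagate_source_labels(articles, source_labels)
print([a["label"] for a in labeled], [a["source"] for a in unlabeled])
```

Every article from `source_a` inherits `reliable` regardless of its content, which is exactly the within-source variation the tradeoff above refers to.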
Corroboration methodology: Combining eight assessment sites captures different evaluation criteria and methodologies, reducing single-assessor bias. Sites apply different methods: NewsGuard uses trained journalists, Pew uses survey aggregation, Wikipedia uses crowd editing, OpenSources uses expert curation, MBFC uses numerical multi-category evaluation, AllSides uses data-driven approaches plus community feedback, BuzzFeed uses manual classification, PolitiFact uses statement-level aggregation.
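One simple way to combine several sites' verdicts for a source is a majority vote that abstains when too few sites agree. The sketch below is an assumption-laden illustration, not a procedure defined by the dataset: it presumes each site's raw label has already been mapped to a shared vocabulary (here `reliable`/`unreliable`), and the function name `corroborated_label` and the `min_sites` threshold are hypothetical.

```python
from collections import Counter


def corroborated_label(site_labels, min_sites=2):
    """Majority vote over the sites that assessed a source.

    `site_labels` maps site name -> label, with None where a site gave
    no assessment. Returns None when fewer than `min_sites` sites gave
    a label or when the top two labels tie.
    """
    votes = [v for v in site_labels.values() if v is not None]
    if len(votes) < min_sites:
        return None
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: the sites do not corroborate each other
    return counts[0][0]


agreeing = {"site_1": "reliable", "site_2": "reliable", "site_3": "unreliable"}
sparse = {"site_1": "reliable", "site_2": None}
split = {"site_1": "reliable", "site_2": "unreliable"}

print(corroborated_label(agreeing), corroborated_label(sparse), corroborated_label(split))
```

Abstaining on ties and sparse coverage trades label coverage for label confidence; a stricter `min_sites` shrinks the labeled set but reduces the influence of any single assessor.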