NELA-GT-2018: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles
Authors: Jeppe Nørregaard, Benjamin D. Horne, Sibel Adalı
Affiliation: Technical University of Denmark, Rensselaer Polytechnic Institute
Venue: arXiv preprint
ArXiv: 1904.01546
Dataset: https://doi.org/10.7910/DVN/ULHLCB
TL;DR
NELA-GT-2018 is a large, engagement-independent collection of 713,534 news articles from 194 sources, gathered over 10 months (Feb–Nov 2018). It pairs the articles with source-level ground truth labels from eight independent assessment sites (NewsGuard, Pew Research, Wikipedia, OpenSources, Media Bias/Fact Check, AllSides, BuzzFeed News, PolitiFact) covering several dimensions of veracity, including reliability, bias, transparency, and consumer trust. The dataset addresses four gaps in existing misinformation datasets: scale, source diversity, collection that does not depend on social media engagement, and multi-dimensional ground truth.
Contributions
- Large-scale, engagement-independent dataset addressing the gap in publicly available labeled news datasets; collected directly from news source RSS feeds independent of social media.
- Multi-dimensional ground truth labels combining eight different assessment sites, providing source-level labels across reliability, bias, transparency, journalistic standards, and consumer trust dimensions.
- Diverse source coverage including mainstream, hyper-partisan, conspiracy, and satire sources with balanced temporal and spatial representation.
- Machine-readable format with SQLite database and CSV labels enabling immediate use in machine learning studies via distant supervision.
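As a rough illustration of how a SQLite article store and a CSV label file can be combined, the sketch below builds a tiny in-memory database and a stand-in label file. The table name, column names, source names, and label values are all hypothetical placeholders, not the released schema; inspect the actual files before adapting it.

```python
import csv
import io
import sqlite3

# Hypothetical schema and values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (source TEXT, title TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [
        ("source_a", "Headline 1", "Body text 1"),
        ("source_a", "Headline 2", "Body text 2"),
        ("source_b", "Headline 3", "Body text 3"),
    ],
)

# Stand-in for a labels CSV with one row per source.
labels_csv = io.StringIO("source,example_label\nsource_a,reliable\n")
labels = {row["source"]: row["example_label"] for row in csv.DictReader(labels_csv)}

# Attach each article's source-level label; None marks an unlabeled source.
rows = conn.execute("SELECT source, title FROM articles ORDER BY title").fetchall()
labeled_rows = [(source, title, labels.get(source)) for source, title in rows]
conn.close()
```

With the real files, `sqlite3.connect` would point at the database path and `csv.DictReader` at the opened label file; the join logic stays the same.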
Dataset Collection
Scope: 713,534 articles from 194 news and media outlets collected Feb 1–Nov 30, 2018.
Collection method: RSS feed scraping twice daily using Python libraries feedparser and goose. Started with 92 sources from the prior NELA2017 dataset, expanded using Google Search API queries to find sources publishing similar articles.
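Scraping the same feeds twice a day means most items in a later pass were already seen in an earlier one. The sketch below shows one plausible way to keep only new items. It is not the authors' pipeline: it swaps feedparser and goose for the standard library's `xml.etree.ElementTree`, works on inline RSS strings instead of live feeds, skips article-body extraction, and assumes that deduplicating by link is acceptable. The function name `new_items` and the example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET


def new_items(rss_xml, seen_links):
    """Return (title, link) pairs from an RSS 2.0 document that were not seen before."""
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        title = (item.findtext("title") or "").strip()
        if link and link not in seen_links:
            seen_links.add(link)  # remember the link so later passes skip it
            fresh.append((title, link))
    return fresh


# Two snapshots of the same (made-up) feed, e.g. a morning and an evening pass.
morning = """<rss version="2.0"><channel>
<item><title>Story A</title><link>https://example.org/a</link></item>
<item><title>Story B</title><link>https://example.org/b</link></item>
</channel></rss>"""
evening = """<rss version="2.0"><channel>
<item><title>Story B</title><link>https://example.org/b</link></item>
<item><title>Story C</title><link>https://example.org/c</link></item>
</channel></rss>"""

seen = set()
first_pass = new_items(morning, seen)   # both stories are new
second_pass = new_items(evening, seen)  # only Story C is new
```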
Source diversity:
- Mainstream outlets (ABC, BBC, CNN, Fox News, Reuters, etc.)
- Alternative sources (Breitbart, Infowars, Russia Today, etc.)
- Hyper-partisan and conspiracy sources
- Multiple geographic origins (primarily English-language)
Temporal coverage: Near-complete daily coverage with two gaps visible in temporal distribution due to scraping issues.
Ground Truth Labeling
Eight independent assessment platforms provided source-level veracity labels:
- NewsGuard: trained journalists assess credibility using 9 granular binary labels (false content, responsible gathering, corrections, opinion/news distinction, deceptive headlines, ownership disclosure, advertising labels, editorial accountability, content creator information) that sum to a 100-point credibility score.
- Pew Research Center: trust ratings aggregated from five political groups (consistently liberal to consistently conservative).
- Wikipedia: curated list of fake news sites (those intentionally publishing hoaxes or disinformation).
- OpenSources: expert-labeled sources with 1–3 tags each (reliable, blog, clickbait, rumor, fake, unreliable, biased, conspiracy, hate speech, junk science, political, satire, state news).
- Media Bias/Fact Check (MBFC): numerical evaluation in four categories (biased wording/headlines, factual/sourcing, story choices, political affiliation), averaged into a final verdict; also provides a factual reporting label.
- AllSides: bias assessment (left, center, right) with community feedback mechanisms.
- BuzzFeed News: political leaning labels (left, right) from a 2017 election dataset.
- PolitiFact: aggregated truthfulness counts of statements attributed to each source.
Coverage: 154 of 194 sources receive labels from at least one assessment site; remaining 40 sources unlabeled.
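A coverage figure of this kind is the size of the union of the per-site label sets. The sketch below shows the computation on invented data; the site names, source names, and memberships are placeholders, not values from the dataset.

```python
# Toy per-site label tables: site name -> set of sources that site labels.
# All names and memberships are invented for illustration.
site_labels = {
    "site_1": {"source_a", "source_b"},
    "site_2": {"source_b", "source_c"},
    "site_3": set(),
}
all_sources = {"source_a", "source_b", "source_c", "source_d"}

# A source counts as labeled if at least one site covers it.
labeled = set().union(*site_labels.values()) & all_sources
unlabeled = all_sources - labeled

# How many sites cover each source (useful for spotting sparsely labeled sources).
sites_per_source = {
    source: sum(source in members for members in site_labels.values())
    for source in sorted(all_sources)
}
```

The same per-source count also shows where assessment sites overlap, which matters when their labels need to be cross-checked.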
Use Cases
Distant supervised learning — Source-level labels serve as proxies for article-level labels, enabling large-scale machine learning without expensive per-article fact-checking; allows real-time updating for new articles from known sources.
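A minimal sketch of the label-propagation step behind distant supervision, assuming articles arrive as `(source, text)` pairs and labels as a source-to-label mapping. The function name `propagate_labels`, the source names, and the label values are hypothetical.

```python
def propagate_labels(articles, source_labels):
    """Copy each source-level label onto every article from that source.

    Articles from sources without a label are returned separately, so they
    can be dropped or kept as an unlabeled pool.
    """
    labeled, unlabeled = [], []
    for source, text in articles:
        label = source_labels.get(source)
        if label is None:
            unlabeled.append(text)
        else:
            labeled.append((text, label))
    return labeled, unlabeled


# Invented example data.
articles = [
    ("source_a", "text 1"),
    ("source_b", "text 2"),
    ("source_a", "text 3"),
    ("source_c", "text 4"),
]
source_labels = {"source_a": "reliable", "source_b": "unreliable"}

train_pairs, unlabeled_pool = propagate_labels(articles, source_labels)
```

Every article inherits its source's label regardless of its own content, which is the homogeneity assumption that source-level labeling relies on.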
Semi-supervised learning — Combines consistent labels from 100+ sources with unlabeled and mixed-veracity sources to improve generalization and domain robustness.
Mixed-method studies — Dataset spans multiple major events (e.g., 2018 election, Cambridge Analytica, Kavanaugh hearings), enabling analysis of tactical evolution, narrative consistency, and event-specific dynamics in false-news production.
Design Decisions & Novelty
Engagement-independent collection: Unlike FakeNewsCorpus and social-media-based datasets, this dataset scrapes news source feeds directly, which avoids bias toward high-engagement misinformation while still capturing false content that draws little engagement.
Multi-source labels: While some prior datasets rely on a single assessment scheme, NELA-GT-2018 combines eight independent perspectives, capturing multiple dimensions of source credibility (factuality, bias, transparency, trust) rather than reducing credibility to a binary true/false label.
Long temporal window: The 10-month collection period gives a longitudinal view that event-specific datasets lack. On average, each source contributed about 3,700 articles (713,534 / 194), which works out to roughly 2,350 articles per day across all sources over the 303-day window.
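The averages implied by the headline totals can be checked with a few lines of arithmetic. The sketch uses only the figures stated in this summary (713,534 articles, 194 sources, Feb 1 to Nov 30, 2018) and treats the window as fully covered, ignoring the two scraping gaps.

```python
from datetime import date

n_articles = 713_534
n_sources = 194
start, end = date(2018, 2, 1), date(2018, 11, 30)

n_days = (end - start).days + 1      # inclusive window: 303 days
per_day = n_articles / n_days        # about 2,355 articles per day, all sources combined
per_source = n_articles / n_sources  # about 3,678 articles per source over the window
per_source_per_day = per_day / n_sources  # about 12 articles per source per day
```

A figure near 3,700 therefore corresponds to articles per source over the whole window, not to articles per day.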
Connections
- Successor to NELA2017: the earlier dataset of 136K articles from 92 sources collected in 2017, which also included natural language features and Facebook engagement data.
- Predecessor to NELA-GT-2019: a follow-up release with 1.12M articles from 260 sources collected in 2019.
- Complements FakeNewsNet, which provides social context and engagement signals on a smaller set of fact-checked articles.
- Related to the LIAR dataset, which offers statement-level fact-checking; NELA-GT uses source-level labels to avoid the bottleneck of fact-checking every item.
- Shares ground-truth methodology with Baly et al. 2018: both use source-level labels as an alternative to expensive article-level annotation.
Notes
Strengths:
- Largest publicly available dataset at the time combining scale (713K articles), temporal length (10 months), and multi-dimensional source labels.
- Engagement-independent collection captures full spectrum of news producer output, not just viral content.
- Clear methodology for corroborating labels across eight assessment sites addresses reproducibility.
Limitations:
- Source-level labels assigned uniformly to all articles from a source, ignoring within-source variation (some reliable sources may publish occasional false stories; vice versa).
- Label sparsity: 40 sources receive no ground truth from any assessment site.
- NewsGuard and other assessment sites may have methodological differences and potential biases in source selection.
- Focus on English-language content limits geographic scope.
Methodological note: The shift from article-level to source-level labels is a pragmatic solution to the labeled data bottleneck but assumes homogeneity within sources that may not always hold. The paper's argument for this tradeoff is sound for large-scale ML applications where per-article labeling is infeasible.