Sampling the News Producers: A Large News and Feature Data Set for the Study of the Complex Media Landscape

Authors: Benjamin D. Horne, Sara Khedr, Sibel Adalı

Venue: ICWSM 2018 (Twelfth International AAAI Conference on Web and Social Media)

ArXiv: 1803.10124

TL;DR

Introduces NELA2017, a dataset of 1,586 news articles from 92 diverse news sources over 7 months (April–October 2017), including mainstream outlets, hyper-partisan sources, satire, and deliberately false news sites. Computes 130 content-based linguistic and engagement features on each article, enabling large-scale comparative analysis of news source behavior and characterization.

Contributions

  • Large-scale news source dataset — 1,586 articles across 92 distinct news sources, addressing gap in publicly available, diverse media landscape samples
  • Comprehensive feature engineering — extraction of 130 content-based features spanning structure (POS, clickbait detection), sentiment, emotion, engagement (Facebook stats), bias, morality, and lexical characteristics
  • Diverse source taxonomy — balanced representation across mainstream news, hyper-partisan political blogs, satire outlets, and known producers of deliberately false content
  • Foundation for NELA series — establishes methodology and source lexicon extended by NELA-GT-2018, NELA-GT-2019, and beyond

Dataset Description

Scope: 1,586 articles from 92 news sources collected April 1 – October 31, 2017.

Source diversity:

Sources were selected to span the full media landscape:

  • Established news outlets (CNN, New York Times, BBC, Reuters, etc.): ~50% of sources
  • Hyper-partisan political sources (Breitbart, The Young Turks, etc.), covering both the left and the right of the political spectrum
  • Satire outlets (The Onion, Babylon Bee, etc.)
  • Known misinformation sources: sites documented as publishers of false, misleading, or conspiracy content

Article metadata collected:

  • Full article text and title
  • Source name and URL
  • Author information (when available)
  • Publication timestamp (UTC)
  • HTML markup (for raw feature extraction)
  • Facebook engagement statistics (shares, comments, reactions)
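The metadata fields listed above could be held in a record like the following. This is a minimal sketch: the class name `ArticleRecord`, every field name, and the example values are hypothetical and are not the dataset's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ArticleRecord:
    """One collected article; field names are illustrative, not the dataset's schema."""
    source: str
    url: str
    title: str
    text: str
    published_utc: datetime
    author: Optional[str] = None   # not every article lists an author
    html: Optional[str] = None     # raw markup, kept for feature extraction
    fb_shares: int = 0
    fb_comments: int = 0
    fb_reactions: int = 0

    def total_engagement(self) -> int:
        """Sum of the three Facebook engagement counts."""
        return self.fb_shares + self.fb_comments + self.fb_reactions


# Made-up example values, for illustration only.
example = ArticleRecord(
    source="example-source",
    url="https://example.com/story",
    title="Example headline",
    text="Example body text.",
    published_utc=datetime(2017, 4, 1, 12, 0, tzinfo=timezone.utc),
    fb_shares=10,
    fb_comments=4,
    fb_reactions=25,
)
```

Keeping the author and HTML fields optional mirrors the "when available" caveat in the list. Storing the timestamp as a timezone-aware UTC value avoids ambiguity when articles from different sources are compared by date.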

Temporal distribution: Near-complete daily coverage with some collection gaps; sampling is balanced across the collection period to avoid bias toward trending topics.

Feature Engineering

130 features are computed on title and article body text. The source groups them into seven categories; five groups are summarized here:

Structure Features (14 features)

  • POS (Part-of-Speech) n-grams — normalized counts across word classes
  • Clickbait detection via fine-tuned RNN classifier
  • Linguistic diversity (Type-Token Ratio, Flesch-Kincaid readability)
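Two of the structure measures named above, Type-Token Ratio and Flesch-Kincaid grade level, can be sketched with the standard library alone. The tokenizer and the vowel-group syllable counter are simplifying assumptions and are unlikely to match the tooling used for the dataset. The grade formula is the standard Flesch-Kincaid one.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercased word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text: str) -> float:
    """Distinct words divided by total words; 0.0 for empty text."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, with a floor of one syllable."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level using the heuristic syllable counter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = tokenize(text)
    if not sentences or not tokens:
        return 0.0
    syllables = sum(count_syllables(t) for t in tokens)
    return 0.39 * len(tokens) / len(sentences) + 11.8 * syllables / len(tokens) - 15.59


sample = "The cat sat on the mat. The cat ran."
ttr = type_token_ratio(sample)
grade = flesch_kincaid_grade(sample)
```

A lower grade means easier text. Very short, simple samples like the one above can produce negative grades, which is expected behavior for the formula rather than a bug.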

Sentiment Features (3 features)

  • Negative, positive, neutral sentiment scores via lexicon-based methods (TextBlob)
  • Computed separately for title and body

Emotion Features (13 features)

  • Categorical emotion detection (joy, sadness, anger, fear, surprise) via lexicons (EmoLex, LIWC)
  • Strong/weak emotion indicators
  • Distinct emotion word counts
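Lexicon-based emotion features of the kind listed above usually reduce to counting dictionary hits. The sketch below uses a tiny made-up lexicon (`TOY_EMOTION_LEXICON`) as a stand-in. It does not contain the EmoLex or LIWC word lists, and the feature names are hypothetical.

```python
import re
from collections import Counter

# Toy lexicon for illustration only; a real run would load an actual
# emotion lexicon instead of these hand-picked words.
TOY_EMOTION_LEXICON = {
    "joy": {"happy", "delighted", "celebrate"},
    "anger": {"furious", "outrage", "attack"},
    "fear": {"threat", "panic", "afraid"},
}


def emotion_features(text: str) -> dict[str, float]:
    """Per-category hit rate plus the number of distinct emotion words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty text
    hits = Counter()
    distinct = set()
    for token in tokens:
        for category, words in TOY_EMOTION_LEXICON.items():
            if token in words:
                hits[category] += 1
                distinct.add(token)
    features = {f"{c}_rate": hits[c] / total for c in TOY_EMOTION_LEXICON}
    features["distinct_emotion_words"] = float(len(distinct))
    return features


title_feats = emotion_features("Outrage after attack sparks panic")
```

Normalizing by token count keeps short titles and long bodies on a comparable scale. The distinct-word count is kept raw because it measures vocabulary rather than density.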

Engagement Features (3 features)

  • Facebook shares, comments, reactions (aggregate from Facebook API per article)

Complexity Features (7 features)

  • Bias lexicons (LIWC, training data from Recasens et al.)
  • Morality lexicons (Moral Foundation Theory)
  • Lexical redundancy and diversity (character/word statistics)

All features are computed on the title and the body text separately and then combined, yielding ~260 total dimensions per article.
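One plausible way to combine title and body features is to prefix each feature name with its origin and fix a column order. The function names, the prefixes, and the example values below are assumptions for illustration; the source does not describe the actual layout.

```python
def combine_features(title_feats: dict[str, float],
                     body_feats: dict[str, float]) -> dict[str, float]:
    """Prefix each feature name with its origin and merge into one mapping."""
    combined = {f"title_{name}": value for name, value in title_feats.items()}
    combined.update({f"body_{name}": value for name, value in body_feats.items()})
    return combined


def to_vector(features: dict[str, float], columns: list[str]) -> list[float]:
    """Fixed column order so every article yields a vector of the same length."""
    return [features.get(name, 0.0) for name in columns]


# Made-up feature values for a single article.
title = {"ttr": 1.0, "word_count": 5.0}
body = {"ttr": 0.62, "word_count": 480.0}
combined = combine_features(title, body)
columns = sorted(combined)
vector = to_vector(combined, columns)
```

With this scheme, each feature computed on both text fields contributes two columns, which is how a per-field feature count doubles in the combined representation.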

Descriptive Results

Source characteristics:

Ranking sources by average subjectivity (writing style), clickbait prevalence, sentiment, and readability reveals:

  • Hyper-partisan and satire sources score higher on subjectivity
  • Satire outlets show the highest use of clickbait titles
  • Tabloid and lower-credibility sources lean more negative in sentiment
  • Established news outlets maintain higher readability (lower Flesch-Kincaid grade)
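A ranking of this kind amounts to averaging a per-article score within each source and sorting the averages. The sketch below assumes hypothetical source names and made-up subjectivity scores. It illustrates the aggregation only and does not reproduce any reported result.

```python
from collections import defaultdict
from statistics import mean


def rank_sources(rows: list[tuple[str, float]],
                 descending: bool = True) -> list[tuple[str, float]]:
    """Average a per-article score by source, then sort the sources."""
    by_source = defaultdict(list)
    for source, score in rows:
        by_source[source].append(score)
    averages = [(source, mean(scores)) for source, scores in by_source.items()]
    return sorted(averages, key=lambda pair: pair[1], reverse=descending)


# Made-up (source, subjectivity) pairs, one per article.
rows = [
    ("source_a", 0.25), ("source_a", 0.75),
    ("source_b", 0.5), ("source_b", 1.0),
    ("source_c", 0.25),
]
ranking = rank_sources(rows)
```

Passing `descending=False` gives the reverse ordering, which suits features such as Flesch-Kincaid grade where a lower value means more readable text.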

Feature correlations:

Positive correlations observed between:

  • Negativity and clickbait titles
  • Bias lexicon scores and hyper-partisan source classification
  • Engagement (Facebook shares) and subjective writing style
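Pairwise relationships like these can be checked with a correlation coefficient over per-article feature values. The sketch below uses Pearson correlation as an assumption, since the summary does not say which coefficient was used, and the input values are made up.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Made-up per-article values for two features.
negativity = [0.1, 0.2, 0.3, 0.4]
clickbait = [0.2, 0.4, 0.6, 0.8]
r = pearson(negativity, clickbait)
```

Engagement counts are typically heavy-tailed, so a rank-based coefficient such as Spearman's may be a more robust choice for the Facebook features.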

No single feature strongly discriminates between source types; classifiers require a combination of features.

Use Cases

  • News source characterization — quantify stylistic, emotional, and bias differences across sources
  • Engagement prediction — predict Facebook engagement from linguistic features
  • Fake news detection — features as input to ML classifiers distinguishing misinformation sources
  • Media landscape comparison — systematic comparison of writing styles, biases, and narratives across left/right political spectrum
  • Temporal analysis — track feature evolution as news producers respond to events and audience reactions

Design Decisions & Novelty

Feature-rich representation — Unlike prior fake news datasets focused on binary labels or social context, NELA2017 emphasizes content-level linguistic and structural features, enabling direct analysis of how misinformation differs from reliable reporting.

Source-level diversity — Carefully curated lexicon of 92 sources spanning political spectrum, credibility levels, and content types; avoids engagement bias (e.g., focusing only on viral content).

Temporal real-world data — Collection from actual publication timelines rather than curated lists or experiments, capturing authentic editorial decisions and audience interaction.

Connections

  • Succeeded by NELA-GT-2018 — expands to 713K articles from 194 sources with multi-source ground truth labels
  • Succeeded by NELA-GT-2022 — further expansion to 1.78M articles with refined collection methodology
  • Related to FakeNewsNet — provides social propagation context on subset of fact-checked articles; NELA2017 focuses on source characterization at scale
  • Related to LIAR dataset — statement-level fact-checking; NELA2017 operates at article/source level with linguistic features
  • Related to Potthast et al. 2017 on hyperpartisan news detection, which draws on stylometric features from a similar linguistic tradition

Notes

Strengths:

  • Large feature set enables diverse downstream tasks and cross-source analysis
  • Transparent methodology for source selection and feature extraction
  • Foundation for subsequent releases (GT-2018, GT-2019, GT-2022) shows long-term utility
  • Balanced representation across credibility and partisan spectrum

Limitations:

  • Relatively small dataset (1,586 articles) compared to modern standards; later NELA versions address this
  • No per-article ground truth labels (veracity, fact-check status); relies on source-level classification
  • Feature engineering assumes English text; not multilingual
  • Facebook engagement data may reflect platform-specific virality patterns rather than content quality
  • Collection period (April–Oct 2017) captures specific historical moment; generalization to other time periods unclear

Methodological note: The shift from article-level fact-checking to source-level characterization is pragmatic — establishing reliable ground truth at scale is expensive. The linguistic feature approach trades explicit labels for rich content analysis that reveals how sources differ, rather than just whether they're correct.