Fake And Real News Dataset

Access: GitHub — GeorgeMcIntire/fake_real_news_dataset

Description

A binary fake/real news classification dataset assembled from two sources. The fake news component derives from a Kaggle dataset of 13,000 articles labeled "fake news" released during the 2016 US election cycle. The real news component was gathered via All Sides, a platform that aggregates news and opinion from across the political spectrum and labels articles by topic and political leaning. Through it, 5,279 real news articles published in 2015 or 2016 were scraped from prominent outlets including the New York Times, WSJ, Bloomberg, NPR, and the Guardian.

The dataset was constructed to be class-balanced, so null accuracy (the accuracy of always predicting the most frequent class) is 50%.
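As a minimal sketch of what the null-accuracy figure means, the snippet below computes it from label counts. The function name `null_accuracy` is an illustrative choice, not part of the dataset's tooling; the counts are the ones reported under Statistics.

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy of a baseline that always predicts the most frequent label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Whole dataset, per the Statistics table: 2,297 fake and 2,297 real.
all_labels = ["Fake"] * 2297 + ["Real"] * 2297
print(f"overall null accuracy: {null_accuracy(all_labels):.3f}")

# Test split, per the Statistics table: 551 fake and 597 real.
test_labels = ["Fake"] * 551 + ["Real"] * 597
print(f"test null accuracy: {null_accuracy(test_labels):.3f}")
```

Individual splits are only approximately balanced, so the per-split baseline can sit slightly above 50% (about 52% on the test split).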

Statistics

Split        Fake    Real    Total
Train        1,154   1,143   2,297
Validation     592     557   1,149
Test           551     597   1,148
Total        2,297   2,297   4,594

Train / Validation / Test split ratio: 50% / 25% / 25%.
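The procedure used to produce the original split is not described here, so the following is only a sketch of one way to obtain a 50% / 25% / 25% partition: a seeded shuffle followed by two cuts. The helper `split_50_25_25` and the seed value are assumptions for illustration, and the split is not stratified by label.

```python
import random


def split_50_25_25(items, seed=0):
    """Shuffle items and cut them into train / validation / test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n_train = len(shuffled) // 2
    rest = len(shuffled) - n_train
    n_val = (rest + 1) // 2  # validation takes the extra item when rest is odd

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# 4,594 is the sum of the three split totals (2,297 + 1,149 + 1,148).
train, val, test = split_50_25_25(range(4594))
print(len(train), len(val), len(test))
```

With 4,594 items this yields parts of size 2,297, 1,149, and 1,148, matching the split sizes in the table, although the membership of each part will differ from the published split.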

Schema

Field    Description
label    News label: Fake or Real
text     Full article body
title    Article headline
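A minimal sketch of reading records with this schema, assuming the data is available as a CSV file with a header row naming the three fields. The two-row sample and the `read_articles` helper are hypothetical and exist only to make the example self-contained; the exact file layout in the repository should be checked before relying on this.

```python
import csv
import io

# Hypothetical two-row sample that follows the schema above.
SAMPLE = """title,text,label
"Example headline one","Body of the first article.",Fake
"Example headline two","Body of the second article.",Real
"""


def read_articles(handle):
    """Parse rows with title/text/label columns into plain dictionaries."""
    articles = []
    for row in csv.DictReader(handle):
        # Compare labels case-insensitively in case the file capitalizes them differently.
        label = row["label"].strip().lower()
        if label not in {"fake", "real"}:
            raise ValueError(f"unexpected label: {row['label']!r}")
        articles.append(
            {"title": row["title"], "text": row["text"], "is_fake": label == "fake"}
        )
    return articles


articles = read_articles(io.StringIO(SAMPLE))
print(articles[0]["title"], articles[0]["is_fake"])
```

To read from disk instead, pass a file handle opened with `newline=""` (as the `csv` module recommends) in place of the `io.StringIO` object.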

Intended use

  • Binary fake/real news classification
  • Benchmarking detection methods under class-balanced conditions
  • English-language news article analysis

Limitations

The fake news articles come from a single Kaggle release tied to the 2016 US election cycle, which may limit temporal and topical diversity. The real news portion is labeled at the publisher level rather than per article. The dataset covers only English-language articles.

Connections

  • ReCOVery is a structurally similar binary-label dataset, but it focuses on COVID-19 news and uses publisher-level NewsGuard/MBFC credibility scores rather than election-era Kaggle fake labels.
  • Cao et al. (2025) — SLIM uses this dataset as a primary benchmark for limited-information fake news detection, alongside ReCOVery.