Fake And Real News Dataset

Access: GitHub — GeorgeMcIntire/fake_real_news_dataset

Description

A binary fake/real news classification dataset assembled from two sources. The fake news component derives from a Kaggle dataset of 13,000 articles labeled "fake news" and released during the 2016 US election cycle. The real news component was gathered from AllSides, a platform that aggregates news and opinion from across the political spectrum and labels articles by topic and political leaning. From it, 5,279 real news articles published in 2015 or 2016 were scraped from prominent outlets, including the New York Times, WSJ, Bloomberg, NPR, and the Guardian.

The dataset was constructed to be approximately class-balanced, so the null accuracy (the accuracy of always predicting the most frequent class) is about 50%.
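The null accuracy figure can be checked with a few lines of Python. This is a minimal sketch: the class counts are copied from the Statistics table in this card, and the helper name `null_accuracy` is a hypothetical name chosen for illustration, not part of the dataset repository.

```python
from collections import Counter


def null_accuracy(labels):
    """Return the accuracy of always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Class counts copied from the Statistics table (Total row and Test row).
all_labels = ["Fake"] * 2297 + ["Real"] * 2297
test_labels = ["Fake"] * 551 + ["Real"] * 597

overall = null_accuracy(all_labels)
test_only = null_accuracy(test_labels)
print(f"overall: {overall:.3f}, test split: {test_only:.3f}")
```

Individual splits are not perfectly balanced, so a per-split null accuracy can sit slightly above 50%. It is worth comparing a classifier against the baseline of the split it is evaluated on.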

Statistics

| Split | Fake | Real | Total |
|---|---|---|---|
| Train | 1,154 | 1,143 | 2,297 |
| Validation | 592 | 557 | 1,149 |
| Test | 551 | 597 | 1,148 |
| Total | 2,297 | 2,297 | 4,594 |

Train / Validation / Test split ratio: 50% / 25% / 25%.
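The stated ratio can be reproduced from the per-split totals. The sketch below only does the arithmetic on the three split sizes listed in the table above; the variable names are illustrative.

```python
# Per-split article counts copied from the Statistics table.
split_sizes = {"Train": 2297, "Validation": 1149, "Test": 1148}

total = sum(split_sizes.values())
ratios = {name: size / total for name, size in split_sizes.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%}")
```

The validation and test splits differ by one article, so their shares are 25% only after rounding.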

Schema

| Field | Description |
|---|---|
| `label` | News label: `Fake` or `Real` |
| `text` | Full article body |
| `title` | Article headline |
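A reader for this schema might look like the following sketch. It assumes the data is distributed as a CSV file whose header uses the field names above. That layout is an assumption, not something this card states. The function name `read_articles` and the two sample rows are hypothetical placeholders, not real articles.

```python
import csv
import io

EXPECTED_FIELDS = {"label", "text", "title"}
VALID_LABELS = {"Fake", "Real"}


def read_articles(handle):
    """Parse CSV rows and check them against the schema in this card."""
    reader = csv.DictReader(handle)
    missing = EXPECTED_FIELDS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        if row["label"] not in VALID_LABELS:
            raise ValueError(f"unexpected label: {row['label']!r}")
        # Keep only the documented fields and drop any extra columns.
        rows.append({field: row[field] for field in EXPECTED_FIELDS})
    return rows


# Hypothetical two-row sample standing in for the real file.
sample = io.StringIO(
    "title,text,label\n"
    "Example headline A,Example body A,Fake\n"
    "Example headline B,Example body B,Real\n"
)
articles = read_articles(sample)
print(len(articles), articles[0]["label"])
```

For the real file, pass an open file handle instead of the in-memory sample. If the label values in the file are cased differently, adjust `VALID_LABELS` to match.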

Intended use

Binary fake/real news classification
Benchmarking detection methods under class-balanced conditions
English-language news article analysis

Limitations

The fake news articles come from a single Kaggle release tied to the 2016 US election cycle, which may limit temporal and topical diversity. Labels for the real news portion are assigned at the publisher level rather than per article. The dataset covers only English-language articles.

Connections

ReCOVery is a structurally similar binary-label dataset, but it focuses on COVID-19 news and uses publisher-level NewsGuard/MBFC credibility scores rather than election-era Kaggle fake labels.
Cao et al. (2025) — SLIM uses this dataset as a primary benchmark for limited-information fake news detection, alongside ReCOVery.