Fake And Real News Dataset
Access: GitHub — GeorgeMcIntire/fake_real_news_dataset
Description
A binary fake/real news classification dataset assembled from two sources. The fake news component derives from a Kaggle dataset of 13,000 articles labeled "fake news" that was released during the 2016 US election cycle. The real news component was gathered from All Sides, a platform that aggregates news and opinion from across the political spectrum and labels articles by topic and political leaning. For this component, 5,279 real news articles published in 2015 or 2016 were scraped from prominent outlets including the New York Times, WSJ, Bloomberg, NPR, and the Guardian.
The dataset was constructed to be class-balanced, so null accuracy (the accuracy of always predicting the majority class) is about 50%.
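Null accuracy can be checked directly from label counts. The following is a minimal sketch, not code from the dataset repository: the function name `null_accuracy` is hypothetical, and the label lists are synthesized from the counts reported in the Statistics table rather than loaded from the actual files.

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / sum(counts.values())


# Label lists rebuilt from the reported counts (assumption: no real data is read).
full = ["Fake"] * 2297 + ["Real"] * 2297
train = ["Fake"] * 1154 + ["Real"] * 1143

print(null_accuracy(full))             # 0.5
print(round(null_accuracy(train), 4))  # 0.5024
```

The individual splits are only approximately balanced, so their null accuracy sits slightly above 50%.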
Statistics
| Split | Fake | Real | Total |
|---|---|---|---|
| Train | 1,154 | 1,143 | 2,297 |
| Validation | 592 | 557 | 1,149 |
| Test | 551 | 597 | 1,148 |
| Total | 2,297 | 2,297 | 4,594 |
Train / Validation / Test split ratio: 50% / 25% / 25%.
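A 50% / 25% / 25% partition with the same per-split sizes as the table (2,297 / 1,149 / 1,148) can be sketched as below. This is an illustration only: the helper `split_indices`, the seed, and the rounding rule are assumptions, and the sketch does not reproduce the actual article-to-split assignment used for the published splits.

```python
import math
import random


def split_indices(n, train_frac=0.5, val_frac=0.25, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for repeatability
    n_train = int(n * train_frac)
    n_val = math.ceil(n * val_frac)       # validation takes the odd item
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # test gets the remainder
    return train, val, test


# 4,594 is the sum of the per-split totals in the table.
train, val, test = split_indices(2297 + 1149 + 1148)
print(len(train), len(val), len(test))  # 2297 1149 1148
```

A plain shuffle like this does not stratify by label, which is consistent with the slightly uneven fake/real counts within each split.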
Schema
| Field | Description |
|---|---|
| label | News label: Fake or Real |
| text | Full article body |
| title | Article headline |
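Reading records with this schema might look like the sketch below. It assumes the data is distributed as a CSV file with a header row containing `title`, `text`, and `label`; the file layout, the sample rows, the helper `read_articles`, and the derived `is_fake` field are all hypothetical and are not taken from the repository. Labels are compared case-insensitively because the exact casing in the files is not stated here.

```python
import csv
import io

# Hypothetical two-row sample standing in for the real file.
SAMPLE_CSV = """title,text,label
"Hypothetical headline A","Hypothetical article body A.",Fake
"Hypothetical headline B","Hypothetical article body B.",Real
"""

REQUIRED_FIELDS = {"label", "text", "title"}


def read_articles(handle):
    """Parse CSV rows into dicts, validating columns and label values."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_FIELDS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    articles = []
    for row in reader:
        label = row["label"].strip().lower()
        if label not in {"fake", "real"}:
            raise ValueError(f"unexpected label: {row['label']!r}")
        articles.append(
            {"title": row["title"], "text": row["text"], "is_fake": label == "fake"}
        )
    return articles


articles = read_articles(io.StringIO(SAMPLE_CSV))
print(len(articles), articles[0]["is_fake"], articles[1]["is_fake"])
```

To read from disk instead, the same function could be passed a file opened with `open(path, newline="", encoding="utf-8")`, where `path` points at a local copy of the data.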
Intended use
- Binary fake/real news classification
- Benchmarking detection methods under class-balanced conditions
- English-language news article analysis
Limitations
The fake news articles come from a single Kaggle release tied to the 2016 US election cycle, which may limit temporal and topical diversity. Labels for the real news portion are assigned at the publisher level rather than per article. The dataset covers only English-language articles.
Connections
- ReCOVery is a structurally similar binary-label dataset, but it focuses on COVID-19 news and relies on publisher-level NewsGuard/MBFC credibility scores rather than election-era Kaggle fake labels.
- Cao et al. (2025) — SLIM uses this dataset as a primary benchmark for limited-information fake news detection, alongside ReCOVery.