Fakeddit¶

Fakeddit is a large-scale multimodal fake news detection benchmark with over 1 million Reddit submissions, each labeled for 2-way (fake/true), 3-way (true, misleading with true text, false), and 6-way (true, satire/parody, misleading content, imposter content, false connection, manipulated content) classification.

Key Statistics¶

Total samples: 1,063,106
Fake samples: 628,501
True samples: 527,049
Multimodal samples (text+image): 682,996 (64%)
Subreddits: 22
Unique users: 358,504
Unique domains: 24,203
Vocabulary size: 175,566
Timespan: March 19, 2008 – October 24, 2019

Partitions¶

Set	Count
Training	878,218
Validation	92,444
Released test	92,444
Unreleased test	92,444
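
The counts on this page can be cross-checked with simple arithmetic. The training, validation, and released test sets sum to the 1,063,106 total, while all four partitions together sum to 1,155,550, which equals the fake plus true counts. The sketch below only reproduces that arithmetic from the figures listed here; the variable names are illustrative.

```python
# Counts copied from the Key Statistics and Partitions sections of this page.
training = 878_218
validation = 92_444
released_test = 92_444
unreleased_test = 92_444

listed_total = 1_063_106
fake_samples = 628_501
true_samples = 527_049
multimodal_samples = 682_996

# Sum of the three released partitions versus all four partitions.
released_total = training + validation + released_test
all_partitions = released_total + unreleased_test

print(released_total == listed_total)                # True
print(all_partitions == fake_samples + true_samples) # True

# Share of text+image samples relative to the listed total, in percent.
multimodal_share = round(100 * multimodal_samples / listed_total)
print(multimodal_share)                              # 64
```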

Labeling Scheme¶

2-Way Classification¶

Fake: Submission comes from a satire/parody, misleading content, imposter content, false connection, or manipulated content source
True: Submission comes from a true-content subreddit

3-Way Classification¶

True: Accurate content
Misleading with true text: Fake images/context with true textual claims (e.g., propaganda posters)
False: Fully false content (text and image)

6-Way Classification (Fine-grained)¶

True: Accurate text and images
Satire/Parody: Satirical spin on true content (e.g., The Onion)
Misleading Content: Intentionally manipulated to deceive
Imposter Content: Bot-generated content mimicking other subreddits
False Connection: Images that don't support their captions
Manipulated Content: Manually edited/doctored images (Photoshop)

Data Sources (22 Subreddits)¶

True content (8): photoshopbattles (submissions), nottheonion, neutralnews, pic, usanews, upliftingnews, mildlyinteresting, usnews

Satire/Parody (4): fakealbumcovers, satire, waterfordwhispersnews, theonion

Misleading Content (3): propagandaposters, fakefacts, savedyouaclick

False Connection (4): misleadingthumbnails, confusing_perspective, pareidolia, fakehistoryporn

Imposter Content (2): subredditsimulator, subsimulatorgpt2

Manipulated Content (1): photoshopbattles (comments)
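
Because labels follow from the source subreddit, the mapping above can be sketched as a lookup table. This is a minimal illustration, not the dataset's release code: the function name `distant_labels`, the `is_comment` flag, and the label strings are assumptions, as is the rule that photoshopbattles submissions count as true while photoshopbattles comments count as manipulated content.

```python
from typing import Dict, List, Tuple

# Subreddit groups copied from the list above. photoshopbattles is handled
# separately because its label depends on submission versus comment.
SUBREDDITS_BY_CATEGORY: Dict[str, List[str]] = {
    "true": ["nottheonion", "neutralnews", "pic", "usanews",
             "upliftingnews", "mildlyinteresting", "usnews"],
    "satire/parody": ["fakealbumcovers", "satire",
                      "waterfordwhispersnews", "theonion"],
    "misleading content": ["propagandaposters", "fakefacts", "savedyouaclick"],
    "false connection": ["misleadingthumbnails", "confusing_perspective",
                         "pareidolia", "fakehistoryporn"],
    "imposter content": ["subredditsimulator", "subsimulatorgpt2"],
}

# Invert the grouping into a subreddit -> 6-way label lookup.
CATEGORY_BY_SUBREDDIT: Dict[str, str] = {
    name: category
    for category, names in SUBREDDITS_BY_CATEGORY.items()
    for name in names
}


def distant_labels(subreddit: str, is_comment: bool = False) -> Tuple[str, str]:
    """Return (six_way_label, two_way_label) based only on the source."""
    name = subreddit.lower()
    if name == "photoshopbattles":
        # Assumption: comments hold the edited images, submissions the originals.
        six_way = "manipulated content" if is_comment else "true"
    else:
        six_way = CATEGORY_BY_SUBREDDIT[name]  # KeyError for unknown subreddits
    two_way = "true" if six_way == "true" else "fake"
    return six_way, two_way


print(distant_labels("theonion"))                           # ('satire/parody', 'fake')
print(distant_labels("photoshopbattles", is_comment=True))  # ('manipulated content', 'fake')
```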

Quality Assurance¶

Subreddit moderation: First-pass filtering by community moderators
Score threshold: Only submissions with score ≥ 1 retained (filters downvoted, off-topic content)
Manual validation: 10 random posts per subreddit manually checked to confirm thematic consistency
Text cleaning: Punctuation, numbers, and subreddit-revealing keywords removed; lowercase conversion
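
The score threshold and text cleaning steps above could look roughly like the sketch below. It is an approximation under stated assumptions: the helper names, the order of the cleaning steps, and the example keyword `psbattle` are hypothetical, and only ASCII punctuation is stripped.

```python
import re
import string
from typing import Iterable


def keep_submission(score: int, min_score: int = 1) -> bool:
    """Keep only submissions whose score is at least min_score."""
    return score >= min_score


def clean_title(title: str, banned_keywords: Iterable[str] = ()) -> str:
    """Lowercase a title and drop keywords, punctuation, and numbers."""
    text = title.lower()
    # Remove whole-word occurrences of subreddit-revealing keywords
    # (the keyword list is supplied by the caller; none is built in).
    for keyword in banned_keywords:
        text = re.sub(rf"\b{re.escape(keyword.lower())}\b", " ", text)
    # Strip ASCII punctuation, then digit runs.
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(clean_title("PsBattle: A cat wearing 2 hats!", banned_keywords=("psbattle",)))
# a cat wearing hats
```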

Inter-annotator agreement: Cohen's Kappa = 0.54 on 150 manually labeled text-image pairs for 6-way classification, indicating moderate agreement and genuine ambiguity in category boundaries.
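
For reference, Cohen's Kappa compares the observed agreement between two label sequences with the agreement expected by chance. The sketch below is a generic implementation on toy labels, not the procedure or data behind the 0.54 figure; the function name and the example labels are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; Kappa is
        # undefined in that case and is treated as 1.0 here.
        return 1.0
    return (observed - expected) / (1.0 - expected)


rater_a = ["true", "true", "satire", "satire"]
rater_b = ["true", "satire", "satire", "satire"]
print(cohens_kappa(rater_a, rater_b))  # 0.5
```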

Modalities¶

Text: Submission titles (mean 8.27 words, up to ~100 words)
Image: Submission thumbnails plus images from photoshopbattles comments
Metadata: Score, author, subreddit, domain, upvote/downvote counts
Comments: User engagement data (mean 17.94 comments per submission)

Baseline Results¶

6-way classification (best performing model: BERT + ResNet50 with maximum fusion):

Validation accuracy: 86.00%
Test accuracy: 85.88%
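
As a rough illustration of maximum fusion, the sketch below assumes the term means an element-wise maximum over text and image feature vectors that have already been projected to the same length. The function name and the toy vectors are hypothetical, and the real baseline operates on BERT and ResNet50 features rather than hand-written lists.

```python
from typing import List, Sequence


def max_fusion(text_features: Sequence[float],
               image_features: Sequence[float]) -> List[float]:
    """Fuse two equal-length feature vectors by element-wise maximum."""
    if len(text_features) != len(image_features):
        raise ValueError("feature vectors must have the same length")
    return [max(t, i) for t, i in zip(text_features, image_features)]


# Toy vectors standing in for projected text and image embeddings.
fused = max_fusion([0.1, 0.9, -0.3], [0.4, 0.2, -0.5])
print(fused)  # [0.4, 0.9, -0.3]
```

The fused vector keeps the input dimensionality, unlike concatenation, which doubles it.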

Comparison with single modalities (6-way):

BERT text-only: 76.96%
ResNet50 image-only: 75.29%
Multimodal advantage: roughly 9 percentage points over text-only and 11 over image-only

Per-category accuracy (6-way, ResNet50 image features):

Manipulated: ~97% (easiest)
True: ~85%
Misleading: ~81%
False Connection: ~78%
Satire: ~67%
Imposter: ~52% (hardest)

Papers in this Wiki that Use Fakeddit¶

Nakamura et al. (2019) — r/Fakeddit Dataset Paper

Notes¶

The dataset relies on distant supervision via subreddit labels; individual samples are not manually annotated.
This enables scale, but the moderate inter-annotator agreement (Cohen's Kappa = 0.54) indicates genuine ambiguity: some samples legitimately belong to multiple categories (e.g., a satirical false connection).
Satire and imposter content remain challenging even for multimodal models, suggesting these categories require deeper contextual understanding.
Class imbalance in the 6-way setting (true samples are overrepresented) biases predictions toward the true label; researchers should account for this in evaluation.
The dataset includes metadata and user comments that the baseline experiments do not use, offering potential for future social-context-based detection research.
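
One common way to account for class imbalance in evaluation is to report per-class recall and its unweighted mean (often called balanced accuracy) alongside plain accuracy. The sketch below is a generic illustration on toy labels, not the evaluation used in the baseline experiments; the function names are illustrative.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def per_class_recall(y_true: Sequence[Hashable],
                     y_pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


def balanced_accuracy(y_true: Sequence[Hashable],
                      y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class recall."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy example: a classifier that always predicts the majority class.
y_true = ["true"] * 8 + ["satire"] * 2
y_pred = ["true"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

In the toy example, plain accuracy rewards always predicting the majority label, while the balanced score exposes that the minority class is never recovered.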