Fakeddit¶
Fakeddit is a large-scale multimodal fake news detection benchmark of over 1 million Reddit submissions, labeled for 2-way (fake, true), 3-way (true, misleading with true text, false), and 6-way (true, satire/parody, misleading content, imposter content, false connection, manipulated content) classification.
Key Statistics¶
- Total samples: 1,063,106
- Fake samples: 628,501
- True samples: 527,049
- Multimodal samples (text+image): 682,996 (64%)
- Subreddits: 22
- Unique users: 358,504
- Unique domains: 24,203
- Vocabulary size: 175,566
- Timespan: March 19, 2008 – October 24, 2019
Partitions¶
| Set | Count |
|---|---|
| Training | 878,218 |
| Validation | 92,444 |
| Released test | 92,444 |
| Unreleased test | 92,444 |
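The counts above can be cross-checked with simple arithmetic. The sketch below uses only figures listed on this page; the variable names are illustrative. It shows that the reported total equals the training, validation, and released test partitions combined, while the fake and true counts together equal all four partitions. One reading, not confirmed by this page, is that the reported total excludes the unreleased test set.

```python
# Figures copied from the Key Statistics list and the Partitions table.
partitions = {
    "training": 878_218,
    "validation": 92_444,
    "released_test": 92_444,
    "unreleased_test": 92_444,
}
total_samples = 1_063_106
fake_samples = 628_501
true_samples = 527_049
multimodal_samples = 682_996

# Sum of the partitions that are publicly released.
released_sum = sum(
    count for name, count in partitions.items() if name != "unreleased_test"
)
# Sum of all four partitions, including the unreleased test set.
all_sum = sum(partitions.values())

print(released_sum == total_samples)                        # True
print(all_sum == fake_samples + true_samples)               # True
print(round(100 * multimodal_samples / total_samples, 1))   # 64.2
```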
Labeling Scheme¶
2-Way Classification¶
- Fake: the submission comes from a satire/parody, misleading content, imposter content, false connection, or manipulated content source
- True: the submission comes from a true-content subreddit
3-Way Classification¶
- True: Accurate content
- Misleading with true text: Fake images/context with true textual claims (e.g., propaganda posters)
- False: Fully false content (text and image)
6-Way Classification (Fine-grained)¶
- True: Accurate text and images
- Satire/Parody: Satirical spin on true content (e.g., The Onion)
- Misleading Content: Intentionally manipulated to deceive
- Imposter Content: Bot-generated content mimicking other subreddits
- False Connection: Images that don't support their captions
- Manipulated Content: Manually edited/doctored images (Photoshop)
Data Sources (22 Subreddits)¶
True content (8): photoshopbattles (submissions), nottheonion, neutralnews, pic, usanews, upliftingnews, mildlyinteresting, usnews
Satire/Parody (4): fakealbumcovers, satire, waterfordwhispersnews, theonion
Misleading Content (3): propagandaposters, fakefacts, savedyouaclick
False Connection (4): misleadingthumbnails, confusing_perspective, pareidolia, fakehistoryporn
Imposter Content (2): subredditsimulator, subsimulatorgpt2
Manipulated Content (1): photoshopbattles (comments)
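The subreddit groups above amount to a lookup table from source to label. A minimal sketch of that distant-supervision mapping follows, assuming labels are derived purely from the subreddit name plus a flag for photoshopbattles comments. The function names and the `from_comment` parameter are hypothetical and are not part of the released dataset's tooling.

```python
# Subreddit groups as listed on this page.
SOURCES = {
    "true": [
        "photoshopbattles", "nottheonion", "neutralnews", "pic",
        "usanews", "upliftingnews", "mildlyinteresting", "usnews",
    ],
    "satire/parody": [
        "fakealbumcovers", "satire", "waterfordwhispersnews", "theonion",
    ],
    "misleading content": ["propagandaposters", "fakefacts", "savedyouaclick"],
    "false connection": [
        "misleadingthumbnails", "confusing_perspective",
        "pareidolia", "fakehistoryporn",
    ],
    "imposter content": ["subredditsimulator", "subsimulatorgpt2"],
}

# Invert the groups into a subreddit -> label lookup.
SUBREDDIT_TO_SIX_WAY = {
    name: label for label, names in SOURCES.items() for name in names
}


def six_way_label(subreddit: str, from_comment: bool = False) -> str:
    """Return the 6-way label implied by the source of a sample."""
    name = subreddit.lower()
    # photoshopbattles is listed twice: submissions count as true,
    # while images posted in comments count as manipulated content.
    if name == "photoshopbattles" and from_comment:
        return "manipulated content"
    return SUBREDDIT_TO_SIX_WAY[name]


def two_way_label(six_way: str) -> str:
    """Collapse a 6-way label into fake/true."""
    return "true" if six_way == "true" else "fake"


print(six_way_label("theonion"))                                       # satire/parody
print(two_way_label(six_way_label("photoshopbattles", from_comment=True)))  # fake
```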
Quality Assurance¶
- Subreddit moderation: First-pass filtering by community moderators
- Score threshold: Only submissions with score ≥ 1 retained (filters downvoted, off-topic content)
- Manual validation: 10 random posts per subreddit manually checked to confirm thematic consistency
- Text cleaning: Punctuation, numbers, and subreddit-revealing keywords removed; lowercase conversion
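The score threshold and text-cleaning steps can be sketched as follows. This is an illustration under stated assumptions, not the dataset authors' pipeline: the page does not list the subreddit-revealing keywords, so the `keywords` argument and the `psbattle` example are hypothetical, and only ASCII punctuation is handled.

```python
import re
import string

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def keep_submission(score: int) -> bool:
    """Apply the score threshold: retain submissions with score >= 1."""
    return score >= 1


def clean_title(title: str, keywords: frozenset = frozenset()) -> str:
    """Lowercase a title and drop numbers, punctuation, and given keywords."""
    text = title.lower()
    text = re.sub(r"\d+", " ", text)          # remove numbers
    text = text.translate(_PUNCT_TO_SPACE)    # remove punctuation
    tokens = [tok for tok in text.split() if tok not in keywords]
    return " ".join(tokens)


# Hypothetical keyword set, for illustration only.
example_keywords = frozenset({"psbattle"})
print(clean_title("PsBattle: A cat wearing 2 hats!", example_keywords))
# a cat wearing hats
```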
Inter-annotator agreement: Cohen's Kappa = 0.54 on 150 manually labeled text-image pairs for 6-way classification, indicating moderate agreement and genuine ambiguity in category boundaries.
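For reference, Cohen's Kappa compares observed agreement between two label sets with the agreement expected by chance. A minimal pure-Python sketch is shown below; the toy labels are made up, and this page does not specify which two label sets were compared to obtain 0.54.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy example: 3 of 4 items agree, chance agreement is 0.5.
rater_a = ["true", "true", "satire", "satire"]
rater_b = ["true", "satire", "satire", "satire"]
print(cohens_kappa(rater_a, rater_b))  # 0.5
```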
Modalities¶
- Text: Submission titles (mean 8.27 words, up to ~100 words)
- Image: Submission thumbnails, plus images from photoshopbattles comments
- Metadata: Score, author, subreddit, domain, upvote/downvote counts
- Comments: User engagement data (mean 17.94 comments per submission)
Baseline Results¶
6-way classification (best-performing model: BERT + ResNet50 with maximum fusion):
- Validation accuracy: 86.00%
- Test accuracy: 85.88%
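"Maximum fusion" is commonly understood as an element-wise maximum over text and image feature vectors of the same length. The sketch below illustrates that operation under this assumption; this page does not describe the fusion layer in detail, the encoders themselves are omitted, and the vectors are made-up values.

```python
def max_fusion(text_features, image_features):
    """Fuse two equal-length feature vectors by element-wise maximum."""
    if len(text_features) != len(image_features):
        raise ValueError("feature vectors must have the same length")
    return [max(t, i) for t, i in zip(text_features, image_features)]


# Made-up features standing in for projected BERT and ResNet50 outputs.
text_vec = [0.2, -1.0, 3.0]
image_vec = [0.5, -2.0, 1.0]
print(max_fusion(text_vec, image_vec))  # [0.5, -1.0, 3.0]
```

The fused vector would then feed a classification layer; both modalities must first be projected to a common dimensionality for the element-wise operation to be defined.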
Comparison with single modalities:
- BERT text-only: 76.96%
- ResNet50 image-only: 75.29%
- Multimodal advantage: roughly 9 percentage points over text-only and 11 over image-only
Per-category accuracy (6-way, ResNet50 image features):
- Manipulated: ~97% (easiest)
- True: ~85%
- Misleading: ~81%
- False Connection: ~78%
- Satire: ~67%
- Imposter: ~52% (hardest)
Papers in this Wiki that Use Fakeddit¶
Notes¶
- Labels come from distant supervision via subreddit membership; individual samples are not manually annotated. This enables scale, but the moderate inter-annotator agreement (Cohen's Kappa = 0.54) indicates genuine ambiguity: some samples legitimately belong to multiple categories (e.g., a satirical false connection).
- Satire and imposter content remain challenging even for multimodal models, suggesting that these categories require deeper contextual understanding.
- Class imbalance in the 6-way setting (true samples are overrepresented) biases predictions toward the true label; researchers should account for this in evaluation.
- The dataset includes metadata and user comments that the baseline experiments do not explore, offering potential for future research on social-context-based detection.
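One way to account for class imbalance in evaluation is to report per-class accuracy and its unweighted mean alongside overall accuracy. The sketch below is an illustration with made-up labels, not the evaluation protocol of the baseline experiments; the function names are hypothetical.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples within each true class."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


def macro_accuracy(y_true, y_pred):
    """Unweighted mean of per-class accuracy, so every class counts equally."""
    scores = per_class_accuracy(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy imbalanced data: a model that always predicts the majority class.
y_true = ["true"] * 8 + ["satire"] * 2
y_pred = ["true"] * 10

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(overall)                             # 0.8
print(per_class_accuracy(y_true, y_pred))  # {'true': 1.0, 'satire': 0.0}
print(macro_accuracy(y_true, y_pred))      # 0.5
```

In this toy case the overall accuracy looks strong even though the minority class is never predicted, which the macro average exposes.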