Skip to content
Misinformation Detection on YouTube Using Video Captions

Misinformation Detection on YouTube Using Video Captions

Authors: Raj Jagtap, Abhinav Kumar, Rahul Goel, Shakshi Sharma, Rajesh Sharma, Clint P. George

Venue: arXiv preprint, 2021 — arXiv:2107.00941

TL;DR

This work proposes using video captions as text features for classifying YouTube videos as misinformation, debunking misinformation, or neutral content. Across five conspiracy topics (vaccines, 9/11, chemtrails, moon landing, flat earth), the authors achieve 0.85–0.90 F1-score on three-class classification and 0.92–0.95 F1-score when binarizing to misinformation vs. others. Pre-trained word embeddings (GloVe, Word2Vec) with classical ML classifiers outperform relying on metadata alone (views, likes, comments).

Contributions

  • Demonstrates that video metadata (views, likes, comments) alone cannot reliably distinguish misinformation from debunking content
  • Proposes extracting and preprocessing YouTube captions as a rich text signal for misinformation detection
  • Releases a YouTube Caption Scraper tool that handles both manual and auto-generated captions
  • Benchmarks four state-of-the-art pre-trained word embeddings (GloVe Wikipedia 100D/300D, Word2Vec Google News 300D, Word2Vec Twitter 200D) with 28 classical classifiers
  • Introduces a scoring method to rank embeddings by average top-T classifier performance, handling outliers and enabling fair cross-embedding comparison
  • Applies SMOTE to address dataset class imbalance in training

Method

The authors collect 2,943 YouTube videos across five conspiracy topics. Each video is labeled as misinformation (87–140 videos), debunking misinformation (44–237), or neutral (226–473). They download captions for 2,175 videos (some deleted by YouTube, some missing captions). Captions are preprocessed by removing special characters and stop-words, and filtering videos with <500 characters.

Video captions are converted to fixed-dimensional vectors using weighted averages of pre-trained word embeddings: - GloVe Wikipedia (100D and 300D) - Word2Vec Google News (300D) - Word2Vec Twitter (200D)

They train two sets of classifiers: 1. Three-class (Misinformation, Debunking, Neutral) 2. Binary (Misinformation vs. others), which emphasizes the target class

All models use SMOTE to resample minority classes during training. Evaluation uses weighted F1-score, precision, recall, and AUC-ROC on held-out test sets.

Results

Three-class classification: - Best F1-scores: 0.85–0.90 across topics - Best embeddings vary by topic; Google News 300D and GloVe 300D generally most effective - NuSVC (Vaccines, Moon Landing), ExtraTreesClassifier (9/11), CalibratedClassifierCV (Chemtrails), RandomForestClassifier (Flat Earth) perform best

Binary classification (Misinformation vs. others): - Best F1-scores: 0.92–0.95 - AUC-ROC: 0.74–0.90 - SVC and XGBoost dominate best-model rankings - Google News 300D and GloVe 300D remain strongest embeddings

Metadata-only baseline (views, likes, dislikes, comments) is uninformative; classes show overlapping distributions. Caption-based models substantially outperform this.

Connections

Notes

Strengths: - Clear motivation from descriptive analysis showing metadata alone insufficient - Comprehensive embedding/classifier grid search with principled scoring - Public dataset and code release (GitHub) - Balanced treatment of practical constraints (caption availability, auto-generated vs. manual)

Limitations: - Only English captions; relies on YouTube's translation when needed - Pre-trained embeddings trained on Wikipedia/News/Twitter, not video text; future work suggests training embeddings directly on captions - SMOTE addresses training imbalance but may not reflect real-world class distributions - Limited to five politically charged topics; generalization to other misinformation domains unclear - No comparison to modern deep-learning baselines (e.g., BERT, RoBERTa) or end-to-end video models

Open questions: - How do results vary with caption quality (manual vs. auto-generated)? - Would topic-specific embeddings or domain adaptation improve cross-topic performance? - Does fine-tuning pre-trained language models (e.g., RoBERTa) on captions outperform classical embeddings?