All-in-one: Multi-task Learning for Rumour Verification

Authors: Elena Kochkina, Maria Liakata, Arkaitz Zubiaga

Affiliation: University of Warwick, Alan Turing Institute

Venue: arXiv, June 2018

arXiv: 1806.03713

TL;DR

This paper proposes a multi-task learning framework for rumor verification that jointly trains veracity classification (the main task) with auxiliary tasks—rumor detection and stance classification. The approach outperforms single-task baselines on both RumourEval and PHEME datasets, demonstrating that sharing representations across related subtasks improves veracity prediction, particularly on datasets with lower kurtosis in label distributions.

Contributions

  • Multi-task learning framework for rumor verification: Demonstrates that joint training of veracity classification with stance classification and rumor detection improves performance on the main task
  • Empirical analysis of task combinations: Evaluates three multi-task scenarios (veracity+stance, veracity+detection, all three) and shows veracity+stance produces the strongest improvements
  • Link between data properties and MTL effectiveness: Analyzes kurtosis, entropy, and token-type ratio properties of datasets and correlates them with multi-task learning outcomes—lower kurtosis events benefit more from multi-task approaches
  • Evaluation on two rumor datasets: Comprehensive experiments on PHEME and RumourEval with leave-one-event-out cross-validation, showing macro F-score improvements of 5–15 percentage points depending on dataset and auxiliary task combination
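
Two of the dataset properties named above are simple to compute. The sketch below shows label entropy and the type/token ratio (distinct tokens divided by total tokens, the usual meaning of TTR). The function names and toy inputs are illustrative assumptions, and kurtosis is left out because this summary does not say how it was computed.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)

# Toy label sets (assumed, not from the datasets): one balanced, one skewed.
balanced = ["true", "false", "unverified"] * 4
skewed = ["true"] * 10 + ["false", "unverified"]

balanced_entropy = label_entropy(balanced)  # equals log2(3) for three equal classes
skewed_entropy = label_entropy(skewed)      # lower, because one class dominates
ttr = type_token_ratio("the plane the crash".split())  # 3 types / 4 tokens
```

A balanced label set gives the maximum entropy for its number of classes, so lower entropy is one way to quantify label skew in an event.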

Method

Rumor resolution pipeline

The framework decomposes rumor verification into four sequential subtasks (following Zubiaga et al. 2018):

  1. Detection: Binary classification of rumor vs. non-rumor
  2. Tracking: Collection and threading of source and responses
  3. Stance classification: Labeling tweet stance as Supporting, Denying, Questioning, or Commenting
  4. Veracity classification: Determining if the rumor is True, False, or Unverified (the ultimate goal)

Sequential baseline: branchLSTM

The baseline model extends the branch-LSTM approach from Kochkina et al. 2017:

  • Decomposes Twitter conversations into linear branches
  • Represents each tweet as the mean of word2vec embeddings (pre-trained on Google News, 300d)
  • Feeds branch sequences through LSTM layers and applies softmax for per-tweet predictions
  • Uses majority voting on branch predictions to obtain thread-level labels
  • Trained with categorical cross-entropy loss
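
Two of the steps above, mean-pooled tweet vectors and majority voting over branch predictions, can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper names, the toy 2-d embeddings, skipping out-of-vocabulary tokens, and the tie-breaking rule are all assumptions.

```python
from collections import Counter

def tweet_vector(tokens, embeddings, dim):
    """Average the embeddings of tokens that have one; zeros if none do.

    Skipping out-of-vocabulary tokens is an assumption of this sketch.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def thread_label(branch_predictions):
    """Majority vote over per-branch labels; ties go to the label seen first."""
    return Counter(branch_predictions).most_common(1)[0][0]

# Toy 2-d embeddings standing in for the 300-d word2vec vectors.
toy_embeddings = {"plane": [1.0, 0.0], "crash": [0.0, 1.0]}

vec = tweet_vector(["plane", "crash", "unknownword"], toy_embeddings, dim=2)
label = thread_label(["false", "true", "false"])
```

Here `vec` is `[0.5, 0.5]` and `label` is `"false"`.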

Multi-task learning approach

The architecture uses hard parameter sharing: a shared LSTM layer fed by processed tweet sequences, with task-specific output layers:

Input (tweet branches)
Shared LSTM (1–2 layers)
Task-specific layers (Dense + ReLU + Softmax for each task)
    ├→ Stance classification (per-tweet predictions)
    ├→ Veracity classification (per-branch aggregation for RumourEval, per-thread for PHEME)
    └→ Rumor detection (binary, per-thread)
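
Hard parameter sharing means every task reads the same encoder output and only the output layers differ. The forward pass below is a rough sketch of that idea under several assumptions: a plain tanh recurrent layer stands in for the shared LSTM, the heads are a single linear map plus softmax (the Dense + ReLU layers are omitted), weights are random and untrained, branch-level tasks read the last hidden state, and all dimensions are toy values.

```python
import math
import random

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class SharedEncoder:
    """Stand-in for the shared LSTM: a plain tanh recurrent layer."""
    def __init__(self, input_dim, hidden_dim, rng):
        self.w_in = random_matrix(hidden_dim, input_dim, rng)
        self.w_rec = random_matrix(hidden_dim, hidden_dim, rng)
        self.hidden_dim = hidden_dim

    def __call__(self, branch):
        state = [0.0] * self.hidden_dim
        states = []
        for tweet in branch:
            pre = [a + b for a, b in zip(linear(self.w_in, tweet),
                                         linear(self.w_rec, state))]
            state = [math.tanh(p) for p in pre]
            states.append(state)
        return states  # one hidden state per tweet in the branch

class TaskHead:
    """Task-specific output layer: linear map followed by softmax."""
    def __init__(self, hidden_dim, n_classes, rng):
        self.w = random_matrix(n_classes, hidden_dim, rng)

    def __call__(self, state):
        return softmax(linear(self.w, state))

rng = random.Random(0)
encoder = SharedEncoder(input_dim=4, hidden_dim=8, rng=rng)  # shared by all tasks
heads = {
    "stance": TaskHead(8, 4, rng),     # supporting / denying / questioning / commenting
    "veracity": TaskHead(8, 3, rng),   # true / false / unverified
    "detection": TaskHead(8, 2, rng),  # rumor / non-rumor
}

branch = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # 3 toy tweets
states = encoder(branch)
stance_probs = [heads["stance"](s) for s in states]  # one prediction per tweet
veracity_probs = heads["veracity"](states[-1])       # one prediction per branch
detection_probs = heads["detection"](states[-1])
```

Because all three heads consume `states`, gradients from every task would update the same encoder weights during training, which is where the auxiliary tasks exert their influence.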

Three multi-task configurations tested:

  • MTL2 Veracity+Stance: Joint learning of veracity (main) and stance (auxiliary)
  • MTL2 Veracity+Detection: Joint learning of veracity and rumor detection
  • MTL3 (all three): Joint learning of all three tasks

Loss function: a weighted sum of the individual task losses, with equal weights for all tasks. Each task loss is categorical cross-entropy. Because the datasets are imbalanced, both macro F-score and accuracy are reported.
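
The joint objective can be written out directly. The sketch below assumes one example per task and toy probability vectors; `cross_entropy` and `joint_loss` are hypothetical helper names, and the `task_weights` argument is included only to show where unequal weights would enter.

```python
import math

def cross_entropy(probs, target_index):
    """Categorical cross-entropy for one example with a one-hot target."""
    return -math.log(probs[target_index])

def joint_loss(task_losses, task_weights=None):
    """Weighted sum of per-task losses; every weight defaults to 1.0."""
    if task_weights is None:
        task_weights = {task: 1.0 for task in task_losses}
    return sum(task_weights[task] * loss for task, loss in task_losses.items())

# Toy predicted distributions (assumed values) and the index of the gold class.
losses = {
    "veracity": cross_entropy([0.7, 0.2, 0.1], 0),
    "stance": cross_entropy([0.1, 0.1, 0.2, 0.6], 3),
}
total = joint_loss(losses)  # equal weights, as in the setup described above
veracity_only = joint_loss(losses, {"veracity": 1.0, "stance": 0.0})
```

Setting an auxiliary weight to zero recovers the single-task loss, which makes the weights a natural knob for the loss-weighting ablation.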

Features and preprocessing

  • Word representations: word2vec embeddings (300d) from Google News
  • Aggregation: Mean pooling of word embeddings per tweet
  • Hyperparameters: Optimized via Tree of Parzen Estimators (TPE):
      • Dense ReLU layers: 1–4
      • LSTM units: 100–300
      • Dropout: 50% before output
      • L2 regularization: 10^{-3}–10^{-5}
      • Mini-batch: 32; epochs: 50
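
The ranges above can be encoded as a small search space. The sketch below draws one configuration by plain random sampling, which is not TPE; it only illustrates the shape of the space a TPE optimizer would explore. Sampling the L2 strength log-uniformly and treating dropout, batch size, and epochs as fixed are assumptions of this sketch.

```python
import random

# Searched ranges, taken from the list above.
SEARCH_SPACE = {
    "dense_relu_layers": (1, 4),  # integer range, inclusive
    "lstm_units": (100, 300),     # integer range, inclusive
    "l2_exponent": (-5, -3),      # L2 strength is 10 ** exponent
}
# Values listed without a range are treated as fixed (an assumption).
FIXED = {"dropout": 0.5, "batch_size": 32, "epochs": 50}

def sample_config(rng):
    """Draw one configuration uniformly at random (a stand-in for TPE)."""
    low, high = SEARCH_SPACE["l2_exponent"]
    config = {
        "dense_relu_layers": rng.randint(*SEARCH_SPACE["dense_relu_layers"]),
        "lstm_units": rng.randint(*SEARCH_SPACE["lstm_units"]),
        "l2": 10 ** rng.uniform(low, high),
    }
    config.update(FIXED)
    return config

config = sample_config(random.Random(0))
```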

Results

RumourEval dataset

| Metric | Majority Baseline | NileTMRG* | branchLSTM | MTL2 (Veracity+Stance) | MTL2 (Veracity+Detection) | MTL3 (all) |
|---|---|---|---|---|---|---|
| Macro F | 0.148 | 0.539 | 0.491 | 0.558 | 0.571 | - |
| Accuracy | 0.286 | 0.570 | 0.500 | 0.571 | - | - |

MTL2 (Veracity+Stance) achieves 0.558 macro F on veracity classification, outperforming the single-task branchLSTM baseline (0.491) by 6.7 points, a 13.6% relative gain.
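
The gap between accuracy and macro F for the majority baseline follows from how macro F averages over classes. The sketch below computes both metrics for a classifier that always predicts one class. The 28-thread test set and its class counts are hypothetical, chosen only so the predicted class covers 8 of 28 threads; they are not taken from the source.

```python
def macro_f1(gold, predicted, labels):
    """Unweighted mean of per-class F1; a class with no support or predictions scores 0."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)

def accuracy(gold, predicted):
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

labels = ["true", "false", "unverified"]
# Hypothetical test set: illustrative class counts only.
gold = ["true"] * 8 + ["false"] * 12 + ["unverified"] * 8
predicted = ["true"] * 28  # always predict the same class

acc = accuracy(gold, predicted)            # 8 / 28, about 0.286
macro = macro_f1(gold, predicted, labels)  # (16/36 + 0 + 0) / 3, about 0.148
```

Two of the three classes contribute an F1 of zero, so macro F falls to roughly half the accuracy even though the predicted class has perfect recall.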

PHEME 5 largest events

| Metric | Majority | NileTMRG* | branchLSTM | MTL2 (Veracity+Stance) | MTL2 (Veracity+Detection) | MTL3 (all) |
|---|---|---|---|---|---|---|
| Macro F | 0.226 | 0.339 | 0.454 | 0.441 | 0.410 | 0.492 |
| Accuracy | 0.511 | 0.438 | 0.454 | 0.441 | 0.410 | 0.492 |

MTL3 (all three tasks) achieves 0.492 macro F on veracity, a 3.8-point (8.4% relative) improvement over branchLSTM (0.454); both two-task variants score below that baseline (0.441 and 0.410). On RumourEval, by contrast, adding tasks beyond stance yields diminishing returns.

PHEME 9 events (full dataset)

| Metric | Majority | NileTMRG* | branchLSTM | MTL2 (Veracity+Stance) | MTL2 (Veracity+Detection) | MTL3 (all) |
|---|---|---|---|---|---|---|
| Macro F | 0.203 | 0.297 | 0.314 | 0.357 | 0.397 | 0.405 |
| Accuracy | 0.444 | 0.360 | 0.314 | 0.357 | 0.397 | 0.405 |

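MTL3 achieves 0.405 macro F, a 9.1-point (roughly 29% relative) improvement over single-task branchLSTM (0.314). Detection as an auxiliary task contributes more on the larger 9-event set: MTL2 (Veracity+Detection) reaches 0.397 against 0.357 for MTL2 (Veracity+Stance).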
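
The improvement figures quoted in this section are relative gains over the branchLSTM baseline, which can look much larger than the absolute difference in macro F. The short sketch below recomputes both from the macro F values reported above; `improvement` is a hypothetical helper name.

```python
def improvement(baseline, score):
    """Return (absolute gain, gain relative to the baseline)."""
    absolute = score - baseline
    return absolute, absolute / baseline

# (branchLSTM macro F, best multi-task macro F) as reported in the results above.
rows = {
    "RumourEval, MTL2 Veracity+Stance": (0.491, 0.558),
    "PHEME 5 events, MTL3": (0.454, 0.492),
    "PHEME 9 events, MTL3": (0.314, 0.405),
}

for name, (baseline, score) in rows.items():
    absolute, relative = improvement(baseline, score)
    print(f"{name}: +{absolute * 100:.1f} points, +{relative * 100:.1f}% relative")
```

Reporting both numbers side by side avoids reading a relative gain on a low baseline as a large absolute jump.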

Per-event analysis

Performance varies substantially by event. Charlie Hebdo (0.327 MTL3), Sydney siege (0.350), and Ferguson (0.189) are hardest due to class imbalance and small event size. Germanwings-crash (0.429) shows strong multi-task gains; gains are larger when an event contains all three label classes.

Notes

Strengths:

  • Systematic exploration of multi-task learning configurations on rumor verification; shows the benefits of auxiliary task selection
  • Strong empirical improvements, especially on PHEME (roughly 29% relative on 9 events)
  • Thoughtful analysis linking dataset properties (kurtosis, entropy, TTR) to multi-task gains: lower kurtosis (more balanced label distributions) benefits more from auxiliary tasks
  • Clear demonstration that not all auxiliary tasks help equally (whether stance or detection is the better partner for veracity depends on the dataset)
  • Reproducible methodology with hyperparameter tuning via TPE

Limitations:

  • Performance drops on the full 9-event PHEME set relative to the 5-event subset (0.405 vs. 0.492 macro F for MTL3), likely because of small events and imbalanced labels, which constrains generalization claims
  • RumourEval improvements are modest (0.491 → 0.558), and single-task branchLSTM is already competitive with multi-task in some configurations
  • No analysis of which shared representations are learned or what features the auxiliary tasks extract
  • Limited to Twitter data; unclear if findings transfer to Reddit (tested in RumourEval 2019)
  • Veracity prediction still lags behind stance classification (0.558 F is much lower than 0.78+ stance F-scores); multi-task learning helps but the core challenge remains
  • No ablation on task loss weighting: equal weights may not be optimal, and loss scheduling or learned task weights could help
  • Post-SemEval setup: uses different rumor detection labels than standard PHEME, making direct comparison to prior work difficult

Follow-up work:

  • RumourEval 2019 extended this to Reddit and cross-platform settings, with newer datasets showing continued gains for auxiliary tasks
  • Later work explores hierarchical multi-task learning and learned task weighting strategies
  • Joint stance+veracity learning is now standard in rumor verification shared tasks