All-in-one: Multi-task Learning for Rumour Verification¶
Authors: Elena Kochkina, Maria Liakata, Arkaitz Zubiaga
Affiliation: University of Warwick, Alan Turing Institute
Venue: arXiv, June 2018
arXiv: 1806.03713
TL;DR¶
This paper proposes a multi-task learning framework for rumor verification that jointly trains veracity classification (the main task) with two auxiliary tasks: rumor detection and stance classification. The approach outperforms single-task baselines on both the RumourEval and PHEME datasets, suggesting that sharing representations across related subtasks improves veracity prediction, particularly on datasets with lower kurtosis in label distributions.
Contributions¶
- Multi-task learning framework for rumor verification: Demonstrates that joint training of veracity classification with stance classification and rumor detection improves performance on the main task
- Empirical analysis of task combinations: Evaluates three multi-task scenarios (veracity+stance, veracity+detection, all three); veracity+stance beats the single-task baseline on RumourEval, while the three-task model gives the best macro F on both PHEME settings
- Link between data properties and MTL effectiveness: Analyzes kurtosis, entropy, and token-type ratio properties of datasets and correlates them with multi-task learning outcomes—lower kurtosis events benefit more from multi-task approaches
- Evaluation on two rumor datasets: Experiments on PHEME and RumourEval with leave-one-event-out cross-validation, showing absolute macro F gains over the single-task baseline of roughly 4 to 9 points (about 8% to 29% relative), depending on dataset and auxiliary task combination
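Two of the dataset properties named above, label entropy and type-token ratio, are simple to compute. The sketch below is illustrative only: the function names, the whitespace tokenisation, and the toy data are assumptions, not the authors' code. Kurtosis is left out because this summary does not say how it is defined over labels.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def type_token_ratio(texts):
    """Distinct tokens divided by total tokens (whitespace tokenisation, assumed)."""
    tokens = [tok.lower() for text in texts for tok in text.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Toy data, invented for illustration only.
balanced = ["true", "false", "unverified"] * 4
skewed = ["true"] * 10 + ["false", "unverified"]
tweets = ["police confirm the report", "the report is false"]

print(label_entropy(balanced))   # maximal for three classes: log2(3)
print(label_entropy(skewed))     # lower, because one label dominates
print(type_token_ratio(tweets))  # 6 distinct tokens out of 8
```

A balanced label set has higher entropy than a skewed one, which gives one concrete way to compare events before relating them to multi-task gains.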
Method¶
Rumor resolution pipeline¶
The framework decomposes rumor verification into four sequential subtasks (following Zubiaga et al. 2018):
- Detection: Binary classification of rumor vs. non-rumor
- Tracking: Collection and threading of source and responses
- Stance classification: Labeling tweet stance as Supporting, Denying, Questioning, or Commenting
- Veracity classification: Determining if the rumor is True, False, or Unverified (the ultimate goal)
Sequential baseline: branchLSTM¶
The baseline model extends the branch-LSTM approach from Kochkina et al. 2017:
- Decomposes Twitter conversations into linear branches
- Represents each tweet as the mean of word2vec embeddings (pre-trained on Google News, 300d)
- Feeds branch sequences through LSTM layers and applies softmax for per-tweet predictions
- Uses majority voting on branch predictions to obtain thread-level labels
- Trained with categorical cross-entropy loss
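The non-neural parts of this pipeline (branch decomposition, mean-pooled tweet vectors, and majority voting) can be sketched in plain Python. This is a minimal illustration under assumptions: the reply-tree format, the helper names, and the toy two-dimensional vectors are hypothetical and stand in for the real word2vec embeddings and data structures.

```python
from collections import Counter


def linear_branches(replies, root):
    """Return every root-to-leaf path in a reply tree.

    `replies` maps a tweet id to the ids of its direct replies (assumed format).
    """
    children = replies.get(root, [])
    if not children:
        return [[root]]
    return [[root] + path
            for child in children
            for path in linear_branches(replies, child)]


def mean_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def majority_vote(branch_labels):
    """Most frequent branch-level label, used as the thread-level label."""
    return Counter(branch_labels).most_common(1)[0][0]


# Toy conversation: the source tweet has two replies, one of which has a reply.
replies = {"src": ["r1", "r2"], "r1": ["r3"]}
print(linear_branches(replies, "src"))

# Toy 2-d vectors standing in for 300-d word2vec embeddings.
vectors = {"fire": [1.0, 0.0], "report": [0.0, 1.0]}
print(mean_embedding(["fire", "report", "unknown"], vectors, dim=2))

print(majority_vote(["false", "true", "false"]))
```

Each branch would then be fed to the LSTM as a sequence of tweet vectors, and the per-branch predictions combined with `majority_vote`.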
Multi-task learning approach¶
The architecture uses hard parameter sharing: a shared LSTM layer fed by processed tweet sequences, with task-specific output layers:
Input (tweet branches)
↓
Shared LSTM (1–2 layers)
↓
Task-specific layers (Dense + ReLU + Softmax for each task)
├→ Stance classification (per-tweet predictions)
├→ Veracity classification (per-branch aggregation for RumourEval, per-thread for PHEME)
└→ Rumor detection (binary, per-thread)
Three multi-task configurations were tested:
- MTL2 Veracity+Stance: Joint learning of veracity (main) and stance (auxiliary)
- MTL2 Veracity+Detection: Joint learning of veracity and rumor detection
- MTL3 (all three): Joint learning of all three tasks
Loss function: a weighted sum of the per-task categorical cross-entropy losses, with equal weights for all tasks. Because the label distributions are imbalanced, both macro F-score and accuracy are reported.
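As a worked illustration of the combined objective, the sketch below sums per-task cross-entropy terms for a single example. The function names, the dictionary layout, and the probabilities are invented for illustration. It also simplifies the real setup, where stance is predicted per tweet and the other tasks at branch or thread level.

```python
import math


def cross_entropy(probs, gold_index):
    """Categorical cross-entropy for one example with a one-hot target."""
    return -math.log(probs[gold_index])


def multitask_loss(task_outputs, weights=None):
    """Weighted sum of per-task cross-entropy losses.

    `task_outputs` maps a task name to (predicted probabilities, gold class index).
    With `weights=None` every task gets weight 1.0, i.e. equal weighting.
    """
    weights = weights or {task: 1.0 for task in task_outputs}
    return sum(weights[task] * cross_entropy(probs, gold)
               for task, (probs, gold) in task_outputs.items())


# Invented softmax outputs for a single training example.
example = {
    "veracity": ([0.7, 0.2, 0.1], 0),      # true / false / unverified
    "stance": ([0.1, 0.1, 0.2, 0.6], 3),   # support / deny / question / comment
    "detection": ([0.8, 0.2], 0),          # rumor / non-rumor
}
print(multitask_loss(example))
```

Passing a different `weights` dictionary is how a non-uniform task weighting could be tried with the same function.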
Features and preprocessing¶
- Word representations: word2vec embeddings (300d) from Google News
- Aggregation: Mean pooling of word embeddings per tweet
- Hyperparameters: Optimized via Tree of Parzen Estimators (TPE):
  - Dense ReLU layers: 1–4
  - LSTM units: 100–300
  - Dropout: 50% before output
  - L2 regularization: 10^{-3}–10^{-5}
  - Mini-batch: 32; epochs: 50
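The ranges above can be written down as a search space. The sketch below draws configurations by plain random sampling, which is a simpler stand-in for TPE and not the procedure used in the paper. The key names, the 100/200/300 grid for LSTM units, the log-uniform draw for L2, and the choice to hold dropout, batch size, and epochs fixed are all assumptions.

```python
import random


def sample_config(rng):
    """Draw one configuration from the ranges listed above (random search, not TPE)."""
    return {
        "lstm_layers": rng.randint(1, 2),           # shared LSTM depth
        "lstm_units": rng.choice([100, 200, 300]),  # grid within 100-300 (assumed)
        "dense_relu_layers": rng.randint(1, 4),
        "l2": 10 ** rng.uniform(-5, -3),            # log-uniform over 1e-5..1e-3 (assumed)
        "dropout": 0.5,                             # treated as fixed (assumed)
        "batch_size": 32,                           # treated as fixed (assumed)
        "epochs": 50,                               # treated as fixed (assumed)
    }


rng = random.Random(0)  # seeded so the draws are repeatable
for _ in range(3):
    print(sample_config(rng))
```

A TPE-based search would replace the independent draws with proposals conditioned on the scores of earlier trials, over the same space.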
Results¶
RumourEval dataset¶
| Metric | Majority Baseline | NileTMRG* | branchLSTM | MTL2 (Veracity+Stance) | MTL2 (Veracity+Detection) | MTL3 (all) |
|---|---|---|---|---|---|---|
| Macro F | 0.148 | 0.539 | 0.491 | 0.558 | — | — |
| Accuracy | 0.286 | 0.570 | 0.500 | 0.571 | — | — |
MTL2 (Veracity+Stance) achieves 0.558 macro F on veracity classification, outperforming the single-task branchLSTM baseline (0.491) by 6.7 points absolute, or 13.6% relative.
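The percentage quoted here is a relative gain over the baseline score, not a difference in percentage points. A small helper makes the arithmetic explicit; the function name is illustrative.

```python
def relative_gain(baseline, score):
    """Relative improvement of `score` over `baseline`, in percent."""
    return 100.0 * (score - baseline) / baseline


# RumourEval macro F: branchLSTM 0.491, MTL2 (Veracity+Stance) 0.558.
print(round(relative_gain(0.491, 0.558), 1))  # 13.6
```

The same call applies to the PHEME tables by swapping in the branchLSTM and multi-task macro F values.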
PHEME 5 largest events¶
| Metric | Majority | NileTMRG* | branchLSTM | MTL2 (Veracity+Stance) | MTL2 (Veracity+Detection) | MTL3 (all) |
|---|---|---|---|---|---|---|
| Macro F | 0.226 | 0.339 | 0.454 | 0.441 | 0.410 | 0.492 |
| Accuracy | 0.511 | 0.438 | 0.454 | 0.441 | 0.410 | 0.492 |
MTL3 (all three tasks) achieves 0.492 macro F on veracity, an 8.4% relative improvement over branchLSTM (0.454). No MTL3 result is reported for RumourEval, so the three-task gain cannot be compared there.
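The gap between the majority baseline's accuracy (0.511) and its macro F (0.226) follows from how macro F averages over classes: a classifier that always predicts one label scores zero F1 on the other two. The sketch below reproduces the effect on invented label counts chosen so that 51.1% of threads are "true". The split of the remaining labels is an assumption and does not affect the result.

```python
def macro_f1(gold, predicted, labels):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(labels)


labels = ["true", "false", "unverified"]
# Invented counts: 511 of 1000 threads are "true".
gold = ["true"] * 511 + ["false"] * 300 + ["unverified"] * 189
majority = ["true"] * len(gold)

accuracy = sum(g == p for g, p in zip(gold, majority)) / len(gold)
print(accuracy, round(macro_f1(gold, majority, labels), 3))  # 0.511 0.225
```

The result lands within rounding of the reported 0.226, which is why macro F is the more informative metric on these imbalanced label sets.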
PHEME 9 events (full dataset)¶
| Metric | Majority | NileTMRG* | branchLSTM | MTL2 (Veracity+Stance) | MTL2 (Veracity+Detection) | MTL3 (all) |
|---|---|---|---|---|---|---|
| Macro F | 0.205 | 0.297 | 0.314 | 0.357 | 0.397 | 0.405 |
| Accuracy | 0.444 | 0.360 | 0.314 | 0.357 | 0.397 | 0.405 |
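MTL3 achieves 0.405 macro F, about a 29% relative improvement over single-task branchLSTM (0.314). Detection is the stronger auxiliary task on the larger 9-event set (0.397 for veracity+detection vs 0.357 for veracity+stance), the reverse of the ordering on the 5 largest events.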
Per-event analysis¶
Performance varies significantly by event. Charlie Hebdo (0.327 MTL3), Sydney siege (0.350), and Ferguson (0.189) are hardest due to class imbalance and small event size. Germanwings-crash (0.429) shows strong multi-task gains when events contain all three label classes.
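Per-event scores like these come out of leave-one-event-out evaluation, where each event in turn serves as the test fold and the model is trained on all the others. A minimal sketch of the split follows; the record format and the thread ids are hypothetical, and the event names follow those mentioned above.

```python
def leave_one_event_out(threads):
    """Yield (held_out_event, train, test) with each event held out once.

    `threads` is a list of dicts that carry at least an "event" key (assumed format).
    """
    events = sorted({t["event"] for t in threads})
    for held_out in events:
        train = [t for t in threads if t["event"] != held_out]
        test = [t for t in threads if t["event"] == held_out]
        yield held_out, train, test


# Invented thread records for illustration.
threads = [
    {"id": 1, "event": "charliehebdo"},
    {"id": 2, "event": "charliehebdo"},
    {"id": 3, "event": "ferguson"},
    {"id": 4, "event": "germanwings-crash"},
    {"id": 5, "event": "sydneysiege"},
]
for event, train, test in leave_one_event_out(threads):
    print(event, len(train), len(test))
```

Because the test fold is a whole event, an event with few threads or a skewed label mix yields a noisy score, which is one reason results vary so much across events.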
Connections¶
- Detection and Resolution of Rumours in Social Media: A Survey — comprehensive review of the four-step rumor classification pipeline this work implements
- Turing at SemEval-2017: Sequential Approach to Rumour Stance Classification with Branch-LSTM — introduces the branch-LSTM architecture extended here for multi-task learning
- Stance classification — related topic on stance detection as auxiliary task for rumor verification
- SemEval-2017 Task 8: RumourEval — shared task defining the RumourEval dataset used in evaluation
- RumourEval 2019 — follow-up shared task extending rumor stance and verification to Reddit data
- Multi-task learning — broader topic on multi-task learning approaches in NLP
Notes¶
Strengths:
- Systematic exploration of multi-task learning configurations for rumor verification, showing that the choice of auxiliary task matters
- Strong empirical improvements, especially on PHEME (about 29% relative macro F gain on the 9-event set)
- Thoughtful analysis linking dataset properties (kurtosis, entropy, TTR) to multi-task gains: lower kurtosis (more balanced label distributions) benefits more from auxiliary tasks
- Clear demonstration that not all auxiliary tasks help equally (whether stance or detection helps more depends on the dataset)
- Reproducible methodology with hyperparameter tuning via TPE
Limitations:
- Performance drops on the full 9-event PHEME set relative to the 5 largest events (0.405 vs 0.492 macro F for MTL3), likely because the added events are small and have imbalanced labels; this constrains generalization claims
- RumourEval improvements are modest (0.491 → 0.558), and the single-task branchLSTM is already competitive with multi-task learning in some configurations
- No analysis of which shared representations are learned or what features the auxiliary tasks extract
- Limited to Twitter data; unclear if findings transfer to Reddit (tested in RumourEval 2019)
- Veracity prediction still lags behind stance classification (0.558 F is much lower than 0.78+ stance F-scores); multi-task learning helps but the core challenge remains
- No ablation on task loss weighting: equal weights may not be optimal, and loss scheduling or learned task weights could help
- Post-SemEval setup: uses different rumor detection labels than standard PHEME, making direct comparison to prior work difficult
Follow-up work:
- RumourEval 2019 extended this to Reddit and cross-platform settings, with newer datasets showing continued gains for auxiliary tasks
- Later work explores hierarchical multi-task learning and learned task weighting strategies
- Joint stance+veracity learning is now standard in rumor verification shared tasks