All-in-one: Multi-task Learning for Rumour Verification

Authors: Elena Kochkina, Maria Liakata, Arkaitz Zubiaga

Affiliation: University of Warwick, Alan Turing Institute

Venue: arXiv, June 2018

TL;DR

This paper proposes a multi-task learning framework for rumor verification that jointly trains veracity classification (the main task) with two auxiliary tasks, rumor detection and stance classification. The approach outperforms single-task baselines on both the RumourEval and PHEME datasets, suggesting that sharing representations across related subtasks improves veracity prediction, particularly on datasets with lower kurtosis in label distributions.

Contributions

Multi-task learning framework for rumor verification: Demonstrates that joint training of veracity classification with stance classification and rumor detection improves performance on the main task
Empirical analysis of task combinations: Evaluates three multi-task scenarios (veracity+stance, veracity+detection, all three). Veracity+stance beats the single-task baseline on RumourEval, while the three-task model scores highest on PHEME
Link between data properties and MTL effectiveness: Analyzes kurtosis, entropy, and type-token ratio of the datasets and relates them to multi-task learning outcomes; events with lower kurtosis benefit more from multi-task approaches
Evaluation on two rumor datasets: Experiments on PHEME and RumourEval with leave-one-event-out cross-validation, showing macro F-score gains over the single-task branchLSTM of roughly 4 to 9 percentage points for the best multi-task configuration in the Results tables
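The dataset properties named in the contributions (entropy, kurtosis, type-token ratio) can be computed with standard formulas. The sketch below is illustrative only: the function names are hypothetical, kurtosis is taken as population excess kurtosis, and which quantity the paper actually feeds into each statistic is an assumption not confirmed by these notes.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def excess_kurtosis(values):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0

def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)

# Toy inputs, not taken from PHEME or RumourEval.
print(label_entropy(["true", "true", "false", "false"]))   # 1.0 bit: perfectly balanced
print(type_token_ratio("the cat saw the dog".split()))     # 0.8: 4 types over 5 tokens
print(excess_kurtosis([1, 2, 3, 4, 5]))                    # about -1.3: flatter than a normal
```

Higher entropy indicates a more evenly spread label distribution, which is one way to read the balance-related claims in this summary.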

Method

Rumor resolution pipeline

The framework decomposes rumor verification into four sequential subtasks (following Zubiaga et al. 2018):

Detection: Binary classification of rumor vs. non-rumor
Tracking: Collection and threading of source and responses
Stance classification: Labeling tweet stance as Supporting, Denying, Questioning, or Commenting
Veracity classification: Determining if the rumor is True, False, or Unverified (the ultimate goal)

Sequential baseline: branchLSTM

The baseline model extends the branch-LSTM approach from Kochkina et al. 2017:

Decomposes Twitter conversations into linear branches
Represents each tweet as the mean of word2vec embeddings (pre-trained on Google News, 300d)
Feeds branch sequences through LSTM layers and applies softmax for per-tweet predictions
Uses majority voting on branch predictions to obtain thread-level labels
Trained with categorical cross-entropy loss
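The non-neural steps in this list (branch decomposition, mean-pooled tweet vectors, majority voting) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the reply-tree format (a dict from tweet id to reply ids), the function names, and the toy 2-d vectors are all assumptions.

```python
from collections import Counter

def branches(tree, root):
    """Return every root-to-leaf path of a reply tree given as {tweet_id: [reply_ids]}."""
    replies = tree.get(root, [])
    if not replies:
        return [[root]]
    return [[root] + path for child in replies for path in branches(tree, child)]

def mean_vector(word_vectors):
    """Average a list of equal-length word vectors into one tweet vector."""
    return [sum(column) / len(word_vectors) for column in zip(*word_vectors)]

def majority_vote(labels):
    """Pick the most frequent label among per-branch predictions."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical conversation: the source tweet has two replies, one of which has a reply.
tree = {"src": ["r1", "r2"], "r1": ["r3"]}
print(branches(tree, "src"))                      # [['src', 'r1', 'r3'], ['src', 'r2']]
print(mean_vector([[1.0, 2.0], [3.0, 4.0]]))      # [2.0, 3.0]
print(majority_vote(["true", "false", "true"]))   # true
```

In the real model the word vectors would be the 300-dimensional word2vec embeddings and the per-branch labels would come from the LSTM's softmax output.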

Multi-task learning approach

The architecture uses hard parameter sharing: a shared LSTM layer fed by processed tweet sequences, with task-specific output layers:

Input (tweet branches)
    ↓
Shared LSTM (1–2 layers)
    ↓
Task-specific layers (Dense + ReLU + Softmax for each task)
    ├→ Stance classification (per-tweet predictions)
    ├→ Veracity classification (per-branch aggregation for RumourEval, per-thread for PHEME)
    └→ Rumor detection (binary, per-thread)
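The diagram above can be mirrored by a small forward-pass sketch of hard parameter sharing: one encoder whose weights are used by every task, plus a separate head per task. To stay dependency-free, a plain tanh recurrence stands in for the shared LSTM, all dimensions are toy values, and reading veracity off the final hidden state is an assumption about the aggregation; class and variable names are hypothetical.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class SharedEncoder:
    """Stand-in for the shared LSTM: a simple tanh recurrence over tweet vectors."""
    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        self.w_in = random_matrix(hidden_dim, input_dim, rng)
        self.w_rec = random_matrix(hidden_dim, hidden_dim, rng)

    def __call__(self, branch):
        state, states = [0.0] * self.hidden_dim, []
        for tweet_vec in branch:
            pre = [a + b for a, b in zip(linear(self.w_in, tweet_vec),
                                         linear(self.w_rec, state))]
            state = [math.tanh(p) for p in pre]
            states.append(state)
        return states  # one hidden state per tweet in the branch

class TaskHead:
    """Task-specific layers: Dense + ReLU, then a softmax output."""
    def __init__(self, hidden_dim, dense_dim, n_classes, rng):
        self.w_dense = random_matrix(dense_dim, hidden_dim, rng)
        self.w_out = random_matrix(n_classes, dense_dim, rng)

    def __call__(self, state):
        dense = [max(0.0, v) for v in linear(self.w_dense, state)]
        return softmax(linear(self.w_out, dense))

rng = random.Random(0)
encoder = SharedEncoder(input_dim=4, hidden_dim=8, rng=rng)   # shared by all tasks
stance_head = TaskHead(8, 6, 4, rng)     # Supporting / Denying / Questioning / Commenting
veracity_head = TaskHead(8, 6, 3, rng)   # True / False / Unverified

branch = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # 3 tweets, toy 4-d vectors
states = encoder(branch)
stance_probs = [stance_head(s) for s in states]   # per-tweet stance distribution
veracity_probs = veracity_head(states[-1])        # assumed: veracity from the last state
```

Because both heads consume the same encoder output, gradients from every task would update the shared weights during training, which is the point of hard parameter sharing.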

Three multi-task configurations are tested:

MTL2 Veracity+Stance: Joint learning of veracity (main) and stance (auxiliary)
MTL2 Veracity+Detection: Joint learning of veracity and rumor detection
MTL3 (all three): Joint learning of all three tasks

Loss function: each task is trained with categorical cross-entropy, and the overall loss is a weighted sum of the individual task losses with equal weights for all tasks. Because the label distributions are imbalanced, both macro F-score and accuracy are reported.
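As a worked illustration of the equal-weight loss, the sketch below sums categorical cross-entropy over two tasks for a single example. The probability vectors and function names are hypothetical; only the equal weighting and the choice of cross-entropy come from the description above.

```python
import math

def cross_entropy(probs, true_index):
    """Categorical cross-entropy for one example: -log of the true class probability."""
    return -math.log(probs[true_index])

def multitask_loss(task_losses, weights=None):
    """Weighted sum of per-task losses; defaults to equal weights of 1.0."""
    if weights is None:
        weights = [1.0] * len(task_losses)
    return sum(w * loss for w, loss in zip(weights, task_losses))

# Toy softmax outputs for one training example.
veracity_loss = cross_entropy([0.7, 0.2, 0.1], true_index=0)     # True / False / Unverified
stance_loss = cross_entropy([0.1, 0.1, 0.2, 0.6], true_index=3)  # S / D / Q / C
total_loss = multitask_loss([veracity_loss, stance_loss])
print(round(total_loss, 3))  # 0.868
```

Passing a different `weights` list is where a learned or scheduled task weighting would plug in.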

Features and preprocessing

Word representations: word2vec embeddings (300d) from Google News
Aggregation: Mean pooling of word embeddings per tweet
Hyperparameters: Optimized via Tree of Parzen Estimators (TPE):
Dense ReLU layers: 1–4
LSTM units: 100–300
Dropout: 50% before output
L2 regularization: 10^{-3}–10^{-5}
Mini-batch: 32; epochs: 50
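The ranges above can be written down as a search space. The sketch below draws configurations by plain random sampling, which is a stand-in and not the Tree of Parzen Estimators used in the paper; the integer sampling of units and layers and the log-uniform draw for L2 are assumptions, and `sample_config` is a hypothetical helper.

```python
import random

def sample_config(rng):
    """Draw one hyperparameter configuration from the ranges listed above."""
    return {
        "dense_relu_layers": rng.randint(1, 4),   # 1 to 4 layers, inclusive
        "lstm_units": rng.randint(100, 300),      # assumed integer-uniform
        "l2": 10 ** rng.uniform(-5, -3),          # assumed log-uniform in [1e-5, 1e-3]
        "dropout": 0.5,                           # fixed: 50% before output
        "batch_size": 32,                         # fixed
        "epochs": 50,                             # fixed
    }

rng = random.Random(42)
configs = [sample_config(rng) for _ in range(20)]
```

A TPE-based search would replace the random draws with proposals informed by the scores of earlier trials, over the same space.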

Results

RumourEval dataset

Metric	Majority Baseline	NileTMRG*	branchLSTM	MTL2 (Veracity+Stance)	MTL2 (Veracity+Detection)	MTL3 (all)
Macro F	0.148	0.539	0.491	0.558	0.571	—
Accuracy	0.286	0.570	0.500	0.571	—	—

MTL2 (Veracity+Stance) achieves 0.558 macro F on veracity classification, a 13.6% relative improvement over the single-task branchLSTM baseline (0.491).
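The gap between the majority baseline's accuracy (0.286) and its macro F (0.148) follows from how macro F averages per-class scores. The sketch below reproduces that pattern on a hypothetical test set of 28 threads in which the predicted class covers 8 of them; the label counts are chosen for illustration and are not taken from the dataset.

```python
def accuracy(gold, predicted):
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical label counts: 8 + 10 + 10 = 28 threads.
gold = ["false"] * 8 + ["true"] * 10 + ["unverified"] * 10
predicted = ["false"] * 28           # a majority-class baseline predicts one label only

print(round(accuracy(gold, predicted), 3))  # 0.286
print(round(macro_f1(gold, predicted), 3))  # 0.148
```

The two classes that are never predicted each contribute an F1 of zero, so macro F is one third of the single non-zero class score. This is why macro F is the more informative metric on imbalanced veracity labels.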

PHEME 5 largest events

Metric	Majority	NileTMRG*	branchLSTM	MTL2 (Veracity+Stance)	MTL2 (Veracity+Detection)	MTL3 (all)
Macro F	0.226	0.339	0.454	0.441	0.410	0.492
Accuracy	0.511	0.438	0.454	0.441	0.410	0.492

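MTL3 (all three tasks) achieves 0.492 macro F on veracity, a relative improvement of about 8.4% over branchLSTM (0.454). In this table both two-task variants score below the single-task baseline, so the gain appears only when all three tasks are trained jointly. No MTL3 result is listed for RumourEval, so the effect of adding tasks beyond stance cannot be compared there.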

PHEME 9 events (full dataset)

Metric	Majority	NileTMRG*	branchLSTM	MTL2 (Veracity+Stance)	MTL2 (Veracity+Detection)	MTL3 (all)
Macro F	0.203	0.297	0.314	0.357	0.397	0.405
Accuracy	0.444	0.360	0.314	0.357	0.397	0.405

MTL3 achieves 0.405 macro F, a relative improvement of about 29% over the single-task branchLSTM (0.314). Detection is the stronger auxiliary task on the larger 9-event set: Veracity+Detection (0.397) outscores Veracity+Stance (0.357).
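The improvement percentages quoted in this section are relative gains over the branchLSTM score, not differences in percentage points. A small helper (the name `relative_gain` is hypothetical) makes the arithmetic explicit, using the macro F values from the tables above.

```python
def relative_gain(baseline, score):
    """Relative improvement of score over baseline, as a fraction of the baseline."""
    return (score - baseline) / baseline

# (branchLSTM macro F, best multi-task macro F) as listed in the tables above.
results = {
    "RumourEval, MTL2 Veracity+Stance": (0.491, 0.558),
    "PHEME 5 events, MTL3": (0.454, 0.492),
    "PHEME 9 events, MTL3": (0.314, 0.405),
}

for name, (baseline, score) in results.items():
    points = (score - baseline) * 100
    print(f"{name}: +{points:.1f} points, {relative_gain(baseline, score) * 100:.1f}% relative")
# RumourEval, MTL2 Veracity+Stance: +6.7 points, 13.6% relative
# PHEME 5 events, MTL3: +3.8 points, 8.4% relative
# PHEME 9 events, MTL3: +9.1 points, 29.0% relative
```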

Per-event analysis

Performance varies significantly by event. Charlie Hebdo (0.327 MTL3), Sydney siege (0.350), and Ferguson (0.189) are hardest due to class imbalance and small event size. Germanwings-crash (0.429) shows strong multi-task gains when events contain all three label classes.
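Per-event scores of this kind come from leave-one-event-out cross-validation, where each event is held out in turn and the model is trained on the rest. A minimal sketch of the fold construction follows; the `(event, thread_id)` pair format, the event keys, and the thread ids are hypothetical.

```python
def leave_one_event_out(threads):
    """Yield (held_out_event, train, test) for threads given as (event, thread_id) pairs."""
    events = sorted({event for event, _ in threads})
    for held_out in events:
        train = [t for t in threads if t[0] != held_out]
        test = [t for t in threads if t[0] == held_out]
        yield held_out, train, test

# Toy thread list; ids are placeholders.
threads = [
    ("charliehebdo", 1), ("charliehebdo", 2),
    ("ferguson", 3),
    ("sydneysiege", 4),
]

folds = list(leave_one_event_out(threads))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
# charliehebdo 2 2
# ferguson 3 1
# sydneysiege 3 1
```

Since no thread from the held-out event is seen during training, each fold measures how well the model transfers to an unseen news story, which helps explain the large spread across events.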

Connections

Detection and Resolution of Rumours in Social Media: A Survey — comprehensive review of the four-step rumor classification pipeline this work implements
Turing at SemEval-2017: Sequential Approach to Rumour Stance Classification with Branch-LSTM — introduces the branch-LSTM architecture extended here for multi-task learning
Stance classification — related topic on stance detection as auxiliary task for rumor verification
SemEval-2017 Task 8: RumourEval — shared task defining the RumourEval dataset used in evaluation
RumourEval 2019 — follow-up shared task extending rumor stance and verification to Reddit data
Multi-task learning — broader topic on multi-task learning approaches in NLP

Notes

Strengths:

Systematic exploration of multi-task learning configurations on rumor verification; shows the benefits of auxiliary task selection
Strong empirical improvements, especially on PHEME (about 29% relative on 9 events)
Thoughtful analysis linking dataset properties (kurtosis, entropy, TTR) to multi-task gains: lower kurtosis (more balanced label distributions) benefits more from auxiliary tasks
Clear demonstration that not all auxiliary tasks help equally (whether stance or detection is the better auxiliary task depends on the dataset)
Reproducible methodology with hyperparameter tuning via TPE

Limitations:

Models achieve lower performance on the full 9-event PHEME set than on the 5 largest events (the additional events are smaller and likely have imbalanced labels), constraining generalization claims
RumourEval improvements are modest (0.491 → 0.558), and the single-task branchLSTM is already competitive with multi-task models in some configurations
No analysis of which shared representations are learned or what features the auxiliary tasks extract
Limited to Twitter data; unclear if findings transfer to Reddit (tested in RumourEval 2019)
Veracity prediction still lags behind stance classification (0.558 F is much lower than 0.78+ stance F-scores); multi-task learning helps but the core challenge remains
No ablation on task loss weighting: equal weights may not be optimal, and the model could benefit from loss scheduling or learned task weights
Post-SemEval setup: uses different rumor detection labels than standard PHEME, making direct comparison to prior work difficult

Follow-up work:

RumourEval 2019 extended the shared task to Reddit data and cross-platform settings, with newer datasets showing continued gains for auxiliary tasks
Later work explores hierarchical multi-task learning and learned task weighting strategies
Joint stance+veracity learning has become a common setup in rumor verification shared tasks