Skip to content
Turing at SemEval-2017 Task 8: Sequential Approach to Rumour Stance Classification with Branch-LSTM

Turing at SemEval-2017 Task 8: Sequential Approach to Rumour Stance Classification with Branch-LSTM

Authors: Elena Kochkina, Maria Liakata, Isabelle Augenstein

Affiliation: University of Warwick, Alan Turing Institute, University College London

Venue: Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval-2017), August 3–4, 2017

arXiv: 1704.07221

TL;DR

This paper proposes Branch-LSTM, a sequential neural model for rumour stance classification in Twitter conversations. By decomposing conversation threads into linear branches and modeling them with LSTM layers, the approach captures conversational context while assigning stance labels (Support/Deny/Query/Comment) to each tweet. The system achieved 78.4% accuracy on RumourEval 2017 Subtask A, outperforming all 12 competing systems.

Contributions

  • Branch-LSTM architecture: A novel LSTM-based approach that decomposes tree-structured conversations into linear branches to exploit sequential structure
  • State-of-the-art performance: Achieves 78.4% accuracy on RumourEval 2017 test set, setting the best results in Subtask A
  • Feature engineering: Combines word2vec embeddings with hand-crafted lexical, syntactic, and relational features (negation counts, punctuation, URL presence, similarity to source/preceding tweets)
  • Systematic analysis: Detailed error analysis showing strengths on majority classes (commenting: 87.3% F-score) and challenges with imbalanced classes (denying: 0% F-score)

Method

Architecture

The branch-LSTM processes tweet sequences within conversation branches:

  1. Branch decomposition: Each conversation thread is a tree structure. A branch is defined as a chain from a leaf tweet back through its parent chain to the source tweet. Overlapping tweets are masked at training to avoid duplicate loss computation.

  2. Tweet representation: Each tweet is represented as the mean of its word2vec embeddings (pre-trained on Google News, 300d) concatenated with hand-crafted features:

  3. Lexical: negation word count, swear word count
  4. Syntactic: period, exclamation mark, question mark, capital letter ratio
  5. Attachments: URL presence, image presence
  6. Relational: cosine similarity to source tweet, preceding tweet, and entire thread (using word2vec)
  7. Content: word count, character count
  8. Role: whether the tweet is the source tweet

  9. Sequential modeling: Input at each time step is a tweet representation vector. The LSTM layer processes the sequence and outputs predictions at each step. Outputs feed through dense ReLU layers (1–4 layers), 50% dropout, and softmax to produce class probabilities.

  10. Training: Categorical cross-entropy loss with class masking to handle branch overlap. Hyperparameters optimized via Tree of Parzen Estimators (TPE) over 100 trials.

Dataset

RumourEval 2017 dataset (Derczynski et al. 2017): - Training: 272 threads, 3,030 branches, 4,238 tweets - Development: 25 threads, 215 branches, 281 tweets
- Test: 28 threads, 772 branches, 1,049 tweets - Class distribution: Comments (66%), Support (18%), Deny (8%), Query (8%)

Results

Metric Development Testing
Accuracy 0.782 0.784
Macro F 0.561 0.434

Per-class F-scores (test set): - Support: 0.403 - Deny: 0.000 - Query: 0.462
- Comment: 0.873

Error analysis

  • Strong performance on source tweets (depth 0): 92.9% accuracy; source tweets are predominantly supporting
  • Declining performance at deeper depths: Supporting tweet classification at non-root positions drops significantly
  • Deny class failure: 68 denying tweets misclassified as commenting; only 1 misclassified as querying, 2 as supporting. Indicates severe class imbalance challenge.
  • Comments dominate: 760/1,049 test predictions are comments; model overrelies on majority class despite focal losses explored

Comparison to participating systems

The Turing submission (78.4%) outperformed 12 other systems in Subtask A, with the second-place UWaterloo system achieving 78.0%.

Connections

Notes

Strengths: - Effective architecture for exploiting conversation structure; sequential branching is intuitive and practical - Thorough hyperparameter tuning via TPE; reproducible methodology - Detailed per-depth performance analysis revealing architectural strengths/weaknesses - Strong empirical results with clear state-of-the-art performance - Feature combinations well-motivated: relational features (similarity to source/preceding tweets) are particularly relevant for stance in context

Limitations: - Severe class imbalance prevents learning of minority classes (denying class F-score is 0.0) - Limited to single-branch decomposition—tree structure information beyond linear chains is discarded - Hand-crafted features may not generalize; no ablation study on feature importance - Bidirectional LSTM variants offered no improvement, but analysis is limited - No cross-event generalization analysis; all evaluation is in-domain - Post-hoc fixes for class imbalance (masking) are modest; authors note more labeled data would help - Unclear why query class (F: 0.462) performs better than deny despite similar train/test representation

Follow-up directions: - The denying class remains a challenge for subsequent work; later RumourEval editions continue to struggle with this minority class - Tree-structured models (e.g., Tree-LSTMs, GNNs) might better capture the full conversation structure than linear branching - Joint learning with veracity prediction (Subtask B) could provide auxiliary signals to help stance detection