RumourEval 2019: Determining Rumour Veracity and Support for Rumours

Authors: Genevieve Gorrell, Elena Kochkina, Maria Liakata, Ahmet Aker, Arkaitz Zubiaga, Kalina Bontcheva, Leon Derczynski

Venue: Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019), pages 845–854

TL;DR

RumourEval 2019 is a shared task on rumour stance detection and veracity prediction in social media. Extending the 2017 edition, it adds Reddit data alongside Twitter and provides datasets annotated with SDQC stance labels (Support/Deny/Query/Comment) and three-way veracity labels. The task received 22 system submissions, which show that neural approaches built on pre-trained contextual embeddings advance both tasks.

Contributions

  • Extended benchmark dataset combining Twitter and Reddit rumours from breaking news events, annotated for both stance and veracity
  • Two interconnected subtasks: Subtask A (stance classification) provides 8,574 labeled conversation posts; Subtask B (veracity prediction) adds 446 source rumours labeled as true/false/unverified
  • Evaluation methodology using macro-averaged F1 to account for class imbalance rather than accuracy
  • Baseline systems including branchLSTM (RumourEval 2017 winner) and NileTMRG linear SVM approaches
  • Experimental validation of 22 systems demonstrating progress in both tasks, with top performers combining neural architectures with pre-trained contextual representations
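
The case for macro-averaged F1 over accuracy can be illustrated with a toy example. The sketch below is plain Python on made-up labels (not the task data); the function names and the convention of scoring a class with no predictions and no gold instances as 0 are assumptions for illustration, not the official scorer.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        # Assumed convention: F1 is 0 when precision + recall is 0.
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(gold, pred):
    """Fraction of posts whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


LABELS = ["support", "deny", "query", "comment"]

# Toy distribution (hypothetical): 8 of 10 posts are "comment".
gold = ["comment"] * 8 + ["support", "deny"]
# A majority-class "system" that always answers "comment".
majority = ["comment"] * len(gold)

print("accuracy:", accuracy(gold, majority))
print("macro-F1:", round(macro_f1(gold, majority, LABELS), 4))
```

On this toy data the always-"comment" predictor looks strong on accuracy but weak on macro-F1, because three of the four classes contribute an F1 of zero to the average.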

Method

The shared task frames rumour verification as a two-stage process:

Subtask A (Stance Classification): Given a source post containing a rumourous claim and a conversation thread of replies, classify each reply post's stance toward the rumour as one of four categories:

  • Support: the reply agrees with or endorses the rumour
  • Deny: the reply contradicts the rumour
  • Query: the reply questions or seeks clarification about the rumour
  • Comment: the reply is related to the rumour but does not take a stance
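
Because each reply is classified in the context of its thread, a common preprocessing step is to split the reply tree into linear root-to-leaf branches that a sequence model can read in order. The sketch below shows one way to do this in Python. The `replies_to` mapping, the post ids, and the function name are hypothetical, and treating branches as the model input is an assumption about how branch-based systems consume threads rather than a description of any particular submission.

```python
def thread_branches(replies_to, root):
    """Return every root-to-leaf path in a reply tree.

    replies_to maps each reply id to the id of the post it answers.
    """
    # Invert the child -> parent mapping into parent -> children.
    children = {}
    for post, parent in replies_to.items():
        children.setdefault(parent, []).append(post)

    branches = []
    stack = [[root]]  # each stack entry is a partial path from the root
    while stack:
        path = stack.pop()
        kids = children.get(path[-1], [])
        if not kids:
            branches.append(path)  # reached a leaf: the branch is complete
        for kid in kids:
            stack.append(path + [kid])
    return branches


# Hypothetical thread: r1 and r2 answer the source, r3 answers r1, r4 answers r3.
replies_to = {"r1": "src", "r2": "src", "r3": "r1", "r4": "r3"}
branches = thread_branches(replies_to, "src")
for branch in sorted(branches):
    print(" -> ".join(branch))
```

Posts near the root appear in several branches, so a system that predicts per branch needs a rule for merging the repeated predictions for the same post.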

Subtask B (Veracity Prediction): Given the source post and optionally the discussion thread, classify the rumour's veracity as:

  • True: the rumour is factually accurate
  • False: the rumour is factually incorrect
  • Unverified: the veracity cannot be determined

Systems return a confidence score alongside each label, with a confidence of 0 for unverified rumours.
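
To make the role of the confidence score concrete, the sketch below computes an RMSE over per-rumour confidence errors. The error rule used here is an assumption for illustration only (a correct label is penalised by 1 minus its confidence, a wrong label gets the maximal error of 1, and for gold-unverified rumours the ideal confidence is 0); it is not claimed to match the official scorer, and the function name is hypothetical.

```python
import math


def confidence_rmse(gold, predictions):
    """Root-mean-square error over per-rumour confidence errors.

    gold: list of "true" / "false" / "unverified"
    predictions: list of (label, confidence) pairs, confidence in [0, 1]
    """
    squared = []
    for truth, (label, confidence) in zip(gold, predictions):
        if label == "unverified":
            confidence = 0.0          # unverified is reported as confidence 0
        if truth == "unverified":
            error = confidence        # assumed: the ideal confidence is 0
        elif label == truth:
            error = 1.0 - confidence  # right label, penalise low confidence
        else:
            error = 1.0               # assumed: wrong label gets maximal error
        squared.append(error ** 2)
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical predictions for three rumours.
gold = ["true", "false", "unverified"]
predictions = [("true", 0.9), ("true", 0.6), ("unverified", 0.3)]
print(round(confidence_rmse(gold, predictions), 4))
```

Lower values are better under this rule, which is why such a score moves in the opposite direction to macro-F1.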

Data comes from two distinct platforms:

  • Twitter: 325 training rumours (145 true, 74 false, 106 unverified) with 5,568 annotated replies; 56 test rumours (22 true, 30 false, 4 unverified) with 1,066 annotated replies. Source tweets were selected from debunking websites (Snopes, Politifact) and concern natural disasters.
  • Reddit: 40 training threads (9 true, 24 false, 7 unverified) with 1,134 annotated replies; 25 test threads (9 true, 10 false, 6 unverified) with 806 annotated replies. Conversations are deeper and more complex than on Twitter, and rumours are often implicitly queried rather than asserted.
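
As a quick consistency check, the per-split counts above can be tallied against the totals quoted under Contributions (446 source rumours, 8,574 annotated posts). The snippet below is a small Python sketch; the dictionary layout and variable names are illustrative choices, and the numbers are copied from this summary rather than recomputed from the released data.

```python
# (true, false, unverified, annotated replies) per split, as listed in this summary.
SPLITS = {
    ("twitter", "train"): (145, 74, 106, 5568),
    ("twitter", "test"): (22, 30, 4, 1066),
    ("reddit", "train"): (9, 24, 7, 1134),
    ("reddit", "test"): (9, 10, 6, 806),
}

# Sum the three veracity classes to get rumours per split, then overall.
total_rumours = sum(sum(counts[:3]) for counts in SPLITS.values())
total_replies = sum(counts[3] for counts in SPLITS.values())

# Restrict to the test portion of both platforms.
test_counts = [c for (_, part), c in SPLITS.items() if part == "test"]
test_rumours = sum(sum(c[:3]) for c in test_counts)
test_false_share = sum(c[1] for c in test_counts) / test_rumours

print(total_rumours, total_replies, test_rumours, round(test_false_share, 3))
```

The tallies agree with the figures quoted elsewhere in these notes: 446 rumours, 8,574 annotated posts, and 81 test rumours of which roughly half are false.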

Stance annotation for the Twitter test data was crowdsourced via FigureEight with 10 annotators per tweet and a 70% agreement threshold, achieving 76.2% macro-agreement. Reddit annotation required stricter annotator training (a 51-question quiz) because of the more complex conversational structure; it reached 78% macro-agreement with 3.84 annotations per post on average.
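
An agreement threshold of this kind can be sketched as a simple aggregation rule: accept the most frequent label only if its share of the judgements reaches the threshold. The Python below is one reading of such a rule. The function name, the handling of items that fall below the threshold (returned as `None`), and the example judgements are all assumptions, since the exact procedure is not described here.

```python
from collections import Counter


def aggregate(judgements, threshold=0.7):
    """Return the majority label if its share meets the threshold, else None."""
    label, votes = Counter(judgements).most_common(1)[0]
    return label if votes / len(judgements) >= threshold else None


# Hypothetical judgements from 10 annotators on two tweets.
clear_case = ["comment"] * 7 + ["query"] * 3
contested_case = ["comment"] * 6 + ["query"] * 4

print(aggregate(clear_case))      # majority share is 7/10
print(aggregate(contested_case))  # majority share is 6/10, below the threshold
```

Under this rule, contested items need a follow-up step such as collecting more judgements or expert adjudication.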

Veracity labels were sourced from professional fact-checking organizations (Snopes, Politifact) for Twitter and verified by community experts for Reddit, departing from the manual annotation approach used in 2017.

Results

Subtask A (Stance Detection):

  • Best system (BLCU NLP): macro-F1 = 0.6187
  • Three systems outperformed the branchLSTM baseline (macro-F1 = 0.4929): BLCU NLP, BUT-FIT, eventAI
  • All 22 systems attempted this task
  • 50% of systems exceeded the majority baseline (0.2234 macro-F1)

Subtask B (Veracity Prediction):

  • Best system (eventAI): macro-F1 = 0.5765, RMSE = 0.6078
  • Only eventAI beat both baseline systems (NileTMRG at 0.3089, branchLSTM at 0.3364)
  • 13 of 22 systems attempted this harder task
  • Over 60% of systems outperformed the majority baseline (0.2241 macro-F1)

Key observations:

  • Systems tended to specialize in one task: the best Subtask A performer (BLCU NLP) ranked 4th in Subtask B, and the best Subtask B performer (eventAI) ranked 3rd in Subtask A
  • Neural network approaches dominated (21 of 22 systems), although the Subtask B winner used an SVM+RF ensemble rather than a purely neural model
  • Effective ingredients included pre-trained contextual embeddings (BERT, GPT, ELMo), inference-chain models considering full conversation sequences, and ensemble methods combining multiple features
  • RMSE showed stronger correlation with macro-F1 (−0.92) than accuracy (−0.77), validating confidence-score evaluation
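
Correlation figures like those in the last observation are typically Pearson coefficients computed across systems. The sketch below shows the calculation in plain Python. The scores are hypothetical values for four imaginary systems, not results from the task, and the use of Pearson correlation specifically is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-system scores (NOT results from the shared task).
rmse_scores = [0.60, 0.70, 0.80, 0.90]
macro_f1_scores = [0.55, 0.40, 0.35, 0.20]

r = pearson(rmse_scores, macro_f1_scores)
print(round(r, 2))
```

A negative coefficient is the expected sign here, since a lower RMSE and a higher macro-F1 both indicate a better system.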

Connections

Notes

Strengths: Large-scale shared task with 70% participation increase from 2017; diverse data spanning two platforms reduces overfitting to Twitter-specific phenomena; rigorous crowdsourced annotation with high agreement thresholds. The inclusion of Reddit captures deeper, more exploratory discussions compared to Twitter's reactive replies.

Weaknesses: Class imbalance persists despite macro-F1 evaluation (80% of Twitter test data is "comment" stance; 50% of test rumours are "false"). The additional context offered in 2017 was dropped due to time constraints, even though it had improved 2017 results. Results on the relatively small test set (81 rumours for the veracity task) may not generalize broadly. The data is English-only despite prior work in other languages.

Future work: The paper identifies multilingual extension and richer temporal/discourse context modeling as critical next steps. Systems exploiting conversation structure (inference chains, discourse graphs) showed promise but remain underexplored relative to content-only approaches.