RumourEval 2019: Determining Rumour Veracity and Support for Rumours
Authors: Genevieve Gorrell, Elena Kochkina, Maria Liakata, Ahmet Aker, Arkaitz Zubiaga, Kalina Bontcheva, Leon Derczynski
Venue: Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019), pages 845–854
TL;DR
RumourEval 2019 is a shared task on rumour stance detection and veracity prediction on social media. Extending the 2017 edition, it adds Reddit data alongside Twitter and provides datasets annotated for SDQC stance (Support/Deny/Query/Comment) and three-way veracity labels. It received 22 system submissions, and the top performers on both tasks relied on neural approaches with pre-trained contextual embeddings.
Contributions
- Extended benchmark dataset combining Twitter and Reddit rumours from breaking news events, annotated for both stance and veracity
- Two interconnected subtasks: Subtask A (stance classification) provides 8,574 labeled conversation posts; Subtask B (veracity prediction) adds 446 source rumours labeled as true/false/unverified
- Evaluation methodology using macro-averaged F1 to account for class imbalance rather than accuracy
- Baseline systems including branchLSTM (RumourEval 2017 winner) and NileTMRG linear SVM approaches
- Experimental validation of 22 systems demonstrating progress in both tasks, with top performers combining neural architectures with pre-trained contextual representations
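The choice of macro-averaged F1 over accuracy can be illustrated with a minimal sketch. The label distribution below is a toy example, not the task data, and `macro_f1` is a hypothetical helper written for illustration rather than the organisers' scorer. It shows how a classifier that always predicts the majority class can reach high accuracy while scoring poorly on macro-F1.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)  # class never predicted correctly
    return sum(scores) / len(scores)


LABELS = ["support", "deny", "query", "comment"]

# Toy, imbalanced gold labels (assumed for illustration only).
gold = ["comment"] * 80 + ["support"] * 10 + ["deny"] * 5 + ["query"] * 5
# A majority-class classifier predicts "comment" for every post.
pred = ["comment"] * len(gold)

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
score = macro_f1(gold, pred, LABELS)
print(f"accuracy={accuracy:.4f} macro-F1={score:.4f}")
```

Because each class contributes equally to the average, the three classes that are never predicted pull the score down regardless of how frequent the majority class is.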
Method
The shared task frames rumour verification as a two-stage process:
Subtask A (Stance Classification): Given a source post containing a rumourous claim and a conversation thread of replies, classify each reply post's stance toward the rumour as one of four categories:
- Support: the reply agrees with or endorses the rumour
- Deny: the reply contradicts the rumour
- Query: the reply questions or seeks clarification about the rumour
- Comment: the reply is related to the rumour but does not take a stance
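As a rough illustration of the input to Subtask A, a conversation thread can be modelled as a tree rooted at the source post. The `Post` class and `branches` function below are hypothetical names introduced for this sketch. Splitting the tree into root-to-leaf branches is one way to linearise a thread for sequence models, assumed here for illustration and not taken from the task description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Post:
    """A post in a conversation thread (hypothetical structure)."""
    post_id: str
    text: str
    replies: List["Post"] = field(default_factory=list)


def branches(post, prefix=()):
    """Return every root-to-leaf path of post ids in the thread."""
    path = prefix + (post.post_id,)
    if not post.replies:
        return [list(path)]
    result = []
    for reply in post.replies:
        result.extend(branches(reply, path))
    return result


# Toy thread: source "s" has replies "r1" and "r2"; "r3" replies to "r1".
source = Post("s", "Rumourous claim", [
    Post("r1", "Is this confirmed?", [Post("r3", "No source yet.")]),
    Post("r2", "This is false."),
])

print(branches(source))
```

Each reply appears in at least one branch together with the posts it responds to, so a classifier can see its conversational context.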
Subtask B (Veracity Prediction): Given the source post and optionally the discussion thread, classify the rumour's veracity as:
- True: the rumour is factually accurate
- False: the rumour is factually incorrect
- Unverified: the veracity cannot be determined

Systems also return a confidence score with each prediction; a confidence of 0 corresponds to unverified.
Data comes from two distinct platforms:
- Twitter: 325 training rumours (145 true, 74 false, 106 unverified) with 5,568 annotated replies; 56 test rumours (22 true, 30 false, 4 unverified) with 1,066 annotated replies. Source tweets were selected from debunking websites (Snopes, Politifact) about natural disasters.
- Reddit: 40 training threads (9 true, 24 false, 7 unverified) with 1,134 annotated replies; 25 test threads (9 true, 10 false, 6 unverified) with 806 annotated replies. Conversations are deeper and more complex than on Twitter, and rumours are often implicitly queried rather than asserted.
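The split sizes listed above can be cross-checked against the totals quoted for the two subtasks (8,574 annotated posts and 446 source rumours). The snippet below is a simple tally; the dictionary layout is an assumption made for this sketch, while the numbers are copied from the figures above.

```python
# Per-split counts copied from the dataset description above.
splits = {
    ("twitter", "train"): {"true": 145, "false": 74, "unverified": 106, "replies": 5568},
    ("twitter", "test"): {"true": 22, "false": 30, "unverified": 4, "replies": 1066},
    ("reddit", "train"): {"true": 9, "false": 24, "unverified": 7, "replies": 1134},
    ("reddit", "test"): {"true": 9, "false": 10, "unverified": 6, "replies": 806},
}

VERACITY = ("true", "false", "unverified")


def rumour_count(split):
    """Number of source rumours in a split, summed over veracity labels."""
    return sum(split[label] for label in VERACITY)


total_rumours = sum(rumour_count(s) for s in splits.values())
total_replies = sum(s["replies"] for s in splits.values())
test_rumours = sum(
    rumour_count(s) for (_, part), s in splits.items() if part == "test"
)

print(total_rumours, total_replies, test_rumours)
```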
Stance annotation for Twitter test data used crowdsourcing via FigureEight with 10 annotators per tweet, 70% agreement threshold, achieving 76.2% macro-agreement. Reddit annotation required stricter annotator training (51-question quiz) due to complex conversational structure; 78% macro-agreement achieved with 3.84 annotations per post on average.
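An agreement threshold of this kind can be sketched as follows. The function name `aggregate` and the exact rule (accept the most frequent label only if its share of judgements reaches the threshold) are assumptions for illustration; the source does not spell out how the 70% threshold was applied.

```python
from collections import Counter


def aggregate(judgements, threshold=0.7):
    """Return the most frequent label if its share meets the threshold.

    Hypothetical aggregation rule: returns None when no label reaches
    the required level of agreement or when there are no judgements.
    """
    if not judgements:
        return None
    label, count = Counter(judgements).most_common(1)[0]
    return label if count / len(judgements) >= threshold else None


# Ten annotators, seven agree: the label is accepted at a 70% threshold.
accepted = aggregate(["comment"] * 7 + ["query"] * 3)
# Only six of ten agree: the item is left without a label.
rejected = aggregate(["comment"] * 6 + ["query"] * 4)

print(accepted, rejected)
```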
Veracity labels were sourced from professional fact-checking organizations (Snopes, Politifact) for Twitter and verified by community experts for Reddit, departing from the 2017 manual annotation approach.
Results
Subtask A (Stance Detection):
- Best system (BLCU NLP): macro-F1 = 0.6187
- Three systems outperformed the branchLSTM baseline (macro-F1 = 0.4929): BLCU NLP, BUT-FIT, eventAI
- All 22 systems attempted this task
- 50% of systems exceeded the majority baseline (0.2234 macro-F1)
Subtask B (Veracity Prediction):
- Best system (eventAI): macro-F1 = 0.5765, RMSE = 0.6078
- Only eventAI beat both baseline systems (NileTMRG at 0.3089, branchLSTM at 0.3364)
- 13 of 22 systems attempted this harder task
- Over 60% of systems outperformed the majority baseline (0.2241 macro-F1)
Key observations:
- Systems tended to specialize in one task: the best Subtask A performer (BLCU NLP) ranked 4th in Subtask B, and the best Subtask B performer (eventAI) ranked 3rd in Subtask A
- Neural network approaches dominated (21 of 22 systems); the Subtask B winner used an SVM+RF ensemble, not a pure neural model
- Effective architectures: pre-trained contextual embeddings (BERT, GPT, ELMo); inference-chain models considering full conversation sequences; ensemble methods combining multiple features
- RMSE correlated more strongly with macro-F1 (−0.92) than with accuracy (−0.77), validating confidence-score evaluation
Connections
- Extends RumourEval 2017 with new data and experimental design to encourage information-rich approaches
- Related to stance detection literature via SDQC classification framework
- Benchmark for propagation-based detection methods using conversation threads
- Dataset contribution to rumour verification research alongside FEVER and other shared tasks
- Demonstrated effectiveness of contextual embeddings (BERT, GPT) connects to broader NLP detection methodology
Notes
Strengths: Large-scale shared task with 70% participation increase from 2017; diverse data spanning two platforms reduces overfitting to Twitter-specific phenomena; rigorous crowdsourced annotation with high agreement thresholds. The inclusion of Reddit captures deeper, more exploratory discussions compared to Twitter's reactive replies.
Weaknesses: Class imbalance persists despite macro-F1 evaluation (80% of Twitter test data is "comment" stance; 50% "false" veracity). Additional context is limited: the context provided in 2017 was dropped due to time constraints, even though it had improved 2017 results. Results on the relatively small test sets (81 rumours for the veracity task) may not generalize broadly. The data is English-only despite prior work in other languages.
Future work: The paper identifies multilingual extension and richer temporal/discourse context modeling as critical next steps. Systems exploiting conversation structure (inference chains, discourse graphs) showed promise but remain underexplored relative to content-only approaches.