RumourEval 2019: Determining Rumour Veracity and Support for Rumours

Authors: Genevieve Gorrell, Elena Kochkina, Maria Liakata, Ahmet Aker, Arkaitz Zubiaga, Kalina Bontcheva, Leon Derczynski

Venue: Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019), pages 845–854

TL;DR

RumourEval 2019 is a shared task for rumour stance detection and veracity prediction on social media. Extending the 2017 edition, it introduces Reddit data alongside Twitter and provides datasets annotated for SDQC stance (Support/Deny/Query/Comment) and three-way veracity labels. It received 22 system submissions, whose results show that modern neural approaches with pre-trained contextual embeddings advance both tasks.

Contributions

- Extended benchmark dataset combining Twitter and Reddit rumours from breaking news events, annotated for both stance and veracity
- Two interconnected subtasks: Subtask A (stance classification) provides 8,574 labeled conversation posts; Subtask B (veracity prediction) adds 446 source rumours labeled as true/false/unverified
- Evaluation methodology using macro-averaged F1 to account for class imbalance rather than accuracy
- Baseline systems including branchLSTM (RumourEval 2017 winner) and NileTMRG linear SVM approaches
- Experimental validation of 22 systems demonstrating progress in both tasks, with top performers combining neural architectures with pre-trained contextual representations

Method

The shared task frames rumour verification as a two-stage process:

Subtask A (Stance Classification): Given a source post containing a rumourous claim and a conversation thread of replies, classify each reply post's stance toward the rumour as one of four categories:
- Support: the reply agrees with or endorses the rumour
- Deny: the reply contradicts the rumour
- Query: the reply questions or seeks clarification about the rumour
- Comment: the reply is related to the rumour but does not take a stance
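To make the input structure concrete, here is a minimal sketch of how a thread with SDQC labels might be represented and unrolled into root-to-leaf branches. The `Post` class, its field names, the `branches` helper, and the example posts are all hypothetical and chosen for illustration; they do not reflect the task's actual data format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stance(Enum):
    SUPPORT = "support"
    DENY = "deny"
    QUERY = "query"
    COMMENT = "comment"


@dataclass
class Post:
    # Hypothetical schema, not the task's released file format.
    post_id: str
    text: str
    parent_id: Optional[str] = None  # None marks the source post
    stance: Optional[Stance] = None  # gold label, if annotated


def branches(posts):
    """Return every root-to-leaf path of post ids in the thread."""
    children = {}
    for post in posts:
        children.setdefault(post.parent_id, []).append(post.post_id)

    paths = []

    def walk(post_id, path):
        path = path + [post_id]
        kids = children.get(post_id, [])
        if not kids:
            paths.append(path)
        for kid in kids:
            walk(kid, path)

    for root in children.get(None, []):
        walk(root, [])
    return paths


# Invented example thread: one source post and three replies.
thread = [
    Post("s", "Rumourous claim ..."),
    Post("r1", "Is there a source for this?", parent_id="s", stance=Stance.QUERY),
    Post("r2", "This is false.", parent_id="s", stance=Stance.DENY),
    Post("r3", "What a day.", parent_id="r2", stance=Stance.COMMENT),
]
print(branches(thread))
```

Unrolling into branches is one convenient way to give a sequence model each reply together with the posts it responds to; other representations (flat lists, full trees) are equally possible.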

Subtask B (Veracity Prediction): Given the source post and optionally the discussion thread, classify the rumour's veracity as:
- True: the rumour is factually accurate
- False: the rumour is factually incorrect
- Unverified: the veracity cannot be determined

Systems return a confidence score with each prediction; a confidence of 0 marks the rumour as unverified.
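The sketch below shows how a confidence-aware error such as RMSE could be computed over (label, confidence) predictions. The error rule used here is an assumption made for illustration, not the official scorer: a correct true/false call is penalised by its missing confidence, any confidence placed on an unverified rumour counts as error, and a wrong call receives the maximal error. The function name and the example rumours are likewise hypothetical.

```python
import math


def rmse_score(gold, predictions):
    """RMSE over rumours under an assumed (not official) error rule."""
    errors = []
    for rumour_id, gold_label in gold.items():
        label, confidence = predictions[rumour_id]
        if gold_label == "unverified":
            # Any confidence above 0 on an unverified rumour counts as error.
            errors.append(confidence ** 2)
        elif label == gold_label:
            # Correct true/false call: penalise the missing confidence.
            errors.append((1.0 - confidence) ** 2)
        else:
            # Wrong true/false call: maximal error.
            errors.append(1.0)
    return math.sqrt(sum(errors) / len(errors))


# Invented gold labels and predictions for four rumours.
gold = {"r1": "true", "r2": "false", "r3": "unverified", "r4": "false"}
predictions = {
    "r1": ("true", 0.9),
    "r2": ("true", 0.6),
    "r3": ("false", 0.0),
    "r4": ("false", 0.5),
}
print(f"RMSE = {rmse_score(gold, predictions):.4f}")
```

Under a rule of this kind, a system is rewarded for being confident only when it is right, which is what distinguishes the measure from plain accuracy or macro-F1.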

Data comes from two distinct platforms:
- Twitter: 325 training rumours (145 true, 74 false, 106 unverified) with 5,568 annotated replies; 56 test rumours (22 true, 30 false, 4 unverified) with 1,066 annotated replies. Source tweets about natural disasters were selected from debunking websites (Snopes, Politifact).
- Reddit: 40 training threads (9 true, 24 false, 7 unverified) with 1,134 annotated replies; 25 test threads (9 true, 10 false, 6 unverified) with 806 annotated replies. Conversations are deeper and more complex than on Twitter, and rumours are often implicitly queried rather than asserted.
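As a quick consistency check, the per-platform counts above can be summed to recover the totals quoted elsewhere in this summary (446 source rumours, 8,574 annotated replies, 81 test rumours). The dictionary layout and variable names are illustrative only.

```python
# Counts as reported above: (true, false, unverified) rumours and
# annotated replies for each platform and split.
splits = {
    ("twitter", "train"): {"veracity": (145, 74, 106), "replies": 5568},
    ("twitter", "test"): {"veracity": (22, 30, 4), "replies": 1066},
    ("reddit", "train"): {"veracity": (9, 24, 7), "replies": 1134},
    ("reddit", "test"): {"veracity": (9, 10, 6), "replies": 806},
}

total_rumours = sum(sum(s["veracity"]) for s in splits.values())
total_replies = sum(s["replies"] for s in splits.values())
test_rumours = sum(
    sum(s["veracity"]) for (_, split), s in splits.items() if split == "test"
)
# Index 1 of each veracity tuple is the "false" count.
test_false = sum(
    s["veracity"][1] for (_, split), s in splits.items() if split == "test"
)

print(total_rumours, total_replies, test_rumours)  # 446 8574 81
print(f"false share of test rumours: {test_false / test_rumours:.1%}")
```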

Stance annotation for Twitter test data used crowdsourcing via FigureEight with 10 annotators per tweet and a 70% agreement threshold, achieving 76.2% macro-agreement. Reddit annotation required stricter annotator training (a 51-question quiz) because of the more complex conversational structure; it achieved 78% macro-agreement with an average of 3.84 annotations per post.
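A minimal sketch of threshold-based label aggregation follows, assuming a label is accepted when at least 70% of a post's annotators agree on it. What happens to posts below the threshold is not described here, so the sketch simply returns `None` for them; the function name and the example judgements are hypothetical.

```python
from collections import Counter


def aggregate(judgements, threshold=0.7):
    """Return (label, share); label is None if agreement is below threshold."""
    label, votes = Counter(judgements).most_common(1)[0]
    share = votes / len(judgements)
    return (label if share >= threshold else None), share


# Ten invented judgements per post, mirroring the 10-annotator setup.
clear_case = ["comment"] * 8 + ["query"] * 2
contested_case = ["deny"] * 6 + ["comment"] * 4

print(aggregate(clear_case))      # accepted: 80% agreement
print(aggregate(contested_case))  # rejected: 60% agreement
```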

Veracity labels were sourced from professional fact-checking organizations (Snopes, Politifact) for Twitter and verified by community experts for Reddit, departing from the 2017 manual annotation approach.

Results

Subtask A (Stance Detection):
- Best system (BLCU NLP): macro-F1 = 0.6187
- Three systems outperformed the branchLSTM baseline (macro-F1 = 0.4929): BLCU NLP, BUT-FIT, eventAI
- All 22 systems attempted this task
- 50% of systems exceeded the majority baseline (0.2234 macro-F1)
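The low majority-baseline score is a direct consequence of macro-averaging: a classifier that always predicts the dominant class earns a non-zero F1 on that class only, and the unweighted mean over four classes pulls the score down. The sketch below illustrates this on a toy label distribution that is invented for the example (80% "comment"); it does not reproduce the task's test set, and the helper names are hypothetical.

```python
from collections import Counter

LABELS = ["support", "deny", "query", "comment"]


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1; a class never predicted correctly scores 0."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy distribution (an assumption for illustration): 80% "comment".
gold = ["comment"] * 80 + ["support"] * 8 + ["deny"] * 6 + ["query"] * 6
majority = Counter(gold).most_common(1)[0][0]
pred = [majority] * len(gold)

print(f"accuracy = {accuracy(gold, pred):.4f}")  # 0.8000
print(f"macro-F1 = {macro_f1(gold, pred):.4f}")  # 0.2222
```

On this toy data the always-"comment" classifier reaches 80% accuracy but only about 0.22 macro-F1, which is why accuracy alone would overstate performance on heavily skewed stance labels.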

Subtask B (Veracity Prediction):
- Best system (eventAI): macro-F1 = 0.5765, RMSE = 0.6078
- Only eventAI beat both baseline systems (NileTMRG at 0.3089, branchLSTM at 0.3364)
- 13 of 22 systems attempted this harder task
- Over 60% of systems outperformed the majority baseline (0.2241 macro-F1)

Key observations:
- Systems tended to specialize in one task: the best Subtask A performer (BLCU NLP) ranked 4th in B; the best B performer (eventAI) ranked 3rd in A
- Neural network approaches dominated (21 of 22 systems); the Subtask B winner used an SVM+RF ensemble, not a pure neural model
- Effective architectures: pre-trained contextual embeddings (BERT, GPT, ELMo); inference-chain models considering full conversation sequences; ensemble methods combining multiple features
- RMSE showed stronger correlation with macro-F1 (−0.92) than accuracy (−0.77), validating confidence-score evaluation

Connections

- Extends RumourEval 2017 with new data and experimental design to encourage information-rich approaches
- Related to stance detection literature via SDQC classification framework
- Benchmark for propagation-based detection methods using conversation threads
- Dataset contribution to rumour verification research alongside FEVER and other shared tasks
- Demonstrated effectiveness of contextual embeddings (BERT, GPT) connects to broader NLP detection methodology

Notes

Strengths: Large-scale shared task with 70% participation increase from 2017; diverse data spanning two platforms reduces overfitting to Twitter-specific phenomena; rigorous crowdsourced annotation with high agreement thresholds. The inclusion of Reddit captures deeper, more exploratory discussions compared to Twitter's reactive replies.

Weaknesses: Class imbalance persists despite macro-F1 evaluation (80% of Twitter test data is "comment" stance; 50% "false" veracity). Additional context is limited: the context supplied in 2017 was removed due to time constraints, even though it had improved 2017 results. Results on the relatively small test sets (81 rumours for the veracity task) may not generalize broadly. The data is English-only despite prior work in other languages.

Future work: The paper identifies multilingual extension and richer temporal/discourse context modeling as critical next steps. Systems exploiting conversation structure (inference chains, discourse graphs) showed promise but remain underexplored relative to content-only approaches.