SemEval-2017 Task 8: RumourEval¶
Authors: Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, Arkaitz Zubiaga
Venue: Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval-2017), pages 69–76, August 3–4, 2017
DOI/URL: ACL Anthology
TL;DR¶
RumourEval is a shared task benchmark for detecting and verifying rumours in social media. It introduces two subtasks: (a) stance classification (SDQC: Support/Deny/Query/Comment) of replies to rumourous claims, and (b) veracity prediction (true/false) of source tweets. The task combines annotation schemes, datasets from 8–10 events, and results from 13 participating systems, establishing a foundation for rumour verification research on Twitter.
Contributions¶
- Annotation scheme for rumours and community reactions (SDQC framework)
- Large benchmark dataset with 297 training threads (8 events) and 28 test threads (10 events), totalling 5,599 tweets
- Two-subtask framework:
- Subtask A: Stance classification in conversation threads
- Subtask B: Veracity prediction of individual claims
- Evaluation metrics and baseline results from 13 systems, establishing benchmarks for the community
Method¶
Data collection and annotation¶
The task uses Twitter threads collected around newsworthy events. For each event, researchers:
- Sample likely rumourous tweets (high retweet count)
- Manually identify unverified claims (by journalist annotators)
- Collect all replies to each rumourous source tweet
- Annotate reply tweets via crowdsourcing (SDQC labels) and journalist consensus (veracity)
Training data: 297 threads across 8 events (Charlie Hebdo shooting, Ferguson unrest, Germanwings plane crash, etc.)
Test data: 28 additional threads including 2 new events (Hillary Clinton pneumonia rumour during 2016 election, Marina Joyce kidnapping rumour)
Subtask A: SDQC stance classification¶
Classify each reply tweet into one of four categories:
- Support (S): Author agrees with the rumour's veracity
- Deny (D): Author refutes the rumour
- Query (Q): Author requests additional evidence
- Comment (C): Author makes a comment without direct veracity stance
Replies form tree-structured conversations; context from preceding tweets is important.
Subtask B: Veracity prediction¶
Classify source tweets as true or false based on: - Closed variant: Tweet text only - Open variant: Tweet text + Wikipedia articles + archived linked URLs
Systems also report confidence scores (0–1); confidence of 0 indicates unverifiable claims.
Results¶
Subtask A (SDQC classification)¶
Best system (Turing): 78.4% accuracy using sequential LSTM classification accounting for tweet context.
| Rank | Team | Accuracy |
|---|---|---|
| 1 | Turing | 0.784 |
| 2 | UWaterloo | 0.780 |
| 3 | ECNU | 0.778 |
| Baseline (4-way) | – | 0.741 |
Key insight: Systems explicitly addressing class imbalance (especially over-representation of comments) performed best.
Subtask B (Veracity prediction)¶
Closed variant (best: IKM/NileTMRG, 53.6%): - Most systems underperformed the baseline (57.1%), suggesting veracity is AI-hard - Confidence calibration was weak even for correct predictions
Open variant (best: IITP, 39.3%): - Additional context (Wikipedia, archived URLs) did not substantially improve results - Indicates challenge lies beyond information retrieval
Connections¶
- Related to Stance Detection literature and SemEval-2016 Task 6 on general stance detection
- Builds on Pheme Project research framework for rumour analysis in social media
- Precedes Rumour Verification Surveys that synthesize follow-up work
- Foundational for Misinformation and fake news detection benchmarks in NLP
- Influenced subsequent work on Context Aware Rumour Detection and Temporal Rumour Evolution
Notes¶
Strengths: - Well-motivated task with clear real-world applications (journalism, crisis response) - Rigorous annotation protocol validated in prior work - Diverse global participant pool (13 teams from 4 continents) - Public dataset release with Twitter's compliance - Clear distinction from prior work (SemEval-2016 Task 6 on stance, SemEval-2015 Task 3 on CQA)
Limitations: - Class imbalance in SDQC (comments dominate, representing ~71% of training labels) - Inter-annotator agreement for replies only 62.2% (vs. 81.1% for source tweets), indicating task difficulty - Veracity prediction substantially harder than stance—even best systems underperform majority baselines - Limited to English Twitter; generalization to other languages/platforms unclear - 2016 Wikipedia snapshot used in open variant; temporal information incomplete for some events
Impact: This paper established RumourEval as the benchmark task in the field. Follow-up RumourEval editions (2018, later) extended the task, and the dataset remains widely cited in rumour detection research. The SDQC framework became standard in subsequent work on community-based claim verification.