SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours

Authors: Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, Arkaitz Zubiaga

Venue: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 69–76, August 3–4, 2017

DOI/URL: ACL Anthology

TL;DR

RumourEval is a shared-task benchmark for detecting and verifying rumours in social media. It introduces two subtasks: (a) stance classification (SDQC: Support/Deny/Query/Comment) of replies to rumourous claims, and (b) veracity prediction (true/false) of source tweets. The paper presents the annotation scheme, datasets covering 8 events in training and 10 in testing, and results from 13 participating systems, establishing a foundation for rumour verification research on Twitter.

Contributions

  • Annotation scheme for rumours and community reactions (SDQC framework)
  • Large benchmark dataset with 297 training threads (8 events) and 28 test threads (10 events), totalling 5,599 tweets
  • Two-subtask framework:
      • Subtask A: Stance classification in conversation threads
      • Subtask B: Veracity prediction of individual claims
  • Evaluation metrics, baselines, and results from 13 participating systems, establishing reference points for the community

Method

Data collection and annotation

The task uses Twitter threads collected around newsworthy events. For each event, researchers:

  1. Sample likely rumourous tweets (high retweet count)
  2. Manually identify unverified claims (by journalist annotators)
  3. Collect all replies to each rumourous source tweet
  4. Annotate reply tweets with SDQC labels via crowdsourcing, and annotate source tweets for veracity by journalist consensus

Training data: 297 threads across 8 events (Charlie Hebdo shooting, Ferguson unrest, Germanwings plane crash, etc.)

Test data: 28 additional threads including 2 new events (Hillary Clinton pneumonia rumour during 2016 election, Marina Joyce kidnapping rumour)

Subtask A: SDQC stance classification

Classify each reply tweet into one of four categories:

  • Support (S): Author agrees with the rumour's veracity
  • Deny (D): Author refutes the rumour
  • Query (Q): Author requests additional evidence
  • Comment (C): Author makes a comment without direct veracity stance

Replies form tree-structured conversations; context from preceding tweets is important.
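
The tree structure and the four SDQC labels can be sketched as a small data model. This is a minimal illustration, not the task's official data format: the `Tweet` class, its field names, the `context_path` helper, and the example tweets are all hypothetical. It shows how the chain of preceding tweets for a reply could be recovered, which is the context a classifier would need.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Stance(Enum):
    SUPPORT = "S"
    DENY = "D"
    QUERY = "Q"
    COMMENT = "C"


@dataclass
class Tweet:
    # Hypothetical fields; the released dataset uses its own JSON layout.
    tweet_id: str
    text: str
    parent_id: Optional[str] = None  # None marks the source tweet
    stance: Optional[Stance] = None  # SDQC label for replies


def context_path(tweets: Dict[str, Tweet], tweet_id: str) -> List[Tweet]:
    """Return the branch from the source tweet down to `tweet_id`."""
    path: List[Tweet] = []
    current = tweets.get(tweet_id)
    while current is not None:
        path.append(current)
        # Walk up to the parent; stop once the source tweet is reached.
        current = tweets.get(current.parent_id) if current.parent_id else None
    return list(reversed(path))


# Invented example thread: one source tweet and three replies.
thread = {t.tweet_id: t for t in [
    Tweet("1", "BREAKING: hostages taken at the cafe"),
    Tweet("2", "Is there a source for this?", parent_id="1", stance=Stance.QUERY),
    Tweet("3", "Police just confirmed it", parent_id="2", stance=Stance.SUPPORT),
    Tweet("4", "This is false, nothing happened", parent_id="1", stance=Stance.DENY),
]}

branch_ids = [t.tweet_id for t in context_path(thread, "3")]
```

Here `branch_ids` lists the source tweet, the query, and the supporting reply in order, so a sequential model could consume the branch one tweet at a time.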

Subtask B: Veracity prediction

Classify source tweets as true or false based on:

  • Closed variant: Tweet text only
  • Open variant: Tweet text + Wikipedia articles + archived linked URLs

Systems also report confidence scores (0–1); confidence of 0 indicates unverifiable claims.
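
To make the role of the confidence score concrete, here is a sketch of how label-plus-confidence predictions might be scored. The scoring rule is an assumption made for illustration and is not taken from the official scorer: a prediction counts as correct when its label matches the gold label with non-zero confidence, an unverified claim (gold `None`) counts as correct only at confidence 0, and the confidence error is a root-mean-square distance from an ideal confidence of 1 (correct label) or 0 (otherwise). The function name `score_veracity` is hypothetical.

```python
import math
from typing import List, Optional, Tuple

Prediction = Tuple[str, float]  # (label "true"/"false", confidence in [0, 1])


def score_veracity(
    preds: List[Prediction], gold: List[Optional[str]]
) -> Tuple[float, float]:
    """Return (accuracy, confidence RMSE) under the assumed rule above."""
    if len(preds) != len(gold) or not gold:
        raise ValueError("preds and gold must be non-empty and equal length")
    correct = 0
    squared_errors = []
    for (label, conf), gold_label in zip(preds, gold):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        if gold_label is None:
            # Unverified claim: the ideal output is confidence 0.
            is_correct = conf == 0.0
            target = 0.0
        else:
            is_correct = label == gold_label and conf > 0.0
            target = 1.0 if label == gold_label else 0.0
        correct += is_correct
        squared_errors.append((conf - target) ** 2)
    accuracy = correct / len(gold)
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    return accuracy, rmse


# Invented predictions: one confident hit, one confident miss,
# and one claim flagged as unverifiable via confidence 0.
preds = [("true", 0.9), ("false", 0.6), ("true", 0.0)]
gold = ["true", "true", None]
acc, rmse = score_veracity(preds, gold)
```

Under this rule a system is rewarded for lowering its confidence on claims it gets wrong, which is one way to read the calibration results reported for this subtask.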

Results

Subtask A (SDQC classification)

Best system (Turing): 78.4% accuracy using sequential LSTM classification accounting for tweet context.

Rank  Team              Accuracy
1     Turing            0.784
2     UWaterloo         0.780
3     ECNU              0.778
-     Baseline (4-way)  0.741

Key insight: Systems explicitly addressing class imbalance (especially over-representation of comments) performed best.
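
The narrow gap between the top systems and the baseline is easier to read with a small worked example. The sketch below assumes the 4-way baseline is a majority-class predictor (an assumption; this summary does not spell out how it was computed) and uses invented toy labels, not the RumourEval data. It shows how always predicting the dominant class gives high accuracy while recovering none of the minority classes.

```python
from collections import Counter
from typing import Dict, List


def majority_label(train_labels: List[str]) -> str:
    """Most frequent label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(pred: List[str], gold: List[str]) -> float:
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def per_class_recall(pred: List[str], gold: List[str]) -> Dict[str, float]:
    """Fraction of each gold class that was predicted correctly."""
    totals = Counter(gold)
    hits = Counter(g for p, g in zip(pred, gold) if p == g)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels only: comments ("C") dominate, as in the task.
train = ["C"] * 7 + ["S", "S", "D", "Q"]
test = ["C"] * 7 + ["S", "D", "Q"]

baseline_pred = [majority_label(train)] * len(test)
baseline_acc = accuracy(baseline_pred, test)
recalls = per_class_recall(baseline_pred, test)
```

On these toy labels the baseline reaches 70% accuracy with zero recall on support, deny, and query, which is why per-class measures are a useful complement to accuracy on imbalanced stance data.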

Subtask B (Veracity prediction)

Closed variant (best: IKM/NileTMRG, 53.6%):

  • Every system, including the best, scored below the baseline (57.1%), suggesting that veracity prediction is a very hard problem
  • Confidence calibration was weak even for correct predictions

Open variant (best: IITP, 39.3%):

  • Additional context (Wikipedia, archived URLs) did not improve results; the best open score is below the best closed score (53.6%)
  • Indicates that the challenge lies beyond information retrieval

Notes

Strengths:

  • Well-motivated task with clear real-world applications (journalism, crisis response)
  • Rigorous annotation protocol validated in prior work
  • Diverse global participant pool (13 teams from 4 continents)
  • Public dataset release in compliance with Twitter's terms
  • Clear distinction from prior work (SemEval-2016 Task 6 on stance, SemEval-2015 Task 3 on CQA)

Limitations:

  • Class imbalance in SDQC (comments dominate, representing ~71% of training labels)
  • Inter-annotator agreement for replies is only 62.2% (vs. 81.1% for source tweets), indicating task difficulty
  • Veracity prediction is substantially harder than stance classification: even the best systems underperform majority baselines
  • Limited to English Twitter; generalization to other languages and platforms is unclear
  • A 2016 Wikipedia snapshot is used in the open variant; temporal information is incomplete for some events

Impact: This paper established RumourEval as a widely used benchmark task for rumour verification. A follow-up edition, RumourEval 2019 (SemEval-2019 Task 7), extended the task, and the dataset remains widely cited in rumour detection research. The SDQC framework became standard in subsequent work on community-based claim verification.