Shared tasks and benchmarks

Shared tasks are organized evaluation campaigns where researchers develop and submit systems for a standardized problem using a common dataset, evaluation protocol, and set of metrics. They play a critical role in advancing the field by enabling direct comparison of approaches and identifying state-of-the-art performance.

Role in misinformation research

Shared tasks have become central to misinformation and rumour detection research, providing:

Benchmark datasets that enable reproducible research
Standardized evaluation metrics for fair comparison
Community participation from researchers across institutions and countries
Published results and analysis of participating systems
Reusable resources for follow-up research

Key characteristics

Common dataset: Participants develop systems on the same training and test data
Defined task(s): Clear problem formulation and evaluation criteria
Blind evaluation: Test set not accessible to participants during development
Shared results: Organizers publish leaderboards and comparative analysis
Workshop: Presentation and discussion of approaches at a venue (e.g., ACL, SemEval)
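
The common dataset, blind evaluation, and shared results described above can be illustrated with a minimal scoring sketch. The function names, the label set, the team names, and the choice of macro-averaged F1 are assumptions made for illustration; each shared task defines its own official metric and scorer.

```python
def macro_f1(gold, pred):
    """Macro-averaged F1 over the labels that occur in the gold standard."""
    if not gold:
        raise ValueError("gold must not be empty")
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = []
    for label in sorted(set(gold)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


def build_leaderboard(gold, submissions):
    """Score every submission against the hidden gold labels, best first."""
    ranked = [(team, macro_f1(gold, pred)) for team, pred in submissions.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical hidden test labels, held by the organizers only.
GOLD = ["true", "false", "true", "unverified"]

# Hypothetical predictions submitted by two teams for the same test items.
SUBMISSIONS = {
    "team_a": ["true", "false", "true", "unverified"],
    "team_b": ["true", "true", "true", "false"],
}

LEADERBOARD = build_leaderboard(GOLD, SUBMISSIONS)
for team, score in LEADERBOARD:
    print(f"{team}\t{score:.3f}")
```

Because participants only ever send predictions, the gold labels stay with the organizers, and every system is scored by the same function on the same items.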

Notable shared task venues

SemEval: Annual series of NLP shared tasks (Semantic Evaluation workshops)
CLEF: Conference and Labs of the Evaluation Forum (formerly the Cross-Language Evaluation Forum), with specialized fact-checking and rumour tracks such as the CheckThat! lab
FEVER: Fact Extraction and VERification shared task series

Contribution to the field

Shared tasks establish benchmarks that:

Enable progress in a subfield to be tracked over a decade or more
Lower barriers to entry for new researchers
Consolidate diverse approaches into comparable results
Guide future research by exposing the challenges that remain

Key benchmarks in misinformation detection

Clickbait Challenge 2017 (overview paper "The Clickbait Challenge 2017: Towards a Regression Model for Clickbait Strength"): 38,517 tweets with graded annotations; 13 submitted systems; reframes clickbait detection as a regression task that measures clickbait strength
SemEval-2019 Task 7 (RumourEval 2019): rumour verification shared task with SDQC (support, deny, query, comment) stance classification and veracity prediction; Twitter and Reddit data; 22 systems
SemEval-2017 Task 8 (RumourEval): benchmark task on rumour verification with SDQC stance classification and veracity prediction subtasks
SemEval-2019 Task 6 (OffensEval, "Identifying and Categorizing Offensive Language in Social Media"): offensive language detection on the OLID dataset; 115 participating systems across three hierarchical sub-tasks; shows that shared-task evaluation infrastructure scales to large participation in content moderation research
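
The regression framing mentioned for the Clickbait Challenge can be sketched as follows. The annotation values, the use of the mean annotator judgment as the target, scoring by mean squared error, and the 0.5 threshold are all assumptions for illustration, not the challenge's official scorer.

```python
def mean_judgment(judgments):
    """Collapse several graded annotations into one strength score."""
    return sum(judgments) / len(judgments)


def mean_squared_error(gold, pred):
    """Average squared difference between gold and predicted strength."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def to_binary(score, threshold=0.5):
    """Reduce a strength score to a clickbait / not-clickbait decision."""
    return score >= threshold


# Hypothetical graded annotations: four annotators per tweet, values in [0, 1].
ANNOTATIONS = [
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 0.5, 0.5],
]

# Hypothetical system output: one predicted strength per tweet.
PREDICTIONS = [0.25, 0.5, 1.0]

GOLD_STRENGTH = [mean_judgment(j) for j in ANNOTATIONS]
MSE = mean_squared_error(GOLD_STRENGTH, PREDICTIONS)

GOLD_BINARY = [to_binary(s) for s in GOLD_STRENGTH]
PRED_BINARY = [to_binary(s) for s in PREDICTIONS]

print("MSE:", round(MSE, 4))
print("binary decisions agree:", GOLD_BINARY == PRED_BINARY)
```

In this toy case the binary decisions match on every tweet, yet the squared error is non-zero. The regression view keeps information about how strongly a tweet baits, which a yes/no label discards.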