The Fact Extraction and VERification (FEVER) Shared Task

Authors: James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, Arpit Mittal

Venue: ACL 2018 Shared Task Workshop

TL;DR

This paper presents the results of the first FEVER shared task, which challenged 23 teams to classify whether human-written factoid claims are SUPPORTED or REFUTED by evidence retrieved from Wikipedia. The best system achieved a FEVER score of 64.21%, demonstrating the difficulty of combining evidence retrieval with natural language inference (NLI) for fact verification. Most participating systems followed a three-stage pipeline (document selection, sentence selection, NLI), though some jointly optimized evidence extraction and verification.

Contributions

- First large-scale shared task on fact extraction and verification, establishing a benchmark that combines evidence retrieval from a corpus with textual entailment reasoning
- The FEVER dataset of 185,445 human-generated claims with three labels (SUPPORTED, REFUTED, NOT ENOUGH INFO), manually verified against Wikipedia with annotated evidence sentences
- Comprehensive system analysis across 23 teams showing the dominant architectural pattern: document retrieval via named entities/Wikipedia search → sentence selection via neural classifiers → NLI via Enhanced LSTM (ESIM) or Transformers
- Evidence augmentation methodology: post-competition annotation of previously unseen correct evidence from 18,846 claims, identifying 308 new evidence sets and correcting 87 mislabeled claims

Method

The FEVER shared task requires systems to:

- Classify claim veracity into three classes: SUPPORTED (evidence fully supports the claim), REFUTED (evidence contradicts it), or NOT ENOUGH INFO (Wikipedia contains insufficient evidence)
- Retrieve supporting or refuting evidence as complete sets of Wikipedia sentences; a SUPPORTED or REFUTED prediction counts toward the primary FEVER score only if it is accompanied by a complete evidence set

Dataset construction: Claims were generated by paraphrasing Wikipedia facts and applying systematic mutations (some meaning-preserving, some meaning-altering). Annotators selected evidence sentences without knowing the source pages. The dataset was split by the Wikipedia page from which each claim was generated, yielding disjoint train/dev/test splits (training set: 80,035 SUPPORTED / 29,775 REFUTED / 35,639 NOT ENOUGH INFO).
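One way to obtain page-disjoint splits is to assign every source page to exactly one split, so that all claims generated from that page land together. The sketch below is an illustration only: the field name `source_page`, the hash-based assignment, and the split fractions are assumptions made for the example, not the authors' procedure.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List

def split_for_page(page: str, dev_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Deterministically map a page title to a split (hypothetical fractions)."""
    digest = hashlib.sha256(page.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + dev_frac:
        return "dev"
    return "train"

def split_claims(claims: List[dict]) -> Dict[str, List[dict]]:
    """Group claims by the split of their source page."""
    splits = defaultdict(list)
    for claim in claims:
        splits[split_for_page(claim["source_page"])].append(claim)
    return dict(splits)

# Toy data: 50 claims generated from 7 hypothetical pages.
claims = [{"claim": f"claim {i}", "source_page": f"Page_{i % 7}"} for i in range(50)]
splits = split_claims(claims)
```

Because the split is a function of the page title alone, no page can contribute claims to more than one split.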

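Scoring metric: The primary metric, the FEVER score, is label accuracy in which a SUPPORTED or REFUTED prediction counts as correct only if the system also returns at least one complete set of evidence. Precision, recall, and F₁ of the retrieved evidence are also reported to separate retrieval performance from reasoning performance.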
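The scoring rule can be sketched in a few lines. This is a minimal illustration rather than the official scorer: the function names, the dictionary fields (`label`, `evidence`, `evidence_sets`), and the representation of a sentence as a `(page, sentence_index)` pair are assumptions made for the example.

```python
from typing import List, Set, Tuple

Sentence = Tuple[str, int]  # (page title, sentence index); assumed representation

def is_correct(pred_label: str, pred_evidence: Set[Sentence],
               gold_label: str, gold_sets: List[Set[Sentence]]) -> bool:
    """The label must match and, for verifiable claims, at least one
    complete gold evidence set must be contained in the prediction."""
    if pred_label != gold_label:
        return False
    if gold_label == "NOT ENOUGH INFO":
        return True  # no evidence is required for this label
    return any(gold <= pred_evidence for gold in gold_sets)

def fever_score(predictions: List[dict], gold: List[dict]) -> float:
    hits = sum(
        is_correct(p["label"], set(p["evidence"]),
                   g["label"], [set(s) for s in g["evidence_sets"]])
        for p, g in zip(predictions, gold)
    )
    return hits / len(gold)

gold = [
    {"label": "SUPPORTED",
     "evidence_sets": [[("Page_A", 0)], [("Page_B", 3), ("Page_B", 4)]]},
    {"label": "REFUTED", "evidence_sets": [[("Page_C", 1)]]},
    {"label": "NOT ENOUGH INFO", "evidence_sets": []},
]
predictions = [
    # Covers the second gold set completely; the extra sentence does no harm.
    {"label": "SUPPORTED", "evidence": [("Page_B", 3), ("Page_B", 4), ("Page_X", 9)]},
    # Right label, but the retrieved sentence is not the gold one.
    {"label": "REFUTED", "evidence": [("Page_C", 2)]},
    {"label": "NOT ENOUGH INFO", "evidence": []},
]
score = fever_score(predictions, gold)  # 2 of 3 claims count
```

In this toy case all three labels are right, so plain label accuracy would be 100%, yet only two predictions count toward the score because the second lacks a complete evidence set.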

Results

Final leaderboard (86 total submissions from 23 teams):
- Rank 1 (UNC-NLP): 64.21% FEVER score, 70.91% evidence recall, 42.27% evidence precision
- Rank 2 (UCL Machine Reading Group): 62.52% FEVER score, 82.84% evidence recall, 22.16% evidence precision
- Rank 3 (Athene UKP TU Darmstadt): 61.58% FEVER score, 85.19% evidence recall, 23.61% evidence precision
- Baseline (published earlier): 27.45% FEVER score

Architectural patterns identified:

Document selection: Multi-step approaches using named entities, noun phrases, and capitalized expressions. Top teams reported using the Wikipedia search API or Lucene/Solr indexes. UNC-NLP ranked candidates by Wikipedia page viewership statistics.
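A minimal sketch of the capitalized-expression idea follows. It is not any team's implementation: the regular expression, the function names, and the toy `page_views` table (standing in for a title index with viewership counts) are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Maximal runs of capitalized tokens, e.g. "Barack Obama".
CAPITALIZED_SPAN = re.compile(r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*")

def candidate_mentions(claim: str) -> List[str]:
    """Return capitalized spans as candidate page titles."""
    return CAPITALIZED_SPAN.findall(claim)

def select_documents(claim: str, page_views: Dict[str, int], k: int = 5) -> List[str]:
    """Keep mentions that match a known title, ranked by (toy) page-view counts."""
    matches = {m for m in candidate_mentions(claim) if m in page_views}
    return sorted(matches, key=lambda title: page_views[title], reverse=True)[:k]

# Hypothetical title index with made-up view counts.
page_views = {"Barack Obama": 9000, "Hawaii": 12000, "Honolulu": 3000}
claim = "Barack Obama was born in Hawaii."
documents = select_documents(claim, page_views)
```

A real system would also need to handle lowercase mentions, redirects, and disambiguation pages, which this sketch ignores.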

Sentence selection: Three approaches were prevalent: keyword matching (token or named-entity overlap), supervised binary classification (Enhanced LSTM, Decomposable Attention), and similarity scoring (Word Mover's Distance, or cosine similarity over ELMo embeddings or TF-IDF vectors).
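The similarity-scoring variant can be illustrated with TF-IDF vectors and cosine similarity using only the standard library. The tokenizer, the smoothed IDF formula, and the function names below are choices made for this sketch and are not taken from any participating system.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(texts: List[str]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors with smoothed IDF over the given texts."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def rank_sentences(claim: str, sentences: List[str], k: int = 5) -> List[Tuple[float, str]]:
    """Score each candidate sentence against the claim and keep the top k."""
    vectors = tfidf_vectors([claim] + sentences)
    scored = [(cosine(vectors[0], vec), sent) for vec, sent in zip(vectors[1:], sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

claim = "Barack Obama was born in Hawaii."
sentences = [
    "Hawaii is a state of the United States.",
    "Obama was born in Honolulu, Hawaii.",
    "The Matrix is a 1999 science fiction film.",
]
ranked = rank_sentences(claim, sentences, k=3)
```

Here the sentence sharing the most terms with the claim ranks first, and the unrelated sentence scores zero because it shares no terms at all.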

Natural language inference: All submissions modeled NLI as supervised classification. Evidence combination strategies varied: UNC-NLP concatenated evidence into a single string; others classified evidence-claim pairs individually and then aggregated the results. Sentence representations ranged from non-lexical features (negation, antonyms, noun overlap) to contextualized embeddings (ELMo), in some cases supplemented with WordNet-derived features.
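The two evidence combination strategies differ only in where the classifier is called. The sketch below makes that concrete under loud assumptions: `NLIModel` is a hypothetical interface, `toy_model` is a keyword stub standing in for a trained classifier, and the aggregation rule (most confident non-NEI decision wins) is one plausible choice rather than a rule reported by any team.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]                # label -> probability
NLIModel = Callable[[str, str], Scores]  # (evidence_text, claim) -> Scores; hypothetical
NEI = "NOT ENOUGH INFO"

def classify_concatenated(model: NLIModel, claim: str, evidence: List[str]) -> str:
    """Strategy 1: join all evidence sentences into one string, classify once."""
    scores = model(" ".join(evidence), claim)
    return max(scores, key=scores.get)

def classify_and_aggregate(model: NLIModel, claim: str, evidence: List[str]) -> str:
    """Strategy 2: classify each (sentence, claim) pair, then aggregate.

    Assumed rule: return the most confident SUPPORTED/REFUTED decision,
    or NOT ENOUGH INFO if no sentence yields one.
    """
    best_label, best_prob = NEI, 0.0
    for sentence in evidence:
        scores = model(sentence, claim)
        label = max(scores, key=scores.get)
        if label != NEI and scores[label] > best_prob:
            best_label, best_prob = label, scores[label]
    return best_label

def toy_model(evidence_text: str, claim: str) -> Scores:
    """Keyword stub standing in for a trained NLI classifier."""
    if "honolulu" in evidence_text.lower():
        return {"SUPPORTED": 0.8, "REFUTED": 0.1, NEI: 0.1}
    return {"SUPPORTED": 0.2, "REFUTED": 0.1, NEI: 0.7}

claim = "Barack Obama was born in Hawaii."
evidence = ["Obama was born in Honolulu.", "Hawaii is a U.S. state."]
joint_label = classify_concatenated(toy_model, claim, evidence)
pairwise_label = classify_and_aggregate(toy_model, claim, evidence)
```

Concatenation lets the model see all evidence at once, which matters when a claim needs several sentences together; pairwise classification keeps inputs short but pushes the combination logic into the aggregation rule.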

Connections

- Extends SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours by adding evidence retrieval as a required subtask alongside stance detection
- Related to EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection via a shared focus on integrating evidence into fake news detection pipelines
- Precursor dataset to extensive fact verification research; cited in A Survey on Natural Language Processing for Fake News Detection as a major benchmark for textual entailment and evidence-based methods
- Related to Fact-Checking: A Meta-Analysis of What Works and for Whom for its formal definition of the fact-checking task and evaluation metrics

Notes

Strengths:
- First shared task to jointly require evidence retrieval from a large corpus and reasoning over the retrieved evidence, capturing a realistic fact-verification scenario
- Large annotated dataset (185,445 claims) with high-quality Wikipedia evidence; splitting by generating page tests generalization to unseen pages
- Post-competition evidence annotation (1,003 additional annotations) identified correctable label noise, improving dataset quality
- Transparent leaderboard with detailed system descriptions from 15 of 23 teams, enabling architectural comparison and reproducibility
- Scoring metric (FEVER score) rewards both retrieval of complete evidence and verification accuracy

Weaknesses:
- Constructing claims by paraphrasing Wikipedia facts makes them artificial; coverage of real-world misinformation types (satire, rumors, synthetic claims) is limited
- Wikipedia as the sole evidence source limits real-world applicability; many claims require cross-domain or out-of-date information
- Additional evidence was annotated post hoc by participants rather than during dataset construction; incomplete evidence coverage in the original annotations may have led to true performance being underestimated
- The NOT ENOUGH INFO (NEI) label definition is unclear; some claims labeled NEI should arguably be REFUTED (e.g., claims about recent events Wikipedia hadn't covered at annotation time)
- The top-performing system achieved only a 64.21% FEVER score despite the simple pipeline structure most systems shared, suggesting the fundamental difficulty of the joint task remains unresolved

Follow-up opportunities:
- Scaling evidence sources beyond Wikipedia (news archives, scientific papers, domain-specific databases)
- Multi-hop reasoning where evidence spans multiple Wikipedia pages or requires inference chains
- Out-of-domain generalization to non-Wikipedia claims and different evidence sources
- Real-time fact-checking systems addressing continuous information updates