The Fact Extraction and VERification (FEVER) Shared Task

Authors: James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, Arpit Mittal

Venue: First Workshop on Fact Extraction and VERification (FEVER), EMNLP 2018

arXiv: 1811.10971

TL;DR

This paper presents results of the first FEVER shared task competition, which challenged 23 teams to classify whether human-written factoid claims could be SUPPORTED or REFUTED using evidence retrieved from Wikipedia. The best system achieved 64.21% FEVER score, demonstrating the difficulty of joint evidence retrieval and natural language inference for fact verification; most participating systems followed a three-stage pipeline (document selection, sentence selection, NLI) though some jointly optimized evidence extraction and verification.

Contributions

  • First large-scale shared task on fact extraction and verification, establishing a benchmark combining evidence retrieval from a corpus with textual entailment reasoning
  • 185,445 human-generated claims dataset (FEVER) with three labels (SUPPORTED, REFUTED, NOT ENOUGH INFO), manually verified against Wikipedia with annotated evidence sentences
  • Comprehensive system analysis across 23 teams showing dominant architectural patterns: document retrieval via named entities/Wikipedia search → sentence selection via neural classifiers → NLI via Enhanced LSTM or Transformers
  • Evidence augmentation methodology: post-competition annotation of previously unseen correct evidence from 18,846 claims, identifying 308 new evidence sets and correcting 87 mislabeled claims

Method

The FEVER shared task requires systems to:

  1. Classify claim veracity into three classes: SUPPORTED (evidence fully supports the claim), REFUTED (evidence contradicts it), or NOT ENOUGH INFO (Wikipedia contains insufficient evidence)
  2. Retrieve the supporting or refuting evidence as complete sets of Wikipedia sentences. A SUPPORTED or REFUTED prediction counts toward the primary FEVER score only if it comes with at least one complete evidence set

Dataset construction: Claims were generated by paraphrasing facts from Wikipedia and then applying systematic mutations, some meaning-preserving and some meaning-altering. Annotators selected evidence sentences without knowing the source pages. The dataset was partitioned by the Wikipedia page each claim was generated from, so the train/dev/test splits are disjoint at the page level (training set: 80,035 SUPPORTED / 29,775 REFUTED / 35,639 NOT ENOUGH INFO).
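A page-level split of this kind can be sketched as below. This is only an illustration of the idea: the hashing scheme, the function names, and the 10% dev/test fractions are assumptions for the example, not the procedure the authors used.

```python
import hashlib
from collections import defaultdict


def assign_split(page_title, dev_frac=0.1, test_frac=0.1):
    """Map a source page to a split deterministically (hypothetical scheme)."""
    digest = hashlib.sha256(page_title.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + dev_frac:
        return "dev"
    return "train"


def split_by_page(claims):
    """Group (claim_text, source_page) pairs so that no page spans two splits."""
    splits = defaultdict(list)
    for text, page in claims:
        splits[assign_split(page)].append((text, page))
    return splits
```

Because the split is a function of the page title alone, every claim generated from a given page lands in the same partition, which is what keeps the partitions disjoint.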

Scoring metric: The primary metric is label accuracy conditioned on providing at least one complete set of evidence. Precision, recall, and F₁ of evidence are also reported to diagnose retrieval vs. reasoning performance.
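The metric described above can be sketched as follows. This is a minimal illustration rather than the official scorer: the dictionary layout, the function names, and the cap on predicted evidence sentences (`max_evidence`) are assumptions made for the example.

```python
NEI = "NOT ENOUGH INFO"


def is_strictly_correct(pred, gold, max_evidence=5):
    """True if the label matches and, for SUPPORTED/REFUTED claims,
    the predicted evidence covers at least one complete gold evidence set."""
    if pred["label"] != gold["label"]:
        return False
    if gold["label"] == NEI:
        return True  # no evidence is required for NOT ENOUGH INFO
    retrieved = set(pred["evidence"][:max_evidence])
    return any(set(group) <= retrieved for group in gold["evidence_sets"])


def label_accuracy(preds, golds):
    """Fraction of claims with the correct label, ignoring evidence."""
    return sum(p["label"] == g["label"] for p, g in zip(preds, golds)) / len(golds)


def fever_score(preds, golds):
    """Fraction of claims with the correct label and sufficient evidence."""
    return sum(is_strictly_correct(p, g) for p, g in zip(preds, golds)) / len(golds)


# Evidence items are (page, sentence_index) pairs; the data is invented.
golds = [
    {"label": "SUPPORTED", "evidence_sets": [[("A", 0)], [("B", 1), ("B", 2)]]},
    {"label": "REFUTED", "evidence_sets": [[("C", 3)]]},
    {"label": NEI, "evidence_sets": []},
]
preds = [
    {"label": "SUPPORTED", "evidence": [("B", 1), ("X", 0)]},  # incomplete set
    {"label": "REFUTED", "evidence": [("C", 3), ("C", 4)]},
    {"label": NEI, "evidence": []},
]
```

In this toy data all three labels are correct, but the first prediction retrieves only half of a two-sentence evidence set, so it counts toward label accuracy and not toward the FEVER score.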

Results

Final leaderboard (86 total submissions from 23 teams):

  • Rank 1 (UNC-NLP): 64.21% FEVER score, 70.91% evidence recall, 42.27% evidence precision
  • Rank 2 (UCL Machine Reading Group): 62.52% FEVER score, 82.84% evidence recall, 22.16% evidence precision
  • Rank 3 (Athene UKP TU Darmstadt): 61.58% FEVER score, 85.19% evidence recall, 23.61% evidence precision
  • Baseline (published earlier): 27.45% FEVER score

Architectural patterns identified:

Document selection: Multi-step approaches using named entities, noun phrases, and capitalized expressions. Top teams reported using Wikipedia search API or Lucene/Solr indexes. UNC-NLP ranked candidates by Wikipedia page viewership statistics.
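As a rough illustration of this stage, the sketch below pulls capitalized expressions out of a claim, keeps those that match a known page title, and orders them by page views. The regular expression, the `page_views` lookup table, and the function names are hypothetical and do not reproduce any team's system.

```python
import re

# One or more consecutive capitalized words, e.g. "Barack Obama".
CAPITALIZED_SPAN = re.compile(r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*")


def candidate_titles(claim):
    """Return capitalized expressions in the claim as candidate page titles."""
    return [match.group(0) for match in CAPITALIZED_SPAN.finditer(claim)]


def select_documents(claim, page_views, k=5):
    """Keep candidates that are known pages, most viewed first.

    page_views: dict mapping page title -> view count (an assumed resource).
    """
    found = [title for title in candidate_titles(claim) if title in page_views]
    unique = list(dict.fromkeys(found))  # de-duplicate, keep first occurrence
    return sorted(unique, key=lambda title: page_views[title], reverse=True)[:k]
```

Exact title matching is the weakest part of this sketch. A sentence-initial word such as "The" is also capitalized and becomes part of the span, which is one reason real systems combined several heuristics or queried a search index.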

Sentence selection: Three approaches were prevalent: keyword matching (token or named-entity overlap), supervised binary classification (Enhanced LSTM, Decomposable Attention), and similarity scoring (Word Mover's Distance, or cosine similarity over ELMo embeddings or TF-IDF vectors).
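The similarity-scoring variant can be illustrated with TF-IDF vectors and cosine similarity in plain Python. The tokenizer, the smoothed IDF formula, and the function names are choices made for this sketch, not details taken from any submission.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (term -> weight) over the given texts."""
    docs = [Counter(tokenize(text)) for text in texts]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(claim, sentences, k=5):
    """Return the k (score, sentence) pairs most similar to the claim."""
    vectors = tfidf_vectors([claim] + sentences)
    claim_vec, sentence_vecs = vectors[0], vectors[1:]
    scored = [(cosine(claim_vec, vec), s) for vec, s in zip(sentence_vecs, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

A sentence that shares no tokens with the claim scores exactly zero here, which shows the limitation of lexical overlap that the supervised and embedding-based approaches try to address.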

Natural language inference: All submissions modeled NLI as supervised classification. Evidence combination strategies varied: UNC-NLP concatenated evidence into a single string; others classified evidence-claim pairs individually and then aggregated the predictions. Sentence representations ranged from non-lexical features (negation, antonyms, noun overlap) to contextualized embeddings (ELMo) and lexical resources such as WordNet.
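The two evidence-combination strategies can be contrasted in a few lines. In this sketch `classify` stands in for any trained NLI model that returns label probabilities; the interface, the function names, and the sum-of-probabilities aggregation rule are all assumptions, since the text does not say how individual predictions were aggregated.

```python
NEI = "NOT ENOUGH INFO"


def predict_concatenated(claim, evidence, classify):
    """Join all evidence sentences into one string and classify once."""
    scores = classify(claim, " ".join(evidence))
    return max(scores, key=scores.get)


def predict_aggregated(claim, evidence, classify):
    """Classify each (claim, sentence) pair, then sum the probabilities.

    Summation is one possible aggregation rule, chosen here for simplicity.
    """
    if not evidence:
        return NEI
    totals = {}
    for sentence in evidence:
        for label, prob in classify(claim, sentence).items():
            totals[label] = totals.get(label, 0.0) + prob
    return max(totals, key=totals.get)
```

Concatenation lets the model see interactions between sentences at the cost of longer inputs, while per-sentence classification keeps inputs short but depends on the aggregation rule to reconcile conflicting predictions.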

Notes

Strengths:

  • First shared task to jointly require evidence retrieval from a large corpus and reasoning over the retrieved evidence, capturing a realistic fact-verification scenario
  • Large annotated dataset (185,445 claims) with high-quality Wikipedia evidence; splitting by generating page means the test set measures generalization to unseen pages
  • Post-competition evidence annotation (1,003 additional annotations) identified correctable label noise, improving dataset quality
  • Transparent leaderboard with detailed system descriptions from 15 of 23 teams, enabling architectural comparison and reproducibility
  • Scoring metric (FEVER score) rewards a system only when it both predicts the correct label and retrieves a complete evidence set, so it incentivizes retrieval and verification together

Weaknesses:

  • Claims were constructed by paraphrasing Wikipedia facts, so they are artificial; coverage of real-world misinformation types (satire, rumors, synthetic claims) is limited
  • Wikipedia as the sole evidence source limits real-world applicability; many claims require cross-domain information or information that Wikipedia does not keep current
  • Evidence was annotated post hoc by participants rather than during dataset construction; incomplete evidence coverage in the original annotations may have led to underestimates of true system performance
  • The NOT ENOUGH INFO label is not clearly defined; some claims labeled NEI should arguably be REFUTED (e.g., claims about recent events Wikipedia hadn't covered at annotation time)
  • The top-performing system reached only a 64.21% FEVER score despite the simple baseline structure, suggesting that the fundamental difficulty of the joint task remains unresolved

Follow-up opportunities:

  • Scaling evidence sources beyond Wikipedia (news archives, scientific papers, domain-specific databases)
  • Multi-hop reasoning where evidence spans multiple Wikipedia pages or requires inference chains
  • Out-of-domain generalization to non-Wikipedia claims and different evidence sources
  • Real-time fact-checking systems addressing continuous information updates