Skip to content
Liar, Liar Pants on Fire: A New Benchmark Dataset for Fake News Detection

Liar, Liar Pants on Fire: A New Benchmark Dataset for Fake News Detection

Author: William Yang Wang

Affiliation: University of California, Santa Barbara

Venue: arXiv, 2017 — arXiv:1705.00648

TL;DR

This paper introduces LIAR, a benchmark dataset of 12,836 manually labeled short statements from POLITIFACT.COM spanning 2007–2016, addressing the critical lack of labeled data for fake news detection. Each statement is annotated with fine-grained veracity labels (pants-fire, false, barely-true, half-true, mostly-true, true) and rich metadata including speaker identity, party affiliation, statement context, and speaker credibility history. The dataset is an order of magnitude larger than prior public datasets and enables developing computational models for both fake news detection and fact-checking.

Contributions

  • LIAR dataset: 12,836 human-labeled short statements with 6-way fine-grained labels, addressing the data bottleneck for fake news detection research.
  • Authentic real-world data: Statements collected from natural contexts (political debates, TV interviews, Facebook posts, tweets, news releases) rather than crowdsourced simulations, improving ecological validity.
  • Rich metadata: Each statement includes speaker name, party affiliation, job title, home state, and credit history (counts of previous inaccurate statements by label), enabling research on how speaker credibility influences detection.
  • Detailed justifications: Each label is accompanied by a lengthy analysis report from POLITIFACT.COM and links to supporting documents, enabling fact-checking research.
  • Empirical benchmark: Evaluation of multiple baselines including logistic regression, SVM, Bi-LSTM, and CNN models, with a novel hybrid CNN architecture that integrates text and metadata features.

Dataset Description

Statistics

  • Total statements: 12,836
  • Training set: 10,269
  • Validation set: 1,284
  • Testing set: 1,283
  • Average statement length: 17.9 tokens
  • Speaker distribution: Democrats (4,150), Republicans (5,687), None/social media (2,185)
  • Label distribution: Well-balanced except pants-fire (1,050). Other labels: 2,063–2,638 instances per label.
  • Temporal span: 2007–2016 (decade-long)

Labels

Six fine-grained veracity categories: 1. Pants-fire: Completely false or fabricated claims 2. False: False claims without significant truth 3. Barely-true: Claims with small elements of truth 4. Half-true: Partially true and partially false claims 5. Mostly-true: Claims that are mostly accurate with minor inaccuracies 6. True: Accurate statements

Metadata

Each statement includes: - Speaker name, party (Democrat/Republican/None) - Current job title and home state - Context/venue (e.g., TV interview, debate, Facebook post, tweet, news release) - Subject/topic (economy, healthcare, taxes, federal budget, education, jobs, state budget, elections, immigration, etc.) - Credit history: A vector of historical counts of previous statements labeled as pants-fire, false, barely-true, half-true, and mostly-true

Quality Control

Human agreement (Cohen's kappa) on a random sample of 200 statements: 0.82, indicating substantial inter-rater reliability.

Methodology

Statements were collected from POLITIFACT.COM's API. The dataset includes both the original detailed analysis reports and links to all supporting documents provided by PolitiFact editors, grounding each judgment in evidence.

Empirical Evaluation

Baselines Tested

  • Majority baseline: 0.204–0.208 accuracy
  • Logistic regression: 0.257 validation, 0.247 test
  • SVM: 0.258 validation, 0.255 test
  • Bi-LSTM: 0.223 validation, 0.233 test (overfitting)
  • CNN (text-only): 0.260 validation, 0.270 test
  • Hybrid CNN (text + all metadata): 0.247 validation, 0.274 test (best result)

Hybrid Architecture

The proposed neural network integrates text and metadata: 1. Text branch: Word embeddings → CNN with filter sizes (2,3,4) → max-pooling 2. Metadata branch: Metadata embeddings → CNN → max-pooling → Bi-LSTM 3. Concatenation of both representations 4. Fully connected layer with softmax for 6-way classification

The metadata-enhanced model achieved the best test performance (0.274 accuracy), demonstrating that speaker credibility history and context information improve fine-grained fakeness classification.

Connections

Notes

This paper makes an essential contribution by releasing a substantially larger labeled dataset than previously available. The move from crowdsourced simulated data (deceptive reviews) to authentic political statements from PolitiFact greatly improves relevance and generalizability. The inclusion of detailed analysis reports and supporting documents enables both supervised detection research and future work in automatic fact-checking over knowledge bases.

The key insight—that metadata features (especially speaker credit history) provide signal orthogonal to text—validates the intuition that veracity assessment benefits from context. The hybrid CNN architecture is relatively simple by modern standards, but the baseline improvements are meaningful and have inspired subsequent work on incorporating structured metadata and speaker reputation in detection models.

The dataset's focus on short statements (average 17.9 tokens) from political contexts reflects the real-world problem of fact-checking claims made in debates, interviews, and social media, though this specificity also limits generalization to other domains (science, health misinformation, etc.). Nevertheless, LIAR has become a standard benchmark and is widely cited in fact-checking and fake news detection literature.