GLTR: Statistical Detection and Visualization of Generated Text
Authors: Sebastian Gehrmann, Hendrik Strobelt, Alexander M. Rush
Venue: arXiv:1906.04043
TL;DR
Gehrmann et al. present GLTR, a visual tool for detecting machine-generated text by surfacing statistical artifacts in language model output. The key insight is that generation systems sample mostly from the head of the distribution (high-probability words), whereas human writers draw on a wider vocabulary that includes low-probability choices. In a human-subjects study, GLTR improved detection of fake text from 54% accuracy without the tool to 72% with it. The tool is open-source and publicly deployed at http://gltr.io.
Contributions
- Statistical detection framework: Three simple tests measure (1) the probability of each word, (2) the rank of the word in the predicted distribution, and (3) the entropy of the predicted distribution. They show that generated text concentrates on top-ranked (high-probability) tokens, while human text draws more often on the tail of the distribution.
- GLTR visualization tool: Interactive web interface (http://gltr.io) with color-coded token highlighting (green for top-10, yellow for top-100, red for top-1000, purple for tail) and statistical panels showing top-k bucket distribution, probability histogram, and entropy distribution.
- Human-subjects evaluation: 35 students in an NLP class improved from 54.2% detection accuracy without GLTR to 72.3% with the tool (an 18.1 percentage point improvement, p < 0.001); users reported that the tool taught them to recognize artifacts of text generation systems.
- Robustness across sampling schemes: Methods generalize from white-box scenarios (access to the generation model) to black-box scenarios, where a surrogate detector model such as BERT or GPT-2 scores text produced by an unknown large language model.
- Practical deployment: GLTR has accumulated 30,000 demo page views and 21,000 blog post views within the first month, used by researchers, journalists, and policymakers to assess generated content credibility.
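The color coding described in the visualization bullet above can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the function names `gltr_color` and `color_tokens` and the example ranks are hypothetical, and only the rank thresholds (10, 100, 1000) come from the description above.

```python
def gltr_color(rank):
    """Map a 1-based token rank to the highlight color described above."""
    if rank <= 10:
        return "green"
    if rank <= 100:
        return "yellow"
    if rank <= 1000:
        return "red"
    return "purple"


def color_tokens(tokens, ranks):
    """Pair each token with the color of its rank bucket."""
    return [(tok, gltr_color(r)) for tok, r in zip(tokens, ranks)]


# Hypothetical ranks, as a detector model might assign them.
tokens = ["The", "cat", "sat", "zygomorphically"]
ranks = [1, 42, 870, 15023]
highlighted = color_tokens(tokens, ranks)
for tok, color in highlighted:
    print(f"{tok}: {color}")
```

A passage that renders almost entirely green is one where the detector model would have predicted nearly every word, which is the pattern the tool asks readers to treat as suspicious.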
Method
Detection approach:
The paper assumes that language models generate fluent text by sampling from restricted regions of the distribution (top-k sampling, beam search, nucleus sampling, temperature-modulated sampling). Human writers, by contrast, draw on a much broader range of the distribution and often choose unexpected but contextually appropriate words. Three tests exploit this asymmetry:
- Test 1 (Word Probability): p_det(X_i = x_i | X_{1:i-1}), the predicted probability of the actual token. Generated text concentrates on high-probability tokens; human text includes lower-probability surprises.
- Test 2 (Token Rank): Rank of the actual token within the sorted probability distribution. Ranks can also be binned into buckets (top-1, top-5, top-10, top-100, top-1000, tail). The paper reports that human text uses tail words 2.41× more frequently than generated text under GPT-2 (1.67× under BERT).
- Test 3 (Entropy): -Σ_w p_det(X_i=w|X_{1:i-1}) log p_det(X_i=w|X_{1:i-1}). Low entropy means the model was very confident; generated text often exhibits high-entropy contexts followed by top-1 predictions (overconfidence), while human text uses words far down the ranking even in low-entropy contexts.
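The three tests above reduce to simple operations on one predicted distribution per position. The sketch below assumes the detector model's output is already available as a plain list of probabilities over a toy four-word vocabulary; the function name `gltr_token_stats` and the toy numbers are illustrative only, and entropy is computed in nats.

```python
import math


def gltr_token_stats(probs, actual_index):
    """Return (probability, rank, entropy) for one text position.

    probs: probabilities over the vocabulary (assumed to sum to 1).
    actual_index: index of the token that actually appears in the text.
    """
    # Test 1: probability the detector assigned to the observed token.
    p_actual = probs[actual_index]
    # Test 2: rank 1 is the most likely token; count strictly likelier tokens.
    rank = 1 + sum(1 for p in probs if p > p_actual)
    # Test 3: Shannon entropy of the whole distribution (natural log).
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return p_actual, rank, entropy


# Toy distribution over a 4-word vocabulary; the observed word is index 1.
toy = [0.5, 0.25, 0.125, 0.125]
p, r, h = gltr_token_stats(toy, 1)
print(p, r, round(h, 4))
```

In practice the list `probs` would come from a softmax over a language model's logits at position i; that step is omitted here to keep the example dependency-free.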
GLTR system:
Backend implemented in PyTorch; supports multiple detection models (BERT, GPT-2 117M deployed publicly). For GPT-2, computes p_det(X_i | X_{1:i-1}) at each position i. For BERT, iteratively masks each correct token and uses a bidirectional context of 30 words per side, computing p_det(X_i | X_{i-30:i-1}, X_{i+1:i+30}), estimating per-position probability despite BERT's masked-language-modeling design.
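The BERT windowing step can be illustrated without a model. The sketch below only builds the masked context for each position; it assumes whitespace-split words rather than BERT's actual subword tokens, and the helper name `masked_windows` and the `[MASK]` placeholder string are illustrative choices. Scoring each window with a masked language model is left out.

```python
def masked_windows(tokens, width=30, mask_token="[MASK]"):
    """Yield (position, window) pairs for per-position masked scoring.

    Each window holds up to `width` tokens on either side of position i,
    with the token at i replaced by `mask_token`.
    """
    for i in range(len(tokens)):
        left = tokens[max(0, i - width):i]
        right = tokens[i + 1:i + 1 + width]
        yield i, left + [mask_token] + right


# Small width so the truncation at the sequence edges is visible.
words = "the quick brown fox jumps over the lazy dog".split()
windows = dict(masked_windows(words, width=2))
print(windows[4])
```

Each window would then be passed to the masked language model, and the probability it assigns to the original token at the masked slot plays the role of p_det for that position.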
Results
Discriminative performance:
Classifiers were cross-validated on 50 articles from each of 3 generated sources (GPT-2 with temp=0.7, GPT-2 with top-40 truncation, Washington Post Heliograf) and 3 human sources (children's book corpus, New York Times, Nature/Science abstracts):
| Feature | AUC |
|---|---|
| Bag of Words (baseline) | 0.63 ± 0.11 |
| Test 1 (GPT-2 avg probability) | 0.71 ± 0.25 |
| Test 2 (GPT-2 top-k buckets) | 0.87 ± 0.07 |
| Test 1 (BERT avg probability) | 0.70 ± 0.27 |
| Test 2 (BERT top-k buckets) | 0.85 ± 0.09 |
The Test 2 (rank-based) feature substantially outperforms word features alone, confirming that distributional properties (not content) drive detection.
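To make the rank-bucket feature and the AUC metric concrete, the sketch below turns per-token ranks into bucket fractions and computes AUC directly from its pairwise definition. It is a simplified stand-in for the paper's setup: the documents and ranks are made up, the bucket edges (10, 100, 1000) are an assumption, and a single feature (the top-10 fraction) is used as the score instead of a trained classifier.

```python
def bucket_fractions(ranks, edges=(10, 100, 1000)):
    """Fraction of tokens in each rank bucket, plus a final tail bucket.

    Assumes `ranks` is a non-empty list of 1-based ranks.
    """
    counts = [0] * (len(edges) + 1)
    for r in ranks:
        for k, edge in enumerate(edges):
            if r <= edge:
                counts[k] += 1
                break
        else:
            counts[-1] += 1  # rank beyond the last edge: tail
    return [c / len(ranks) for c in counts]


def auc(scores_pos, scores_neg):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical per-token ranks for two generated and two human documents.
generated = [[1, 2, 3, 1, 5, 12, 1, 2], [1, 1, 4, 9, 2, 150, 3, 1]]
human = [[1, 40, 3, 2000, 7, 300, 2, 90], [5, 1, 1, 2, 3, 1, 60, 8]]

# Score each document by its top-10 fraction; "generated" is the positive class.
gen_scores = [bucket_fractions(d)[0] for d in generated]
hum_scores = [bucket_fractions(d)[0] for d in human]
print(auc(gen_scores, hum_scores))
```

The second human document is deliberately written with very predictable words, so it ties with the generated documents and pulls the AUC below 1.0. This mirrors why the bucket feature is strong but not perfect.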
Human-subjects study:
- Baseline (no interface): 54.2% accuracy, 56.0% trust rate (subjects over-trusting generated text)
- With GLTR overlay: 72.3% accuracy (18.1 percentage point improvement, p < 0.001)
- 42.1% of users reported the interface improved accuracy; 37.1% reported faster detection
- Qualitative observations: subjects noticed that generated text rarely uses synonyms or referential expressions, instead repeating entity names verbatim (violating centering theory); detected parallelism in sentence structure across Heliograf examples.
Connections
- Related to Defending Against Neural Fake News on the dual problem: Grover both generates neural fake news and serves as a detector for it, while GLTR focuses on detection; both exploit an understanding of language model artifacts.
- Shares detection philosophy with content-based detection via linguistic properties, but uniquely focuses on model-level distributional anomalies rather than text surface patterns.
- Complements stylometric approaches which use writing-style features; GLTR's distributional tests are orthogonal and more robust to author-specific variation.
- Foundational work in the generated-text detection category, establishing that white-box distributional analysis can transfer to black-box scenarios.
Notes
Strengths:
- Simple, interpretable, and deployable: three statistical tests requiring only a language model, no additional training.
- Bridges human and automated detection: the visualization teaches users what to notice about generated artifacts, improving human judgment even without the tool.
- Robustness: tests generalize from GPT-2 to BERT to other models, suggesting the assumption about sampling-from-head is robust across model families and sampling schemes.
- Real-world adoption: GLTR has seen substantial uptake, indicating practical demand for such tools.

Limitations:
- Assumes models use biased (head-biased) sampling. Adversarial generation could force sampling from the tail to evade detection, though at the cost of reduced coherence.
- Conditional generation (when given a hidden seed/prompt) may look different from unconditional text, and the paper only briefly explores this.
- Human evaluation limited to 35 students; generalization to general populations and non-English text remains unexplored.
- Evaluation is per-token level; no end-to-end detection benchmark (e.g., can the tool automatically classify full documents?).
Impact: The paper has been influential in demonstrating that simple statistical properties (distributional rank) can outperform a bag-of-words baseline on generated-text detection, and that interactive visualization can both teach human readers and inform automated systems.