DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature¶
Authors: Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, Chelsea Finn
Venue: PMLR 202, 2023
TL;DR¶
This paper identifies that text sampled from language models tends to occupy negative-curvature regions of the model's log-probability landscape, whereas human-written text occupies more neutral regions. DetectGPT leverages this observation to detect machine-generated text in a zero-shot manner by analyzing probability curvature, without requiring training data or knowledge of the generating model's architecture.
Contributions¶
- Identification of a fundamental property: LLM-generated text has lower log-probability curvature than human text under the generating model
- DetectGPT: a practical zero-shot algorithm for machine-generated text detection based on perturbation-induced curvature estimation
- Extensive empirical validation across multiple datasets (XSum, SQuAD, WritingPrompts) and language models (GPT-2, GPT-3, T5, GPT-NeoX)
- Demonstration that the method outperforms existing zero-shot baselines and is competitive with supervised approaches
Method¶
DetectGPT frames machine-generated text detection as a binary classification problem without access to source or target model internals. The core hypothesis is that a candidate passage x is likely generated by source model p_θ if it lies in negative-curvature regions of p_θ's log-probability landscape.
The method uses a perturbation function q(·|x) that generates semantically similar rewrites of x with minimal changes. For each perturbation x̃, the algorithm computes the log-probability under the source model p_θ. The perturbation discrepancy d(x, p_θ, q) measures the gap between the original passage and perturbed samples:
d(x, p_θ, q) ≈ log p_θ(x) - E_{x̃~q(·|x)} log p_θ(x̃)
This discrepancy approximates the negative Hessian trace of the log-probability function, a standard measure of curvature. The intuition is that if small perturbations of a passage significantly reduce its log-probability, the original passage lies in a negative-curvature region characteristic of LLM samples. A threshold on the perturbation discrepancy separates human-written from model-generated text.
DetectGPT uses T5-3B as the mask-filling perturbation function and evaluates on generic pre-trained language models without adaptation to specific domains.
Results¶
DetectGPT achieves strong performance across diverse settings:
- Zero-shot detection: 0.95 AUROC on GPT-2 detection; 0.84 AUROC on XSum; 0.84 AUROC on SQuAD
- Generalization: Outperforms existing zero-shot methods (rank, log-rank, entropy baselines) by substantial margins (0.01–0.1 AUROC improvement)
- Supervised comparison: Competitive with or exceeding supervised detectors trained on large datasets of real and generated passages
- Robustness: Maintains detection accuracy when the source model is unknown; DetectGPT performs better than existing baselines even when scoring with a different model than the source
Key ablations show that detection accuracy improves with mask-filling model size, number of perturbations, and that the method generalizes across languages and domains.
Connections¶
- Related to LLM Watermarking via shared interest in detecting model outputs, but DetectGPT requires no watermarking infrastructure
- Builds on prior work on Zero-shot learning approaches to detection without training data
- Cited by subsequent work on AI Generated Content Detection and forensic analysis of neural text
- Contrasts with supervised approaches like those in Deepfake Detection literature, offering a practical alternative when labeled data is unavailable
Notes¶
Strengths: The core observation about probability curvature is elegant and theoretically motivated. The paper carefully validates the hypothesis across multiple models, datasets, and ablations. The method is practical and requires only whitebox access to log-probabilities—a capability available in public APIs. Results are state-of-the-art for zero-shot detection at publication.
Limitations: The white-box assumption (access to model probabilities) limits applicability to closed-source or API-restricted models. Detection quality degrades for very long sequences due to T5's mask-filling approach. The method assumes the source model is accessible for scoring; cross-model detection is evaluated but shows some performance drop.
Significance: This work establishes probability curvature as a practical diagnostic for machine-generated text and opened a new direction for zero-shot detection without watermarking or auxiliary training. Given the rapid proliferation of LLM-generated content in media and educational contexts, this detection approach has clear practical importance for content verification and detection of AI-generated misinformation.