Toxicity Detection with Generative Prompt-based Inference¶
Authors: Yau-Shian Wang, Yingshan Chang Venue: arXiv, 2022 — arXiv:2205.12390
TL;DR¶
This paper proposes a generative variant of zero-shot prompt-based toxicity detection using language models, comparing it against discriminative approaches. By formulating the problem as estimating the probability of generating a test instance conditioned on toxicity-guided prompts, the method achieves better performance than discriminative classification on three social media toxicity datasets (SBIC, HateExplain, Civility) through careful prompt engineering.
Contributions¶
- Extends zero-shot toxicity detection from discriminative to generative classification formulations, showing that generative approaches allow more flexibility for LLMs to distinguish toxic from non-toxic content
- Demonstrates the sensitivity of both approaches to prompt wording and provides analysis of what makes effective prompts across different toxicity datasets
- Shows that generative classification with task demonstrations further improves performance, especially on positive class detection
- Provides ethical discussion of the dual-use risks and potential biases inherent in toxicity detection via language model self-diagnosis
Method¶
The authors formalize toxicity detection as zero-shot binary classification without task-specific supervision. They compare two prompt-based approaches:
Discriminative classification reformulates the problem as a cloze test, asking the LLM directly: "Does the above text contain y?" and interpreting the probability of "Yes" versus "No" as p(y|x).
Generative classification estimates the joint probability p(x,y) ∝ p(x|y) by computing the log likelihood that an LLM generates the test text conditioned on prompts describing toxicity: - Positive prompt: "Write a text that contains [toxicity description]" - Negative prompt: "Write a text that doesn't contain [toxicity description]"
The classification score is: s(y^p) = Σ_t log p_M(x_t | y^p, x_{<t})
The method is extended with task demonstration by sampling unlabeled examples from the target dataset to guide generation toward the target task distribution, computed as an ensemble score across multiple demonstrations.
Results¶
Main results on three datasets using GPT-2-large:
| Approach | SBIC macro-F1 | HX macro-F1 | Civility macro-F1 |
|---|---|---|---|
| Toxicity Lexicon | 0.55 | 0.49 | 0.65 |
| Embedding Similarity | 0.54 | 0.36 | 0.51 |
| Discriminative Cls | 0.55 | 0.50 | 0.53 |
| Generative Cls | 0.54 | 0.54 | 0.48 |
| Generative Cls+demo | 0.60 | 0.57 | 0.57 |
Generative classification with demonstrations achieves the best overall performance. The paper finds that model size has a positive correlation with generative approach performance but no clear trend for discriminative classification.
Connections¶
- Related to Cognitive reflection correlates with behavior on Twitter and other prompt-based NLP approaches through the use of language model capabilities without fine-tuning
- Relevant to dEFEND: Explainable Fake News Detection on explainability in toxicity detection systems
- Builds on prompt engineering insights from The echo chamber effect on social media and survey work on NLP detection methods
Notes¶
The paper makes important observations: (1) Both discriminative and generative methods are highly sensitive to prompt wording, requiring dataset-specific design; (2) the main advantage of generative classification is "improved contrastiveness"—longer generation sequences provide more opportunities to distinguish toxic from non-toxic; (3) qualitative LIME analysis shows the model sometimes relies on spurious correlations between benign tokens and toxicity labels learned during pre-training.
The authors honestly discuss ethical implications: the approach's reliance on language models trained on toxic content means the ability to detect toxicity stems from the same capacity that could generate harmful content (dual-use risk), and there is potential for demographic biases in the detection model itself. They note that future corpus filtering may eliminate the toxic co-occurrences that enable "self-diagnosis," which could degrade performance.