Toxicity detection
Automatic detection and classification of toxic, offensive, and abusive language in text. Toxicity detection covers a range of harmful content, including hate speech, harassment, insults, and offensive remarks. Detection is challenging because toxicity is often implicit, depends on cultural context, and is interpreted differently across communities.
Key papers
- Toxicity in ChatGPT: Analyzing Persona-assigned Language Models — Empirical analysis of toxicity generation (rather than detection) in ChatGPT; shows that persona assignment dramatically increases toxic output and reveals discriminatory bias across entity categories
- Hate Lingo — Linguistic and psycholinguistic characterization of directed vs. generalized hate speech using SAGE, LIWC, and frame semantics
- Toxicity Detection with Generative Prompt-based Inference — Zero-shot generative prompt-based toxicity detection on social media datasets
Related topics
- Hate speech detection — a specific category of toxic content targeting groups
- Content moderation — broader content moderation strategies and systems
- Implicit Bias — understanding demographic biases in toxicity detection models