Toxicity detection¶

Toxicity detection is the automatic identification and classification of toxic, offensive, and abusive language in text. It covers a range of harmful content, including hate speech, harassment, insults, and offensive remarks. Detection is challenging because toxicity is often implicit, depends on cultural context, and is defined differently across communities.

Key papers¶

Toxicity in ChatGPT: Analyzing Persona-assigned Language Models — Empirical analysis of toxicity generation (rather than detection) in ChatGPT; shows that persona assignment dramatically increases toxic output and reveals discriminatory bias across entity categories
Hate Lingo — Linguistic and psycholinguistic characterization of directed vs. generalized hate speech using SAGE, LIWC, and frame semantics
Toxicity Detection with Generative Prompt-based Inference — Zero-shot generative prompt-based toxicity detection on social media datasets

Related topics¶

Hate speech detection — a specific category of toxic content targeting groups
Content moderation — broader content moderation strategies and systems
Implicit Bias — understanding demographic biases in toxicity detection models