OLID
Offensive Language Identification Dataset (OLID) is a large-scale annotated corpus of 14,100 English tweets (13,240 training, 860 test) for offensive language detection and characterization. The dataset uses a three-level hierarchical annotation schema:
Level A (Offensive Language Detection): Binary classification of tweets as NOT (non-offensive) or OFF (offensive, including profanity, insults, threats).
Level B (Categorization of Offensive Language): Classification of offensive posts as TIN (targeted insult or threat directed at an individual, a group, or another target) or UNT (untargeted profanity and swearing).
Level C (Offensive Language Target Identification): Classification of targeted insults by target:
- IND (Individual; corresponds to cyberbullying)
- GRP (Group; corresponds to hate speech)
- OTH (Other: organization, situation, event, or issue)
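The hierarchy implies that lower-level labels exist only when the level above calls for them: Level B applies only to OFF tweets, and Level C only to TIN tweets. A minimal sketch of that constraint follows; the function name `is_valid_label` and the use of `None` for an absent label are illustrative assumptions, not part of the dataset's distribution format.

```python
from typing import Optional

# Label sets taken from the three annotation levels described above.
LEVEL_A = {"NOT", "OFF"}
LEVEL_B = {"TIN", "UNT"}
LEVEL_C = {"IND", "GRP", "OTH"}


def is_valid_label(a: str, b: Optional[str] = None, c: Optional[str] = None) -> bool:
    """Return True if (a, b, c) is consistent with the three-level hierarchy.

    `None` stands for "no label at this level" (an assumption of this sketch).
    """
    if a not in LEVEL_A:
        return False
    if a == "NOT":
        # Non-offensive tweets carry no Level B or Level C label.
        return b is None and c is None
    # a == "OFF": a Level B label is required.
    if b not in LEVEL_B:
        return False
    if b == "UNT":
        # Untargeted posts have no target to identify.
        return c is None
    # b == "TIN": a Level C label is required.
    return c in LEVEL_C
```

For example, `is_valid_label("OFF", "TIN", "GRP")` holds, while `is_valid_label("OFF", "UNT", "IND")` does not, because an untargeted post cannot have a target.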
Collection
Tweets were collected through the Twitter API using keywords designed to capture offensive content. Keywords were selected through trial annotation so that approximately 30% of the collected tweets would be offensive. The full dataset is split evenly between political keywords (50%) and non-political keywords (50%).
Annotation Quality
Annotation was crowdsourced on the Figure Eight platform, with the following quality-control measures:
- Only experienced annotators were used
- Test questions filtered out low-quality annotators
- Multiple annotators labeled each instance
- Majority voting resolved disagreements
- Inter-annotator agreement: Fleiss' κ = 0.83 for Level A
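For readers unfamiliar with the agreement statistic, the following is a sketch of how Fleiss' κ is computed from a table of per-item rating counts. It assumes every item is rated by the same number of annotators; the function name and input layout are illustrative and do not reproduce the actual OLID annotations.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for counts[i][j] = number of raters who put item i in category j.

    Every row must sum to the same number of raters n (n >= 2).
    """
    num_items = len(counts)
    n = sum(counts[0])
    if n < 2 or any(sum(row) != n for row in counts):
        raise ValueError("each item needs the same number of ratings (at least 2)")
    num_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (num_items * n)
        for j in range(num_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(p_i) / num_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance.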
For approximately 60% of tweets, the first two annotators agreed; the remaining instances received a third annotation.
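The two-then-three annotation procedure with majority voting can be sketched as below. This is an illustration under stated assumptions: `aggregate_label` and the `request_third` callback are hypothetical names, and the handling of a three-way split is a choice made for this sketch, not something the source specifies.

```python
from collections import Counter
from typing import Callable, Sequence


def aggregate_label(first_two: Sequence[str], request_third: Callable[[], str]) -> str:
    """Resolve one instance's label from two annotations, escalating on disagreement.

    `request_third` is a hypothetical callback that fetches a third annotation;
    it is only called when the first two annotators disagree.
    """
    first, second = first_two
    if first == second:
        # Agreement on the first two annotations: no third annotation needed.
        return first
    third = request_third()
    label, votes = Counter([first, second, third]).most_common(1)[0]
    if votes < 2:
        # Possible at Level C, where three annotators can pick three classes.
        raise ValueError("no majority among three annotations")
    return label
```

For a binary level such as Level A, three annotations always produce a majority, so the error branch only matters for the three-class Level C.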
Label Distribution
| Level | Class | Training | Test | Total |
|---|---|---|---|---|
| A | NOT | 8,840 | 620 | 9,460 |
| A | OFF | 4,400 | 240 | 4,640 |
| B (OFF only) | TIN | 3,876 | 213 | 4,089 |
| B (OFF only) | UNT | 524 | 27 | 551 |
| C (TIN only) | IND | 2,407 | 100 | 2,507 |
| C (TIN only) | GRP | 1,074 | 78 | 1,152 |
| C (TIN only) | OTH | 395 | 35 | 430 |
Note: Levels B and C are heavily imbalanced. UNT makes up 551 of the 4,640 offensive tweets, and OTH is particularly sparse, with 430 of the 4,089 targeted tweets.
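The imbalance can be quantified directly from the Total column of the table. The sketch below computes each class's share within its level and the ratio between the largest and smallest class; the dictionary layout and the function name `class_shares` are illustrative.

```python
from typing import Dict

# Totals (training + test) copied from the label distribution table.
COUNTS: Dict[str, Dict[str, int]] = {
    "A": {"NOT": 9460, "OFF": 4640},
    "B": {"TIN": 4089, "UNT": 551},
    "C": {"IND": 2507, "GRP": 1152, "OTH": 430},
}


def class_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each class's fraction of the instances labeled at one level."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for level, counts in COUNTS.items():
    shares = class_shares(counts)
    ratio = max(counts.values()) / min(counts.values())
    summary = ", ".join(f"{label}={share:.1%}" for label, share in shares.items())
    print(f"Level {level}: {summary} (largest/smallest = {ratio:.1f})")
```

By these totals, OFF accounts for about 32.9% of all tweets, UNT for about 11.9% of offensive tweets, and OTH for about 10.5% of targeted tweets.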
Baseline Performance
Experiments with SVM (on unigrams), BiLSTM, and CNN models:
- Level A (OFF/NOT): CNN best with macro-F1 0.80
- Level B (TIN/UNT): CNN best with macro-F1 0.69
- Level C (IND/GRP/OTH): CNN and BiLSTM tied at macro-F1 0.47; the OTH class was not learned (F1 = 0.00)
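Macro-F1 averages the per-class F1 scores with equal weight, which is why a class that is never predicted correctly pulls the Level C score down so sharply. The sketch below uses toy labels, not OLID predictions, and the function names are illustrative.

```python
from typing import Dict, Sequence


def per_class_f1(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> Dict[str, float]:
    """Compute the F1 score of each label, using 0.0 when a label has no TP, FP, or FN."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Toy data: the classifier never predicts OTH.
labels = ["IND", "GRP", "OTH"]
y_true = ["IND", "IND", "GRP", "GRP", "OTH", "OTH"]
y_pred = ["IND", "IND", "GRP", "GRP", "IND", "GRP"]
print(per_class_f1(y_true, y_pred, labels))
print(macro_f1(y_true, y_pred, labels))
```

With one of three classes at F1 = 0, macro-F1 cannot exceed 2/3 regardless of how well the other two classes are classified.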
Usage in the Wiki
Papers using OLID:
- Predicting the Type and Target of Offensive Posts in Social Media (original paper introducing OLID)
Notes
OLID became the official dataset for SemEval 2019 Task 6 (OffensEval). The hierarchical annotation scheme has since been applied to additional languages (Hindi, Arabic, etc.). The dataset's three-level structure unifies previously fragmented work on hate speech detection, cyberbullying identification, and general offensive language detection.