OLID

The Offensive Language Identification Dataset (OLID) is an annotated corpus of 14,100 English tweets (13,240 training, 860 test) for detecting and characterizing offensive language. The dataset uses a three-level hierarchical annotation schema:

Level A (Offensive Language Detection): Binary classification of tweets as NOT (non-offensive) or OFF (offensive, including profanity, insults, threats).

Level B (Categorization of Offensive Language): Classification of offensive posts as TIN (Targeted Insult: an insult or threat directed at an individual, a group, or another target) or UNT (Untargeted: profanity and swearing with no specific target).

Level C (Offensive Language Target Identification): Classification of targeted insults as:

  • IND (Individual; corresponds to cyberbullying)
  • GRP (Group; corresponds to hate speech)
  • OTH (Other: organization, situation, event, issue)
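Because each level is only annotated when the level above it has a particular value, a label triple can be checked for consistency. The following is a minimal sketch of that check; the function name `is_valid_annotation` and the use of `None` for an unannotated level are assumptions made for illustration, not part of the dataset's distribution format.

```python
from typing import Optional

LEVEL_A = {"NOT", "OFF"}
LEVEL_B = {"TIN", "UNT"}
LEVEL_C = {"IND", "GRP", "OTH"}


def is_valid_annotation(a: str, b: Optional[str] = None, c: Optional[str] = None) -> bool:
    """Check that a (Level A, Level B, Level C) triple respects the hierarchy.

    None marks a level that is not annotated (a convention assumed here).
    """
    if a not in LEVEL_A:
        return False
    if a == "NOT":
        # Non-offensive tweets carry no Level B or Level C label.
        return b is None and c is None
    # Offensive tweets must have a Level B label.
    if b not in LEVEL_B:
        return False
    if b == "UNT":
        # Untargeted posts have no target to identify.
        return c is None
    # Targeted insults must name a target type.
    return c in LEVEL_C
```

For example, `is_valid_annotation("OFF", "TIN", "GRP")` holds, while `is_valid_annotation("OFF", "UNT", "IND")` does not, since an untargeted post cannot have a target.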

Collection

Tweets were collected using the Twitter API with keywords designed to capture offensive content. Keywords were selected through trial annotation to keep the share of offensive content at approximately 30%. The full dataset is split evenly between political keywords (50%) and non-political keywords (50%).

Annotation Quality

Annotation was crowdsourced via the Figure Eight platform with the following quality control measures:

  • Only experienced annotators used
  • Test questions to filter low-quality annotators
  • Multiple annotators per instance
  • Majority voting for disagreements
  • Inter-annotator agreement: Fleiss' κ = 0.83 for Level A

For approximately 60% of tweets, the first two annotations agreed; the remaining instances received a third annotation.
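The two-then-three annotation procedure with majority voting can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the Figure Eight pipeline itself: the function name `aggregate` and the convention of returning `None` when no strict majority exists are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def aggregate(labels: Sequence[str]) -> Optional[str]:
    """Return the strict-majority label, or None if another annotation is needed."""
    if len(labels) < 2:
        # At least two annotations are collected per instance.
        return None
    top_label, top_count = Counter(labels).most_common(1)[0]
    # Two agreeing annotators, or two out of three, form a strict majority.
    if top_count > len(labels) / 2:
        return top_label
    return None
```

With two agreeing labels the instance is settled immediately; with a disagreement such as `["OFF", "NOT"]` the function returns `None`, signalling that a third annotation is required. For the binary Level A a third label always breaks the tie, whereas for the three-way Level C three annotators could in principle all disagree, a case this sketch leaves unresolved.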

Label Distribution

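| Level | Class | Training | Test | Total |
|---|---|---|---|---|
| A | NOT | 8,840 | 620 | 9,460 |
| A | OFF | 4,400 | 240 | 4,640 |
| B (OFF only) | TIN | 3,876 | 213 | 4,089 |
| B (OFF only) | UNT | 524 | 27 | 551 |
| C (TIN only) | IND | 2,407 | 100 | 2,507 |
| C (TIN only) | GRP | 1,074 | 78 | 1,152 |
| C (TIN only) | OTH | 395 | 35 | 430 |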

Note: Significant class imbalance in Levels B and C; OTH class particularly sparse.
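The imbalance can be quantified directly from the training counts in the table. The snippet below is a small sketch; the names `TRAIN_COUNTS` and `class_shares` are introduced here for illustration only.

```python
# Training-set counts copied from the label distribution table.
TRAIN_COUNTS = {
    "A": {"NOT": 8840, "OFF": 4400},
    "B": {"TIN": 3876, "UNT": 524},
    "C": {"IND": 2407, "GRP": 1074, "OTH": 395},
}


def class_shares(counts):
    """Return each class's fraction of the instances annotated at that level."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for level, counts in TRAIN_COUNTS.items():
    shares = class_shares(counts)
    summary = ", ".join(f"{label}: {share:.1%}" for label, share in shares.items())
    print(f"Level {level}: {summary}")
```

In the training set, OFF makes up about 33% of Level A, UNT about 12% of Level B, and OTH about 10% of Level C. The counts also nest as the hierarchy implies: the Level B classes sum to the OFF count, and the Level C classes sum to the TIN count.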

Baseline Performance

Experiments with SVM (on unigrams), BiLSTM, and CNN models:

  • Level A (OFF/NOT): CNN best with macro-F1 0.80
  • Level B (TIN/UNT): CNN best with macro-F1 0.69
  • Level C (IND/GRP/OTH): CNN and BiLSTM tied at macro-F1 0.47; OTH class unlearnable (F1=0.00)
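Macro-F1 averages per-class F1 scores with equal weight, so a class that a model never predicts correctly pulls the average down regardless of how rare it is. The sketch below implements the metric in plain Python and applies it to a small made-up example; the toy labels are hypothetical and are not OLID predictions or results.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: the classifier never predicts OTH.
y_true = ["IND", "IND", "GRP", "GRP", "OTH", "OTH"]
y_pred = ["IND", "IND", "GRP", "GRP", "IND", "GRP"]
score = macro_f1(y_true, y_pred, ["IND", "GRP", "OTH"])
```

Here IND and GRP each score 0.8 while OTH scores 0.0, so the macro average is roughly 0.53 even though four of six predictions are correct. The same effect limits the Level C macro-F1 when OTH has an F1 of 0.00.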

Usage in the Wiki

Papers using OLID:

  • Predicting the Type and Target of Offensive Posts in Social Media (original paper introducing OLID)

Notes

OLID became the official dataset for SemEval 2019 Task 6 (OffensEval). The hierarchical annotation scheme has since been applied to additional languages (Hindi, Arabic, etc.). The dataset's three-level structure unifies previously fragmented work on hate speech detection, cyberbullying identification, and general offensive language detection.