OLID

Offensive Language Identification Dataset (OLID) is a large-scale annotated corpus of 14,100 English tweets (13,240 training, 860 test) for offensive language detection and characterization. The dataset uses a three-level hierarchical annotation schema:

Level A (Offensive Language Detection): Binary classification of tweets as NOT (non-offensive) or OFF (offensive, including profanity, insults, threats).

Level B (Categorization of Offensive Language): Classification of offensive posts as TIN (targeted insult or threat directed at an individual, a group, or another entity) or UNT (untargeted profanity and swearing).

Level C (Offensive Language Target Identification): Classification of targeted insults as:
- IND (individual; corresponds to cyberbullying)
- GRP (group; corresponds to hate speech)
- OTH (other: an organization, situation, event, or issue)
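The hierarchy implies that a lower-level label only exists when the level above allows it. A minimal sketch of that constraint follows; the function name and the use of `None` for "level not annotated" are assumptions made for illustration, not the dataset's file format.

```python
from typing import Optional

# Label sets per level, taken from the schema described above.
LEVEL_A = {"NOT", "OFF"}
LEVEL_B = {"TIN", "UNT"}
LEVEL_C = {"IND", "GRP", "OTH"}


def is_valid_annotation(a: str, b: Optional[str] = None,
                        c: Optional[str] = None) -> bool:
    """Return True if (a, b, c) respects the three-level hierarchy.

    None means "no label at this level" (a convention assumed here).
    """
    if a not in LEVEL_A:
        return False
    if a == "NOT":
        # Non-offensive tweets carry no Level B or Level C label.
        return b is None and c is None
    if b not in LEVEL_B:
        # Offensive tweets must be categorized at Level B.
        return False
    if b == "UNT":
        # Untargeted posts have no target to identify.
        return c is None
    # Targeted posts must name a target type.
    return c in LEVEL_C


if __name__ == "__main__":
    print(is_valid_annotation("OFF", "TIN", "GRP"))
    print(is_valid_annotation("NOT", "TIN"))
```

A check like this can be run over a loaded label file to catch rows where, for example, a NOT tweet carries a target label.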

Collection

Tweets were collected using the Twitter API with keywords chosen to surface offensive content. The keywords were selected through trial annotation so that roughly 30% of the collected tweets were offensive. The full dataset is split evenly between tweets retrieved with political keywords (50%) and with non-political keywords (50%).

Annotation Quality

Annotation was crowdsourced on the Figure Eight platform, with the following quality-control measures:
- Only experienced annotators were used
- Test questions filtered out low-quality annotators
- Each instance was labeled by multiple annotators
- Disagreements were resolved by majority vote
- Inter-annotator agreement: Fleiss' κ = 0.83 for Level A
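For readers unfamiliar with the agreement statistic, here is a minimal sketch of how Fleiss' κ is computed from a table of rating counts. The ratings in the usage example are invented toy values (four items, five hypothetical raters); they are not OLID annotations and do not reproduce the 0.83 figure.

```python
from typing import Sequence


def fleiss_kappa(table: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where table[i][j] is the number of
    raters who assigned item i to category j.

    Every item must have the same number of ratings. The result is
    undefined (division by zero) if all ratings fall in one category.
    """
    n_items = len(table)
    n_raters = sum(table[0])
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every item needs the same number of ratings")

    # Mean observed agreement: share of agreeing rater pairs per item.
    pair_count = n_raters * (n_raters - 1)
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / pair_count for row in table
    ) / n_items

    # Expected agreement by chance, from overall category proportions.
    n_cats = len(table[0])
    p_j = [
        sum(row[j] for row in table) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Toy data: columns are (NOT, OFF) counts per tweet.
    toy = [[5, 0], [0, 5], [4, 1], [5, 0]]
    print(round(fleiss_kappa(toy), 3))
```

Values close to 1 indicate agreement well above chance, which is how the reported Level A figure should be read.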

For approximately 60% of tweets, the first two annotations agreed; the remaining instances received a third annotation.
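The two-then-three annotation flow can be sketched as follows. This is an illustrative reading of the procedure described above, not the platform's actual logic; `adjudicate` and its return convention (`None` meaning "request another annotation") are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def adjudicate(labels: Sequence[str]) -> Optional[str]:
    """Resolve the annotations collected so far for one tweet.

    Returns the agreed label, or None if another annotation is needed.
    """
    if len(labels) < 2:
        return None
    if len(labels) == 2:
        # Two annotations are accepted only when they agree.
        return labels[0] if labels[0] == labels[1] else None
    # With three or more annotations, take a strict majority.
    top, top_count = Counter(labels).most_common(1)[0]
    return top if top_count > len(labels) / 2 else None


if __name__ == "__main__":
    print(adjudicate(["OFF", "OFF"]))
    print(adjudicate(["OFF", "NOT"]))
    print(adjudicate(["OFF", "NOT", "NOT"]))
```

Note that at Level C, which has three classes, three annotations can still disagree completely; the sketch returns `None` in that case rather than guessing.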

Label Distribution

Level	Class	Training	Test	Total
A	NOT	8,840	620	9,460
A	OFF	4,400	240	4,640
B (OFF only)	TIN	3,876	213	4,089
B (OFF only)	UNT	524	27	551
C (TIN only)	IND	2,407	100	2,507
C (TIN only)	GRP	1,074	78	1,152
C (TIN only)	OTH	395	35	430

Note: Levels B and C are heavily imbalanced. UNT accounts for 551 of the 4,640 offensive tweets (about 12%), and OTH for 430 of the 4,089 targeted tweets (about 11%).
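The counts in the table can be cross-checked against the hierarchy: the Level B rows should sum to the OFF row, and the Level C rows to the TIN row. A short sketch, with the figures copied from the table and helper names chosen for illustration:

```python
# (train, test) counts copied from the Label Distribution table.
COUNTS = {
    "A": {"NOT": (8840, 620), "OFF": (4400, 240)},
    "B": {"TIN": (3876, 213), "UNT": (524, 27)},
    "C": {"IND": (2407, 100), "GRP": (1074, 78), "OTH": (395, 35)},
}


def split_sums(level):
    """Return (train_total, test_total) summed over a level's classes."""
    train = sum(tr for tr, _ in COUNTS[level].values())
    test = sum(te for _, te in COUNTS[level].values())
    return train, test


def class_shares(level):
    """Return each class's share of the level's combined train+test total."""
    totals = {cls: tr + te for cls, (tr, te) in COUNTS[level].items()}
    grand = sum(totals.values())
    return {cls: n / grand for cls, n in totals.items()}


if __name__ == "__main__":
    # Each level must account for exactly the parent class above it.
    assert split_sums("B") == COUNTS["A"]["OFF"]
    assert split_sums("C") == COUNTS["B"]["TIN"]
    for level in COUNTS:
        shares = class_shares(level)
        print(level, {cls: f"{s:.1%}" for cls, s in shares.items()})
```

The per-class shares make the imbalance concrete and can serve as a starting point for class weights or stratified sampling.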

Baseline Performance

Baseline experiments used an SVM (on unigrams), a BiLSTM, and a CNN:

Level A (OFF/NOT): CNN best with macro-F1 0.80
Level B (TIN/UNT): CNN best with macro-F1 0.69
Level C (IND/GRP/OTH): CNN and BiLSTM tied at macro-F1 0.47; neither model learned the OTH class (F1 = 0.00)
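Macro-F1 averages the per-class F1 scores with equal weight, so a class that is never predicted correctly pulls the score down regardless of how rare it is. The sketch below computes it from scratch; the gold and predicted labels are invented toy values, not OLID data or model output.

```python
from typing import Dict, Sequence


def per_class_f1(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """F1 for every class that appears in the gold labels."""
    scores = {}
    for cls in sorted(set(gold)):
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when nothing matches.
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Toy example: the classifier never predicts OTH.
    gold = ["IND", "IND", "GRP", "GRP", "OTH"]
    pred = ["IND", "IND", "GRP", "IND", "IND"]
    print(per_class_f1(gold, pred))
    print(macro_f1(gold, pred))
```

In the toy example, IND and GRP each score about 0.67, but the zero for OTH drags the macro average down to about 0.44, the same mechanism that caps the Level C baselines.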

Usage in the Wiki

Papers using OLID:
- Predicting the Type and Target of Offensive Posts in Social Media (the original paper introducing OLID)

Notes

OLID was the official dataset for SemEval 2019 Task 6 (OffensEval). The hierarchical annotation scheme has since been applied to additional languages, including Hindi and Arabic. Its three-level structure unifies previously fragmented work on hate speech detection, cyberbullying identification, and general offensive language detection.