Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter

Author: Zeerak Waseem

Venue: EMNLP Workshop on Natural Language Processing and Computational Social Science, 2016

TL;DR

This paper compares hate speech detection systems trained on expert annotations (from feminist and anti-racism activists) versus amateur annotations (crowdsourced via CrowdFlower) on 6,909 tweets. Systems trained on expert annotations significantly outperform those trained on amateur data, and amateur annotators systematically label more content as hateful. The work demonstrates that annotation expertise fundamentally shapes the quality and reliability of hate speech datasets.

Contributions

  • Annotation of 6,909 tweets for hate speech by both expert and amateur annotators, extending Waseem and Hovy (2016) by 4,033 tweets
  • Empirical comparison of classification models trained on expert versus amateur annotations
  • Introduction of an intersectional annotation scheme that includes a "both racism and sexism" label
  • Analysis of inter-annotator agreement and systematic biases between annotation groups
  • Evaluation showing that full agreement from amateurs can yield annotations comparable to expert work

Method

The paper recruits two annotation groups for the same tweet dataset:

Expert annotators: Feminist and anti-racism activists, who were given the explicit tests for identifying hate speech from Waseem and Hovy (2016). They could skip a tweet if uncertain and could label non-English content as "Noise".

Amateur annotators: CrowdFlower workers recruited without any selection criteria and presented with the 6,909 expert-annotated tweets. They had no skip option and were not shown tweets that the experts had skipped.

The labeling scheme has four categories: racism, sexism, neither, and both (the last capturing intersectional oppression, following Crenshaw, 1989).

Classification features span textual (character n-grams, token n-grams, length) and extra-linguistic (POS tags, Brown clusters, gender probability, Author Historical Salient Terms) information. The authors use 5-fold cross-validation for evaluation and grid search for feature selection.
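As a rough illustration of the textual features and the evaluation split described above, the following sketch counts character n-grams and builds 5 folds in plain Python. The n-gram range (1 to 4), the lowercasing, and the contiguous unshuffled folds are assumptions made for the example; the summary does not state the paper's exact settings, and the function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=4):
    """Count character n-grams of lengths n_min..n_max in a lowercased text.

    The range 1..4 is an assumed default, not a setting taken from the paper.
    """
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists for k contiguous folds.

    No shuffling or stratification is done here; that is a simplification.
    """
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


if __name__ == "__main__":
    print(char_ngrams("abc", 1, 2))
    for train, test in k_fold_indices(12, k=5):
        print(len(train), test)
```

In a real pipeline the n-gram counts would be turned into a feature matrix and each fold would train and score a classifier; the sketch only shows the two building blocks.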

Results

Annotation agreement: Amateur inter-annotator agreement (κ = 0.57) was lower than expert agreement (κ = 0.70 for full agreement). Agreement with the original Waseem and Hovy (2016) labels was very low (κ = 0.14), mostly because amateurs did not label as hateful the content that the original dataset marked as hateful.
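For readers unfamiliar with κ, here is a minimal sketch of Cohen's kappa for two label sequences over the same tweets. It assumes the two-rater Cohen variant; the summary does not say which κ statistic the paper computes for each comparison, and the example labels are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Invented toy labels, not data from the paper.
    rater_1 = ["sexism", "neither", "neither", "racism", "neither", "neither"]
    rater_2 = ["sexism", "neither", "sexism", "neither", "neither", "neither"]
    print(round(cohen_kappa(rater_1, rater_2), 3))
```

Because κ subtracts chance agreement, two annotators who mostly choose "neither" can agree on most tweets and still receive a low score, which matters for a dataset where hateful labels are rare.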

Label distribution: Amateur majority-voted labels showed 5.80% racism and 19.00% sexism, whereas expert labels showed 1.41% racism and 13.08% sexism. Under full agreement, the amateur distribution moved closer to the expert one (0.69% racism, 14.02% sexism).
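The two aggregation rules compared above can be sketched as follows. The tie handling (a tweet without a strict majority is dropped) and the helper names are assumptions for illustration; the summary does not describe how the paper resolves ties, and the votes are invented.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by a strict majority of annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def full_agreement(votes):
    """Return the label only if every annotator chose it, else None."""
    return votes[0] if len(set(votes)) == 1 else None


def label_distribution(labels):
    """Percentage of each label among the items that received a label."""
    kept = [label for label in labels if label is not None]
    if not kept:
        return {}
    return {label: 100 * count / len(kept) for label, count in Counter(kept).items()}


if __name__ == "__main__":
    # Invented votes from three hypothetical amateur annotators per tweet.
    votes_per_tweet = [
        ["sexism", "sexism", "neither"],
        ["neither", "neither", "neither"],
        ["racism", "sexism", "neither"],
        ["neither", "neither", "sexism"],
    ]
    print(label_distribution([majority_vote(v) for v in votes_per_tweet]))
    print(label_distribution([full_agreement(v) for v in votes_per_tweet]))
```

Full agreement keeps fewer tweets but only those on which no annotator dissented, which is one plausible reading of why its label distribution sits closer to the expert one.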

Classification performance: On expert annotations, the best model achieved F1 = 91.19 using character n-grams, token n-grams, skip-grams, length, and Brown clusters. On amateur majority-voted annotations, the best model reached F1 = 83.88 with a different feature set. When tested on the Waseem and Hovy (2016) data, the system achieved F1 = 70.05 on the binary task but only F1 = 53.43 on the multi-class task, underperforming the original paper, largely due to false positives.
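To make the F1 figures easier to interpret on imbalanced labels, the sketch below computes per-class F1 and two common averages: macro (unweighted mean over classes) and support-weighted. Which averaging the paper uses for each reported score is not stated in this summary, so both are shown as assumptions, and the gold and predicted labels are invented.

```python
from collections import Counter


def per_class_f1(gold, pred):
    """F1 for every label that appears in the gold or predicted labels."""
    scores = {}
    for label in set(gold) | set(pred):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(gold, pred):
    """Mean of per-class F1 weighted by each class's gold support."""
    scores = per_class_f1(gold, pred)
    support = Counter(gold)
    return sum(scores[label] * support[label] for label in scores) / len(gold)


if __name__ == "__main__":
    # Invented imbalanced example: 8 "neither" tweets and 2 "sexism" tweets.
    gold = ["neither"] * 8 + ["sexism"] * 2
    pred = ["neither"] * 8 + ["neither", "sexism"]
    print(per_class_f1(gold, pred))
    print(macro_f1(gold, pred), weighted_f1(gold, pred))
```

In this toy case the weighted average is higher than the macro average because the majority class dominates it; with rare hateful classes, the choice of average can move the headline number substantially.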

Feature insights: Textual features (n-grams) dominated expert-trained models; extra-linguistic features (POS, gender, Brown clusters) mattered more for amateur-trained models. Author Historical Salient Terms performed surprisingly poorly, especially with expert annotations, suggesting hate speech is more situational than author-specific.

Connections

Notes

Strengths: The paper rigorously measures annotator influence with clear experimental design and honest reporting of underperformance on held-out data. The introduction of the "both" category is a thoughtful methodological contribution grounded in intersectional theory, not just annotation convenience. The finding that full agreement from amateurs can be reliable is practically valuable.

Weaknesses: Treating the experts as a "single entity" for privacy obscures individual variation and limits deeper analysis. The poor performance on the multi-class task (F1 = 53.43) raises the question of whether the task itself is well formulated; unweighted F1 may be an inappropriate metric for imbalanced hate speech data. The paper acknowledges this briefly but does not resolve it. The features are also dated (2016 methods); neural models now dominate hate speech detection.

Open questions: How does expert annotator demographic composition influence the results? Does expertise primarily correlate with higher standards (fewer false positives) or deeper contextual knowledge? The paper doesn't separate these. The false-positive bias in held-out evaluation deserves more investigation.