Bias in Language Models¶

Bias in language models refers to the systematic tendency of models to produce outputs that reflect or amplify social stereotypes, discrimination, or unfair treatment of protected groups. It can manifest as gender bias (e.g., associating occupations with specific genders), racial bias (e.g., discriminatory decision-making based on race), age bias, and other forms of social prejudice.

Language models trained on large internet corpora inherit the biases present in human-generated text. These biases can be particularly harmful when models are deployed in high-stakes decision-making contexts like hiring, lending, or admissions. Research in this area focuses on three key questions: How do we measure bias in language models? What are the underlying mechanisms? How can we reduce bias without sacrificing model performance?
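As a rough illustration of the measurement question, the sketch below scores how strongly a set of text completions for an occupation leans toward male or female pronouns, echoing the occupation and gender example above. It is a toy under stated assumptions, not a method from any of the papers listed on this page: the pronoun lists, the `gender_skew` helper, and the sample completions are all hypothetical and stand in for real model outputs.

```python
import re
from collections import Counter

# Hypothetical pronoun lists; a real audit would use a broader, validated lexicon.
MALE = {"he", "him", "his"}
FEMALE = {"she", "her", "hers"}


def gender_skew(completions):
    """Return (male - female) / (male + female) pronoun counts, in [-1, 1].

    Positive values lean male, negative values lean female.
    Returns 0.0 when no gendered pronoun appears.
    """
    counts = Counter()
    for text in completions:
        # Lowercase and split into simple word tokens.
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in MALE:
                counts["male"] += 1
            elif token in FEMALE:
                counts["female"] += 1
    total = counts["male"] + counts["female"]
    if total == 0:
        return 0.0
    return (counts["male"] - counts["female"]) / total


# Made-up completions standing in for model outputs; not real model data.
samples = {
    "nurse": ["She checked the chart.", "She said her shift was over.", "He was tired."],
    "engineer": ["He fixed the bug.", "He asked his manager.", "She wrote the tests."],
}
skews = {occupation: gender_skew(texts) for occupation, texts in samples.items()}
```

With these made-up samples, "nurse" scores negative and "engineer" scores positive. A pronoun count like this only captures one narrow, surface-level signal; it says nothing about framing, stance, or downstream decisions.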

Key papers¶

Measuring Political Bias in Large Language Models: What Is Said and How It Is Said — Framework for measuring political bias in LLM-generated content, separating stance and framing biases across 14 political topics
Discovering Language Model Behaviors with Model-Written Evaluations — Finds that RLHF training can amplify political bias (toward liberal positions) and that larger models express stronger political views; demonstrates inverse scaling on political opinion tasks
A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities — Survey covering fake news detection methods and bias

Related topics¶

Fairness in NLP (broader fairness considerations)
RLHF Training (technique used for bias mitigation, though it can also amplify some biases)
Model Alignment (aligning models with ethical values)