Bias in Language Models

Bias in language models is the systematic tendency of a model to produce outputs that reflect or amplify social stereotypes, or that treat protected groups unfairly. It can appear as gender bias (e.g., associating occupations with specific genders), racial bias (e.g., decisions that differ by race), age bias, and other forms of social prejudice.

Language models trained on large internet corpora inherit the biases present in human-generated text. These biases are particularly harmful when models are deployed in high-stakes decision-making contexts such as hiring, lending, or admissions. Research in this area focuses on three key questions: How can bias in language models be measured? What mechanisms produce it? How can it be reduced without sacrificing model performance?
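To make the measurement question concrete, here is a minimal sketch of one possible approach, a counterfactual substitution test: the same prompt is scored with only the group term changed, and the scores are compared. This is an illustration under stated assumptions, not a method taken from this page. The names `counterfactual_gap` and `toy_score` are hypothetical, and `toy_score` is a deliberately biased stand-in for a real model's scoring function.

```python
from typing import Callable, Dict, List


def counterfactual_gap(
    score: Callable[[str], float],
    template: str,
    groups: List[str],
) -> Dict[str, float]:
    """Fill the template with each group term and return each
    group's score minus the mean score across all groups."""
    scores = {g: score(template.format(group=g)) for g in groups}
    mean = sum(scores.values()) / len(scores)
    return {g: s - mean for g, s in scores.items()}


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a model. It is biased on purpose
    so that the gap is visible; a real test would call a model."""
    return 0.9 if text.startswith("He ") else 0.6


if __name__ == "__main__":
    gaps = counterfactual_gap(
        toy_score,
        "{group} applied for the engineering job.",
        ["He", "She"],
    )
    for group, gap in gaps.items():
        print(f"{group}: {gap:+.2f}")
```

A gap near zero for every group would suggest the scorer is insensitive to the swapped term on this one template. A single template says little on its own, so a real evaluation would average over many templates and group terms.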

Key papers