Evaluation metrics for language models

Evaluation metrics quantitatively measure language model performance across diverse tasks and capabilities. These metrics range from simple accuracy counts to sophisticated semantic similarity measures, each with strengths and limitations.

Categories of metrics

Token-level metrics:

  • Accuracy: proportion of predictions that match the gold label
  • Precision, recall, F1: per-class measures that remain informative when classes are imbalanced
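
To illustrate why precision, recall, and F1 matter on imbalanced classes, here is a minimal sketch in plain Python. The function names and the toy labels are hypothetical, chosen only for illustration; a real project would more likely use an established library.

```python
def accuracy(gold, pred):
    """Fraction of positions where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def precision_recall_f1(gold, pred, positive=1):
    """Precision, recall, and F1 for a single class treated as positive."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy imbalanced data: nine negatives, one positive.
gold = [0] * 9 + [1]
# A degenerate model that always predicts the majority class.
pred = [0] * 10

print(accuracy(gold, pred))             # 0.9
print(precision_recall_f1(gold, pred))  # (0.0, 0.0, 0.0)
```

The always-negative predictor reaches 90% accuracy while never finding the positive class, which the zero recall and F1 make visible.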

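String-level metrics:

  • BLEU: modified n-gram precision against one or more reference translations, combined with a brevity penalty
  • ROUGE: recall-oriented n-gram and longest-common-subsequence overlap, used mainly for summarization
  • METEOR: unigram alignment that also matches stems and synonyms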
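
The n-gram overlap underlying these metrics can be sketched as follows. This is only the counting step, not full BLEU or ROUGE: BLEU additionally combines several n-gram orders and applies a brevity penalty, and ROUGE has further variants. The helper names are hypothetical, and whitespace tokenization is assumed for simplicity.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_counts(candidate, reference, n):
    """Clipped overlap plus candidate and reference n-gram totals."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection takes the minimum count per n-gram, so a
    # repeated candidate n-gram is credited at most as often as it
    # appears in the reference (the "clipping" used by BLEU).
    overlap = sum((cand & ref).values())
    return overlap, sum(cand.values()), sum(ref.values())


def ngram_precision(candidate, reference, n=1):
    """BLEU-style direction: share of candidate n-grams found in the reference."""
    overlap, cand_total, _ = overlap_counts(candidate, reference, n)
    return overlap / cand_total if cand_total else 0.0


def ngram_recall(candidate, reference, n=1):
    """ROUGE-N-style direction: share of reference n-grams found in the candidate."""
    overlap, _, ref_total = overlap_counts(candidate, reference, n)
    return overlap / ref_total if ref_total else 0.0


candidate = "the the the the".split()
reference = "the cat sat on the mat".split()

print(ngram_precision(candidate, reference, n=1))  # 0.5
print(ngram_recall(candidate, reference, n=1))
```

Without clipping, the degenerate candidate above would score a unigram precision of 1.0, since every one of its tokens occurs somewhere in the reference.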

Semantic metrics:

  • BERTScore: similarity between the contextual token embeddings of a candidate and its reference
  • BLEURT: learned metric trained to predict human judgments
  • CIDEr, SPICE: metrics designed for image captioning and other vision-language tasks
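
The matching idea behind embedding-based metrics such as BERTScore can be sketched with toy vectors. This is an illustration under strong assumptions: the two-dimensional vectors below are made up and stand in for contextual embeddings from a real encoder, and refinements such as importance weighting and baseline rescaling are omitted. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_match_scores(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Made-up embeddings: the candidate covers only one of the two reference tokens.
cand_vecs = [[1.0, 0.0]]
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]

print(greedy_match_scores(cand_vecs, ref_vecs))
```

Here the single candidate token matches a reference token perfectly, so precision is high, while the uncovered reference token pulls recall down.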

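Reference-free metrics:

  • Perplexity: exponential of the average negative log-likelihood per token on held-out text; lower is better
  • Self-BLEU: BLEU of each generated sample against the other samples; lower values indicate more diverse output
  • Factuality scores: measure how well generated claims are grounded in facts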
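
Perplexity is conventionally computed as the exponential of the mean negative log-likelihood per token. A minimal sketch, assuming the per-token natural-log probabilities have already been obtained from a model (the function name and input format are hypothetical):

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of each held-out token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 0.25 to every token behaves like a
# uniform choice among four options, so its perplexity is about 4.
uniform_log_probs = [math.log(0.25)] * 5
print(perplexity(uniform_log_probs))
```

Perplexity values are only comparable between models that share a tokenizer, because the average is taken per token.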

Human evaluation:

  • Expert ratings on fluency, coherence, factuality
  • Likert scale judgments
  • Pairwise preference judgments
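
One simple way to aggregate pairwise preference judgments is a per-system win rate. The sketch below is an assumption about how such data might be organized, not a standard protocol: the tuple format, the system names, and the choice to count a tie as half a win for each side are all hypothetical.

```python
from collections import defaultdict


def win_rates(judgments):
    """Win rate per system from (system_a, system_b, winner) tuples.

    winner is one of the two system names, or None for a tie.
    """
    wins = defaultdict(float)
    totals = defaultdict(int)
    for a, b, winner in judgments:
        totals[a] += 1
        totals[b] += 1
        if winner is None:
            # Assumption: a tie counts as half a win for each system.
            wins[a] += 0.5
            wins[b] += 0.5
        elif winner in (a, b):
            wins[winner] += 1.0
        else:
            raise ValueError(f"winner {winner!r} is not one of the compared systems")
    return {system: wins[system] / totals[system] for system in totals}


# Hypothetical annotations comparing two systems on four prompts.
judgments = [
    ("A", "B", "A"),
    ("A", "B", "A"),
    ("A", "B", "B"),
    ("A", "B", None),
]
print(win_rates(judgments))  # {'A': 0.625, 'B': 0.375}
```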

Challenges in LLM evaluation

Metric-performance misalignment: Automatic metrics often correlate poorly with human judgment, especially for generative tasks.
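
Agreement between an automatic metric and human judgment is commonly quantified with a rank correlation such as Spearman's rho. The sketch below implements it in plain Python as Pearson correlation over average ranks; the scores are made-up placeholders, not results from any study, and in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def spearman(x, y):
    """Spearman rank correlation between two score lists."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for four system outputs (illustration only).
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_ratings = [1, 3, 2, 5]
print(spearman(metric_scores, human_ratings))
```

A value near 1 means the metric ranks outputs the way the annotators do; values near 0 indicate the kind of misalignment described above.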

Task specificity: Different tasks require different metrics; no single metric works for every task.

Robustness: Metrics can be gamed, and small perturbations to an output can change its score drastically.
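
As one illustration of this sensitivity, strict exact match flips from correct to incorrect on a single trailing period. The sketch below contrasts it with a normalized variant; the normalization steps (lowercasing, stripping ASCII punctuation, collapsing whitespace) are an assumption chosen for this example, not a standard, and the function names are hypothetical.

```python
import string


def exact_match(prediction, reference):
    """1.0 if the strings are identical, else 0.0."""
    return float(prediction == reference)


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def normalized_exact_match(prediction, reference):
    """Exact match after applying the same normalization to both strings."""
    return float(normalize(prediction) == normalize(reference))


print(exact_match("Paris.", "Paris"))             # 0.0
print(normalized_exact_match("Paris.", "Paris"))  # 1.0
```

Normalization removes this particular failure mode, but it is itself a design choice that can hide real errors, for example when punctuation or casing carries meaning.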

Hallucination blindness: Many metrics fail to penalize text that is fluent but factually incorrect.

Cost: Human evaluation is expensive and slow.

Key papers

  • A Survey on Evaluation of Large Language Models — comprehensive survey covering all major evaluation metrics, their strengths, and limitations across NLP tasks
  • [[2023-wang-medical-summarization-metrics]] — analysis of metric-human disagreement in medical multi-document summarization