Evaluation metrics for language models
Evaluation metrics quantitatively measure language model performance across diverse tasks and capabilities. These metrics range from simple accuracy counts to sophisticated semantic similarity measures, each with strengths and limitations.
Categories of metrics
Token-level metrics:
- Accuracy: proportion of correct predictions
- Precision, recall, F1: per-class measures that remain informative when classes are imbalanced
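As a minimal sketch of why precision, recall, and F1 matter on imbalanced data, the snippet below computes them from scratch for a binary task. The function names and the toy labels are illustrative assumptions, not taken from any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy, made-up data: nine negatives, one positive, and a model that
# always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

print(accuracy(y_true, y_pred))             # 0.9
print(precision_recall_f1(y_true, y_pred))  # (0.0, 0.0, 0.0)
```

Accuracy looks strong here even though the positive class is never found, which is the gap that precision, recall, and F1 expose.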
String-level metrics:
- BLEU: n-gram precision against reference translations, with a brevity penalty
- ROUGE: recall-oriented n-gram and longest-common-subsequence overlap, used mainly for summarization
- METEOR: unigram matching that also credits stems and synonyms
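The n-gram overlap behind BLEU and ROUGE-N can be sketched in a few lines. This is a simplified illustration, assuming whitespace tokenization and a single reference: full BLEU combines clipped precisions for several n-gram orders with a brevity penalty, which is omitted here. The helper names `ngrams` and `ngram_overlap` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n):
    """Return (precision, recall) of clipped n-gram matches.

    Precision is the building block of BLEU; recall corresponds to ROUGE-N.
    """
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection takes the minimum count per n-gram ("clipping"),
    # so repeating a matching n-gram cannot inflate the score.
    matched = sum((cand & ref).values())
    precision = matched / sum(cand.values()) if cand else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    return precision, recall


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

print(ngram_overlap(candidate, reference, 2))  # (0.6, 0.6)
```

For reported results, an established implementation is preferable to a hand-rolled one, since tokenization and smoothing choices change the scores.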
Semantic metrics:
- BERTScore: contextual embedding similarity to references
- BLEURT: learned metric trained to predict human ratings
- CIDEr, SPICE: image-captioning metrics based on TF-IDF-weighted n-gram consensus (CIDEr) and scene-graph matching (SPICE)
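Reference-free metrics:
- Perplexity: exponentiated average negative log-likelihood per token on held-out text (lower is better)
- Self-BLEU: BLEU of each generated sample against the other samples; lower values indicate more diverse output
- Factuality scores: measure how well the output is grounded in facts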
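Perplexity is the exponential of the mean negative log-probability the model assigns to each token. The sketch below assumes the per-token natural-log probabilities have already been obtained from a model; the function name and the example values are illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp(-mean(log p)) over per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Made-up example: a model that assigns probability 0.25 to every token
# is as uncertain as a uniform choice among four options.
log_probs = [math.log(0.25)] * 3
print(f"{perplexity(log_probs):.2f}")  # 4.00
```

Because the value depends on the tokenizer, perplexities are only directly comparable between models that share a vocabulary.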
Human evaluation:
- Expert ratings on fluency, coherence, factuality
- Likert-scale judgments
- Pairwise preference judgments
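Pairwise preference judgments are often summarized as a win rate. The following is one possible aggregation, assuming each judgment is recorded as `"A"`, `"B"`, or `"tie"` and that a tie counts as half a win; both the labels and the tie rule are assumptions made for illustration.

```python
def win_rate(judgments, system="A"):
    """Share of comparisons won by `system`, counting ties as half a win."""
    if not judgments:
        raise ValueError("need at least one judgment")
    score = 0.0
    for judgment in judgments:
        if judgment == system:
            score += 1.0
        elif judgment == "tie":
            score += 0.5
    return score / len(judgments)


# Made-up annotator decisions for four prompts.
judgments = ["A", "A", "B", "tie"]
print(win_rate(judgments, "A"))  # 0.625
print(win_rate(judgments, "B"))  # 0.375
```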
Challenges in LLM evaluation
Metric-performance misalignment: Automatic metrics often correlate poorly with human judgment, especially for generative tasks.
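Agreement between an automatic metric and human judgment is commonly quantified with a rank correlation. Below is a dependency-free sketch of Spearman's rho (Pearson correlation computed on average ranks). The scores are invented for illustration, and a statistics library would normally be used in practice.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical metric scores and human ratings for five system outputs.
metric_scores = [0.31, 0.45, 0.28, 0.52, 0.40]
human_ratings = [2, 3, 4, 5, 1]
print(f"{spearman(metric_scores, human_ratings):.2f}")  # 0.30
```

A low value such as this one would indicate that ranking outputs by the metric says little about how humans would rank them.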
Task specificity: Different tasks require different metrics; no single metric works across all of them.
Robustness: Metrics can be gamed, and small perturbations to the input or output can change scores drastically.
Hallucination blindness: Many metrics fail to penalize text that is fluent but factually incorrect.
Cost: Human evaluation is expensive and slow.
Key papers
- A Survey on Evaluation of Large Language Models — comprehensive survey covering all major evaluation metrics, their strengths, and limitations across NLP tasks
- [[2023-wang-medical-summarization-metrics]] — analysis of metric-human disagreement in medical multi-document summarization