NLP evaluation
Evaluation in NLP encompasses both automatic metrics (BLEU, ROUGE, perplexity) and human judgment, with growing emphasis on understanding when metrics correlate with human assessments and on establishing best practices for collecting reliable human annotations.
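One common way to check whether an automatic metric tracks human assessments is to compute a rank correlation between per-output metric scores and human ratings. The sketch below is a minimal, standard-library illustration of Spearman's rank correlation. The function names (`average_ranks`, `pearson`, `spearman`) and the sample numbers are hypothetical, chosen only for illustration, and are not taken from any of the papers listed here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the window over every item tied with the one at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_ratings):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_ratings):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))


if __name__ == "__main__":
    # Made-up scores for five system outputs (illustrative only).
    metric_scores = [0.31, 0.45, 0.28, 0.62, 0.50]
    human_ratings = [2, 4, 3, 5, 3]
    print(f"Spearman rho: {spearman(metric_scores, human_ratings):.3f}")
```

A value near 1 suggests the metric orders outputs much as the annotators do, while a value near 0 suggests it does not. In practice, an established implementation such as `scipy.stats.spearmanr` is usually preferable, since it also reports a p-value.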
Key papers
- TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP — Framework for standardized evaluation of NLP models under adversarial attack, unifying 16 attacks, 82+ models, and multiple task types.
- Clark et al. (2021) — All That's 'Human' Is Not Gold: Investigates how well untrained human evaluators can distinguish human-written from machine-generated text, and tests lightweight training methods to improve evaluator accuracy.
Related topics
- Human evaluation — human-centered evaluation approaches
- Quality assessment — assessing system output quality
- Evaluation metrics for language models — automated evaluation metrics