Robustness in language models
Robustness refers to a model's ability to maintain performance and reliability when facing distribution shifts, adversarial inputs, and unexpected perturbations. A robust model performs similarly on in-distribution and out-of-distribution data, and resists intentional attacks.
Forms of distribution shift
Covariate shift: The input distribution P(x) changes while the conditional output distribution P(y | x) stays the same (e.g., different image quality in vision tasks).
Label shift: The class frequencies P(y) differ between training and test data while P(x | y) stays the same (e.g., training on balanced data but testing on imbalanced data).
Concept drift: The relationship between input and output changes over time (e.g., meaning of words evolving).
Domain shift: The model is applied to a distinct new domain with different characteristics (e.g., an NLP model trained on news text and then applied to clinical notes).
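As a rough illustration of label shift, the sketch below compares class frequencies in two label sets using total variation distance. The helper names (`label_distribution`, `total_variation`) and the toy label lists are assumptions made for this example, and total variation is only one of several reasonable ways to quantify the gap.

```python
from collections import Counter


def label_distribution(labels):
    """Return a dict mapping each label to its relative frequency."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


# Hypothetical data: balanced training labels, imbalanced test labels.
train_labels = ["pos"] * 50 + ["neg"] * 50
test_labels = ["pos"] * 90 + ["neg"] * 10

shift = total_variation(
    label_distribution(train_labels),
    label_distribution(test_labels),
)
print(f"label shift (TV distance): {shift:.2f}")
```

A distance of 0 means the two label distributions match, and 1 means they share no labels. This check only sees the labels, so it cannot detect covariate shift or concept drift on its own.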
Adversarial perturbations
Character-level attacks: Typos, misspellings, and character insertions/deletions.
Word-level attacks: Synonym replacement, word swaps, paraphrasing.
Sentence-level attacks: Negation flipping, semantic perturbations, deliberate contradictions.
Prompt-based attacks: Adversarial prompts designed to trigger misbehavior.
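To make the character-level case concrete, here is a minimal sketch of a typo-style perturbation that swaps adjacent letters at random. The function name `swap_adjacent_chars` and its `rate` and `seed` parameters are hypothetical, chosen for this example. It produces random noise rather than a true adversarial attack, which would search for the edits that change a specific model's prediction.

```python
import random


def swap_adjacent_chars(text, rate=0.1, seed=0):
    """Swap adjacent letters with probability `rate` to mimic typing errors."""
    rng = random.Random(seed)  # local generator keeps the output reproducible
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each character moves at most once
        else:
            i += 1
    return "".join(chars)


original = "the service was excellent and the staff were friendly"
perturbed = swap_adjacent_chars(original, rate=0.2, seed=42)
print(perturbed)
```

Feeding both `original` and `perturbed` to a classifier and comparing the predictions gives a simple empirical check of sensitivity to typos. Only letters are swapped, so spaces and punctuation stay in place and word boundaries are preserved.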
Adversarial attacks and defenses
"A Comprehensive Survey on Trustworthy Graph Neural Networks: Privacy, Robustness, Fairness, and Explainability" surveys the robustness of graph neural networks against adversarial attacks on both node features and graph structure, covering certified defenses and empirical robustness evaluations.
Measurement and benchmarks
Out-of-distribution benchmarks:
- GLUE, SuperGLUE: include evaluation on OOD examples
- AdvGLUE: adversarial variants of standard benchmarks
- Domain-specific shifts (e.g., medical vs. news text)
Adversarial robustness metrics:
- Certified robustness: a provable guarantee that the model's prediction stays correct for every perturbation within a bounded set
- Empirical robustness: measured accuracy under adversarial attack
- Attack success rate: the fraction of attacked examples on which the attack changes the outcome, reported under different threat models
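The empirical metrics above can be computed from model predictions on clean and attacked inputs. The following is a minimal sketch: the function `robustness_metrics` and the toy prediction lists are assumptions for illustration. It uses one common convention for attack success rate (the fraction of originally correct examples that become incorrect after the attack); other definitions exist.

```python
def robustness_metrics(labels, clean_preds, adv_preds):
    """Compute clean accuracy, robust accuracy, and attack success rate."""
    if not (len(labels) == len(clean_preds) == len(adv_preds)):
        raise ValueError("all inputs must have the same length")
    if not labels:
        raise ValueError("inputs must not be empty")

    clean_correct = [pred == y for pred, y in zip(clean_preds, labels)]
    adv_correct = [pred == y for pred, y in zip(adv_preds, labels)]

    n = len(labels)
    clean_accuracy = sum(clean_correct) / n
    robust_accuracy = sum(adv_correct) / n

    # Only examples the model got right before the attack count as attack targets.
    targets = sum(clean_correct)
    flipped = sum(c and not a for c, a in zip(clean_correct, adv_correct))
    attack_success_rate = flipped / targets if targets else 0.0

    return {
        "clean_accuracy": clean_accuracy,
        "robust_accuracy": robust_accuracy,
        "attack_success_rate": attack_success_rate,
    }


# Hypothetical predictions on five examples, before and after an attack.
labels = [1, 0, 1, 1, 0]
clean_preds = [1, 0, 1, 0, 0]
adv_preds = [1, 1, 0, 0, 0]

metrics = robustness_metrics(labels, clean_preds, adv_preds)
print(metrics)
```

Reporting robust accuracy next to clean accuracy makes the gap between the two visible. These numbers are only a lower bound on vulnerability, since a stronger attack could flip more predictions, and that is the gap certified robustness aims to close.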
Challenges
Performance-robustness tradeoff: Maximizing clean accuracy sometimes comes at the cost of robustness, and vice versa.
Computational cost: Adversarial training and certified robustness computation are expensive.
Generalization of defenses: Defenses against one attack type may not transfer to others.
Scale: Robustness at the scale of billions of parameters is less well studied than robustness in smaller models.
Key papers
- A Survey on Evaluation of Large Language Models — survey covering robustness evaluation across out-of-distribution generalization and adversarial attacks
Related topics
- Large Language Models
- Adversarial Attacks
- Model safety
- Misinformation — adversarially crafted disinformation is a robustness challenge