Low-Resource Language Processing

Low-resource languages, which make up most of the world's languages, have limited parallel text, sparse labeled corpora, and minimal representation in pretraining datasets. NLP systems for these languages consequently suffer from degraded performance, hallucinations, and biased outputs. Low-resource language processing develops methods that use multilingual transfer, code-switching, data augmentation, and unsupervised learning to build practical systems despite data scarcity.

Machine translation into and out of low-resource languages exemplifies the challenge: models must generalize from limited parallel data while preserving the meaning of the source text. Hallucinations are endemic to low-resource translation, where insufficient training signal and training data mismatch make output unreliable.

Key papers

  • [[2023-guerreiro-hallucinations-multilingual]] — shows that hallucination rates in machine translation exceed 10% for low-resource pairs and decline sharply with increasing resource availability