Text embeddings and word representations
Techniques for converting text into numerical vectors that preserve semantic and syntactic properties, enabling machine learning on textual data.
Approaches
Pre-trained word embeddings:
Static vectors learned on large corpora (e.g., GloVe on Wikipedia and newswire text, Word2Vec on Google News or Twitter) that assign each word a single vector regardless of context and capture general semantic relationships between words; fast to look up, lightweight, and transferable across tasks.
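As a minimal sketch of how such vectors are typically consumed, the snippet below parses a few lines in the plain-text layout used by GloVe releases (a token followed by its components, separated by spaces) and compares words by cosine similarity. The three vectors and the names `TOY_VECTORS`, `load_text_vectors`, and `cosine` are made up for illustration; a real file would hold hundreds of thousands of rows with 100 to 300 components each.

```python
import io
import numpy as np

# Toy stand-in for a GloVe-style text file: one token per line followed by
# its vector components. The values are invented for illustration only.
TOY_VECTORS = """\
cat 0.9 0.1 0.0
dog 0.8 0.2 0.1
car 0.0 0.9 0.4
"""

def load_text_vectors(handle):
    """Parse 'token v1 v2 ...' lines into a dict of token -> NumPy vector."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vectors = load_text_vectors(io.StringIO(TOY_VECTORS))
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_car = cosine(vectors["cat"], vectors["car"])
print(f"cat~dog: {sim_cat_dog:.3f}  cat~car: {sim_cat_car:.3f}")
```

With a real embedding file, the same function can be pointed at `open(path, encoding="utf-8")` instead of the in-memory string.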
Contextual embeddings:
Language model embeddings (BERT, RoBERTa, GPT) that produce token vectors conditioned on the surrounding context, so the same word receives different vectors in different sentences; more expressive than static embeddings but computationally expensive.
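The context dependence can be illustrated with a deliberately tiny example. The sketch below is not BERT or any real model: it applies a single self-attention step with no learned parameters to hypothetical 2-D static vectors, which is just enough to show the same word ("bank") ending up with different vectors in different sentences. All names and values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical 2-D static vectors; real models learn these and use
# hundreds of dimensions.
STATIC = {
    "river": np.array([1.0, 0.0]),
    "money": np.array([0.0, 1.0]),
    "bank": np.array([0.5, 0.5]),
}

def contextualize(tokens):
    """One unparameterized self-attention step.

    Each output vector is a softmax-weighted average of all input
    vectors in the sentence, weighted by dot-product similarity.
    """
    x = np.stack([STATIC[t] for t in tokens])     # (n_tokens, dim)
    scores = x @ x.T                              # pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

# The static lookup for "bank" is identical in both sentences,
# but the contextualized vectors differ.
bank_near_river = contextualize(["river", "bank"])[1]
bank_near_money = contextualize(["money", "bank"])[1]
print(bank_near_river, bank_near_money)
```

Real contextual models stack many such layers with learned projections, positional information, and feed-forward blocks, which is where the extra computational cost comes from.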
Document/sentence embeddings:
Aggregations of word embeddings (averaging, max-pooling, attention weighting) or dedicated models (doc2vec, Universal Sentence Encoder) that map an entire text to a single fixed-length vector.
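The three aggregation strategies named above can be sketched in a few lines of NumPy. The function names are illustrative, and the `query` vector in `attention_pool` is a hypothetical stand-in for what would normally be a learned parameter.

```python
import numpy as np

def mean_pool(token_vectors):
    """Average the word vectors component-wise."""
    return token_vectors.mean(axis=0)

def max_pool(token_vectors):
    """Take the component-wise maximum over the word vectors."""
    return token_vectors.max(axis=0)

def attention_pool(token_vectors, query):
    """Softmax-weighted average; weights come from dot products with `query`."""
    scores = token_vectors @ query
    weights = np.exp(scores - scores.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ token_vectors

# Two toy 3-D word vectors standing in for a two-word sentence.
tokens = np.array([[1.0, 0.0, 2.0],
                   [3.0, 4.0, 0.0]])

sentence_mean = mean_pool(tokens)
sentence_max = max_pool(tokens)
# A zero query gives uniform weights, so attention pooling reduces to the mean.
sentence_attn = attention_pool(tokens, np.zeros(3))
```

Whatever the sentence length, each function returns a vector with the same dimensionality as the word vectors, which is what makes the result usable as a fixed-length input to a classifier.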
Specialized embeddings:
Domain-specific embeddings trained on medical texts, scientific papers, or social media; often outperform general-purpose embeddings on specialized tasks.
Key papers in this wiki
- Efficient Estimation of Word Representations in Vector Space — Introduces CBOW and Skip-gram architectures for efficiently learning word representations from large corpora; shows 50–300 dimensional vectors capture semantic and syntactic regularities enabling vector arithmetic (e.g., king − man + woman ≈ queen).
- Misinformation Detection on YouTube Using Video Captions — Compares four pre-trained embeddings (GloVe Wikipedia 100D/300D, Word2Vec Google News 300D, Word2Vec Twitter 200D) for YouTube caption classification; finds Google News and GloVe 300D most effective; discusses limitations and proposes training embeddings directly on video caption data.
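The vector arithmetic mentioned for the Word2Vec paper (king − man + woman ≈ queen) can be illustrated with hand-built vectors. The 3-D values below are invented so that the arithmetic works out cleanly; they are not taken from any trained model, and `analogy` is a hypothetical helper that follows the common convention of excluding the three query words from the candidates.

```python
import numpy as np

# Invented 3-D vectors; the axes loosely stand for (royalty, male, female).
TOY = {
    "king": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man": np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "prince": np.array([0.9, 1.0, 0.0]),
}

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest (by cosine similarity)
    to vectors[a] - vectors[b] + vectors[c], excluding a, b, and c."""
    target = vectors[a] - vectors[b] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

result = analogy("king", "man", "woman", TOY)
print(result)
```

On trained embeddings the match is approximate rather than exact, which is why the nearest neighbour is reported instead of testing for equality.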
Related topics
- Natural Language Processing — text analysis and feature engineering
- Language Models — neural approaches to learning representations
- Text classification — downstream application of embeddings