
Text embeddings and word representations

Techniques for converting text into numerical vectors that preserve semantic and syntactic properties, enabling machine learning on textual data.

Approaches

Pre-trained word embeddings:
Static vectors learned on large corpora (e.g., GloVe on Wikipedia/news, Word2Vec on Google News/Twitter) that capture general semantic relationships between words. Each word maps to a single vector regardless of context, so lookup is fast and lightweight, and the vectors transfer across tasks.
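
As a minimal sketch of how such vectors are typically consumed, the snippet below parses the plain-text layout used by GloVe releases (one token per line followed by its space-separated components) and compares words by cosine similarity. The three-dimensional vectors and the names `TOY_GLOVE_TEXT`, `load_embeddings`, and `cosine` are made up for illustration; real files hold 50 to 300 components per word and would be read from disk.

```python
import math

# Toy stand-in for a GloVe-style text file (hypothetical values, not real
# embeddings): each line is a token followed by its vector components.
TOY_GLOVE_TEXT = """\
cat 0.9 0.1 0.0
dog 0.8 0.2 0.1
car 0.0 0.1 0.9
"""


def load_embeddings(text):
    """Parse 'word v1 v2 ...' lines into a {word: [float, ...]} table."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = [float(x) for x in parts[1:]]
    return table


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


embeddings = load_embeddings(TOY_GLOVE_TEXT)
sim_cat_dog = cosine(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine(embeddings["cat"], embeddings["car"])
```

With these toy values, `sim_cat_dog` comes out much higher than `sim_cat_car`, which is the kind of relationship a trained embedding table is expected to encode.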

Contextual embeddings:
Embeddings produced by language models (BERT, RoBERTa, GPT), in which the vector for a word depends on its surrounding context, so the same word can receive different vectors in different sentences; more expressive than static embeddings but computationally more expensive.

Document/sentence embeddings:
Aggregations of word embeddings (averaging, max-pooling, attention weighting) or specialized methods (doc2vec, Universal Sentence Encoder) for fixed-length representations of entire texts.
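
The three aggregation strategies named above can be sketched in a few lines. This is an illustrative sketch, not a reference implementation: the function names, the two-dimensional toy vectors, and the fixed `query` vector are assumptions. In a trained model the attention query (or scoring function) would be learned rather than supplied by hand.

```python
import math


def mean_pool(vectors):
    """Average each dimension across all word vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def max_pool(vectors):
    """Take the maximum of each dimension across all word vectors."""
    dim = len(vectors[0])
    return [max(v[i] for v in vectors) for i in range(dim)]


def attention_pool(vectors, query):
    """Weighted average; weights are a softmax over dot products with `query`."""
    scores = [sum(a * b for a, b in zip(v, query)) for v in vectors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


# Hypothetical word vectors for a three-word sentence.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

sentence_mean = mean_pool(word_vectors)
sentence_max = max_pool(word_vectors)
sentence_attn = attention_pool(word_vectors, query=[1.0, 0.0])
```

All three functions return a vector with the same dimensionality as the inputs regardless of how many words the text contains, which is what makes the result usable as a fixed-length feature.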

Specialized embeddings:
Domain-specific embeddings trained on medical texts, scientific papers, or social media; often outperform general-purpose embeddings on specialized tasks.

Key papers in this wiki

  • Efficient Estimation of Word Representations in Vector Space — Introduces CBOW and Skip-gram architectures for efficiently learning word representations from large corpora; shows 50–300 dimensional vectors capture semantic and syntactic regularities enabling vector arithmetic (e.g., king − man + woman ≈ queen).
  • Misinformation Detection on YouTube Using Video Captions — Compares four pre-trained embeddings (GloVe Wikipedia 100D/300D, Word2Vec Google News 300D, Word2Vec Twitter 200D) for YouTube caption classification; finds Google News and GloVe 300D most effective; discusses limitations and proposes training embeddings directly on video caption data.
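
The vector arithmetic mentioned for the first paper (king − man + woman ≈ queen) can be illustrated with a toy table. The two-dimensional vectors below are hand-built so that one axis loosely stands for royalty and the other for gender; they are hypothetical and not taken from any trained model, and `analogy` is an illustrative helper name. Excluding the three query words from the candidates follows common practice in analogy evaluation.

```python
import math

# Hand-built toy vectors (hypothetical): [royalty, gender].
TOY_VECTORS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "child": [0.2, 0.0],
    "apple": [-1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def analogy(table, a, b, c):
    """Return the word whose vector is closest to table[a] - table[b] + table[c]."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    candidates = [w for w in table if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(table[w], target))


result = analogy(TOY_VECTORS, "king", "man", "woman")
```

Here `result` is `"queen"`, because the toy vectors were constructed so that the offset from "man" to "woman" matches the offset from "king" to "queen". Trained embeddings only approximate this behavior.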