Distributed Representations of Words and Phrases and their Compositionality¶

Authors: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean Venue: ICLR Workshop 2013 — arXiv

TL;DR¶

Extends the Skip-gram model with techniques for learning phrase representations and demonstrates that word/phrase vectors exhibit strong compositionality via vector arithmetic. Introduces negative sampling as a simpler alternative to hierarchical softmax, subsampling of frequent words for computational speedup, and a simple data-driven phrase identification method; achieves 72% accuracy on phrase analogy tasks and shows that vector addition meaningfully combines semantic properties (e.g., "Russia" + "river" ≈ "Volga River").

Contributions¶

Negative sampling: A simplified variant of Noise Contrastive Estimation that trains faster and achieves better accuracy than hierarchical softmax, especially for frequent words
Subsampling of frequent words: A principled approach to discard high-frequency words during training (equation 5) that yields 2–10× speedup and improves accuracy of rare-word representations
Phrase identification: A data-driven method using unigram/bigram frequency scores to identify phrases (e.g., "New York Times", "Boston Bruins") and treat them as single tokens during training
Phrase analogy task: A new benchmark with 3,218 phrase analogies across five categories (newspapers, sports teams, airlines, executives) for evaluating phrase representation quality
Evidence of compositional structure: Demonstration that element-wise vector addition produces meaningful results (e.g., "Czech" + "currency" ≈ "koruna")

Method¶

Negative Sampling (NEG). Replaces the full softmax objective with a simplified binary classification task. For each positive context pair (word wₜ, context word wₜ₊ⱼ), the model learns to distinguish the true target from k negative samples drawn from noise distribution Pₙ(w). Objective:

log σ(v'ₚₒ ⊤ vₚᵢ) + Σᵢ₌₁ᵏ 𝔼ᵥᵢ∼Pₙ(ᵥ) [log σ(−v'ᵥᵢ ⊤ vₚᵢ)]

Experiments show k=5–20 works well for small datasets, k=2–5 for large datasets. Crucially, the noise distribution Pₙ(w) is set to the unigram distribution raised to the 3/4 power, which outperforms uniform distributions.

Subsampling. Each word wᵢ is kept with probability P(wᵢ) = 1 − √(t / f(wᵢ)), where f(wᵢ) is word frequency and t ≈ 10⁻⁵. This aggressively subsamples very frequent words while preserving ranking of frequencies. Effect: both accelerates training (by removing uninformative context) and improves rare-word vectors.

Phrase Identification. Computes phrase score as:

score(wᵢ, wⱼ) = (count(wᵢwⱼ) − δ) / (count(wᵢ) × count(wⱼ))

where δ is a discounting coefficient preventing spurious phrases from rare words. Bigrams exceeding a threshold become single tokens. Typically runs 2–4 passes with decreasing thresholds to form longer multi-word phrases.

Vector Compositionality. The paper shows word vectors exhibit linear structure enabling meaningful addition. Explanation: vectors are in linear relationship with softmax inputs; vectors represent context distributions; sum of two word vectors relates to the product (AND function) of context distributions.

Results¶

Negative Sampling vs. Hierarchical Softmax: - Negative sampling (k=15, 10⁻⁵ subsampling): 61% accuracy on word analogies - Hierarchical softmax: 55% (without subsampling), 55% (with subsampling) - NEG-15 training time: 36 min vs. HS-Huffman 21 min, but accuracy superior

Phrase Analogy Task (with 33B-word training corpus): - Best model (hierarchical softmax + subsampling, 1000-dim vectors): 72% accuracy - Reduced data (6B words): 66% accuracy (demonstrates data importance) - NEG-15 with subsampling: 42% on phrase dataset (vs. 27% without subsampling)

Subsampling Impact (word analogies): - Without 10⁻⁵ subsampling: 2–8 min training - With 10⁻⁵ subsampling: 2–4 min training, higher accuracy across all methods - Accuracy gain: +1–4 percentage points

Compositional Structure Examples (element-wise addition): - Czech + currency → koruna, crown, zolty (currency-related words) - Vietnam + capital → Hanoi, Ho Chi Minh City - German + airlines → Lufthansa, carrier Lufthansa - Russian + river → Moscow, Volga River, upriver - French + actress → Juliette Binoche, Vanessa Paradis, Charlotte Gainsbourg

Connections¶

Extends prior work on Skip-gram with phrase and improved training techniques
Foundation for word embedding methods now standard in NLP pipelines
Key reference for text embeddings used in downstream fake news detection systems
Related to NLP methods that rely on distributed representations
Precedes contextual embeddings (BERT, GPT) which address limitations of static embeddings but build on foundational concepts here

Notes¶

This paper extended the original Skip-gram model with practical training techniques (negative sampling, subsampling) and phrase handling that proved essential for real-world NLP systems. Negative sampling, in particular, became the standard training objective for embedding models due to its simplicity and superior accuracy-to-speed tradeoff. The phrase identification approach was simple but effective—treating multi-word expressions as atomic units dramatically improves representation quality compared to composing individual word vectors.

Key limitations: Static embeddings cannot handle polysemy (multiple senses of a word). The phrase detection method is data-driven but relatively naive compared to syntactic parsing. Later work (Sutskever et al. 2014; Mikolov et al. 2013 on fastText) extended these ideas to character-level information and learned morphology. Despite these limitations, word2vec remains widely deployed due to computational efficiency and ease of implementation. The semantic-syntactic test set and phrase analogy task remain standard benchmarks for representation quality.