Distributed Representations of Words and Phrases and their Compositionality
Authors: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean
Venue: NIPS 2013 (also on arXiv)
TL;DR
Extends the Skip-gram model with techniques for learning phrase representations and demonstrates that word/phrase vectors exhibit strong compositionality via vector arithmetic. Introduces negative sampling as a simpler alternative to hierarchical softmax, subsampling of frequent words for computational speedup, and a simple data-driven phrase identification method; achieves 72% accuracy on phrase analogy tasks and shows that vector addition meaningfully combines semantic properties (e.g., "Russia" + "river" ≈ "Volga River").
Contributions
- Negative sampling: A simplified variant of Noise Contrastive Estimation that trains faster and achieves better accuracy than hierarchical softmax, especially for frequent words
- Subsampling of frequent words: A simple, heuristically chosen rule for discarding occurrences of high-frequency words during training (equation 5) that yields a 2–10× speedup and improves the accuracy of rare-word representations
- Phrase identification: A data-driven method using unigram/bigram frequency scores to identify phrases (e.g., "New York Times", "Boston Bruins") and treat them as single tokens during training
- Phrase analogy task: A new benchmark with 3,218 phrase analogies across five categories (newspapers, NHL teams, NBA teams, airlines, company executives) for evaluating phrase representation quality
- Evidence of compositional structure: Demonstration that element-wise vector addition produces meaningful results (e.g., "Czech" + "currency" ≈ "koruna")
Method
Negative Sampling (NEG). Replaces the full softmax objective with a simplified binary classification task. For each positive context pair (word wₜ, context word wₜ₊ⱼ), the model learns to distinguish the true target from k negative samples drawn from noise distribution Pₙ(w). Objective:
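$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]$$
where $w_I$ is the input (center) word, $w_O$ is the observed context word, $v$ and $v'$ are the input and output vector representations, and $\sigma$ is the logistic sigmoid.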
The paper reports that k = 5–20 works well for small training sets, while k = 2–5 is enough for large ones. The noise distribution Pₙ(w) is the unigram distribution raised to the 3/4 power, which the authors found to significantly outperform both the plain unigram and the uniform distribution.
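The objective and noise distribution above can be sketched in NumPy as follows. This is a minimal illustration, not the reference word2vec implementation: the function names, the toy counts, and the random vectors are assumptions, and the sketch does not bother to exclude the true context word from the sampled negatives.

```python
import numpy as np

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power` (3/4 in the paper) and renormalized."""
    weights = np.asarray(counts, dtype=np.float64) ** power
    return weights / weights.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_in, v_out_pos, v_out_neg):
    """Negated NEG objective for one (input word, context word) pair.

    v_in:      (d,)   input vector of the center word
    v_out_pos: (d,)   output vector of the observed context word
    v_out_neg: (k, d) output vectors of the k sampled noise words
    """
    pos_term = np.log(sigmoid(v_out_pos @ v_in))
    neg_term = np.log(sigmoid(-(v_out_neg @ v_in))).sum()
    return -(pos_term + neg_term)

rng = np.random.default_rng(0)
counts = [1000, 100, 10, 1]            # toy unigram counts (made up)
p_noise = noise_distribution(counts)   # flatter than the raw unigram distribution

k, d = 5, 8                            # k negatives, d-dimensional vectors
W_in = rng.normal(scale=0.1, size=(len(counts), d))    # input vectors v
W_out = rng.normal(scale=0.1, size=(len(counts), d))   # output vectors v'

neg_ids = rng.choice(len(counts), size=k, p=p_noise)   # draw k noise words
loss = neg_sampling_loss(W_in[0], W_out[1], W_out[neg_ids])
```

Raising counts to the 3/4 power shifts probability mass from the most frequent words toward rarer ones, so rare words are drawn as negatives more often than their raw frequency would suggest.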
Subsampling. Each occurrence of a word wᵢ in the training set is discarded with probability P(wᵢ) = 1 − √(t / f(wᵢ)), where f(wᵢ) is the word's relative frequency and t is a threshold, typically around 10⁻⁵. Words with f(wᵢ) > t are subsampled, the most frequent ones most aggressively, while the ranking of frequencies is preserved. Effect: training is faster (fewer uninformative co-occurrences with words such as "the") and rare-word vectors improve.
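As a rough sketch of this rule, taking the paper's formula as the probability of discarding a token: the helper names and the toy frequencies below are assumptions made for illustration, and clipping negative values to zero is one straightforward way to handle words rarer than t.

```python
import math
import random

def discard_probability(freq, t=1e-5):
    """P(w) = 1 - sqrt(t / f(w)), clipped to 0 for words with f(w) <= t."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

def subsample(tokens, freqs, t=1e-5, rng=None):
    """Drop each token independently with its word's discard probability.

    `freqs` maps a word to its relative frequency in the corpus.
    """
    rng = rng or random.Random(0)
    return [w for w in tokens if rng.random() >= discard_probability(freqs[w], t)]

# Toy relative frequencies (made up): a stop word and a rare word.
freqs = {"the": 0.05, "volga": 1e-6}
tokens = ["the"] * 1000 + ["volga"] * 10
kept = subsample(tokens, freqs)
```

With these numbers a word covering 5% of the corpus is dropped about 98.6% of the time, while a word at or below the threshold is never dropped.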
Phrase Identification. Computes phrase score as:
score(wᵢ, wⱼ) = (count(wᵢwⱼ) − δ) / (count(wᵢ) × count(wⱼ))
where δ is a discounting coefficient preventing spurious phrases from rare words. Bigrams exceeding a threshold become single tokens. Typically runs 2–4 passes with decreasing thresholds to form longer multi-word phrases.
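A single pass of this scoring can be sketched as below. The values of `delta` and `threshold`, the underscore joining convention, and the toy corpus are assumptions chosen for illustration, not settings taken from the paper.

```python
from collections import Counter

def find_phrases(sentences, delta=2, threshold=0.05):
    """Score every bigram as (count(ab) - delta) / (count(a) * count(b))."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    scores = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - delta) / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

def merge_phrases(sentence, phrases):
    """Greedily join accepted bigrams into single tokens, left to right."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            out.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

# Toy corpus (made up): "new york" always co-occurs; "zyx qwv" appears once.
corpus = (
    [["new", "york", "is", "big"]] * 10
    + [["it", "is", "big"]] * 30
    + [["zyx", "qwv"]]
)
phrases = find_phrases(corpus)
merged = merge_phrases(["new", "york", "is", "big"], phrases)
```

Without the discount (`delta=0`) the one-off pair "zyx qwv" would score 1.0, higher than any real phrase; with it the score is negative and the pair is rejected. Running the same two functions again on the merged corpus, with a lower threshold, is how longer phrases could be built up over several passes.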
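Vector Compositionality. The paper shows that word vectors have a linear structure that makes addition meaningful. The explanation offered: the vectors are linearly related to the inputs of the softmax nonlinearity, and because they are trained to predict surrounding words, each vector can be read as encoding the distribution of contexts in which its word appears. These values are logarithmically related to the output probabilities, so the sum of two word vectors corresponds to the product of the two context distributions. The product acts like an AND function: contexts that are likely under both words receive high probability.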
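The sum-to-product argument can be checked numerically under a simplifying assumption: a full softmax over a small vocabulary, with random matrices standing in for trained vectors. All names and sizes below are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
U = rng.normal(size=(vocab_size, dim))   # output (context) vectors, random stand-ins
v_a = rng.normal(size=dim)               # input vector of word a
v_b = rng.normal(size=dim)               # input vector of word b

p_a = softmax(U @ v_a)                   # context distribution predicted from a
p_b = softmax(U @ v_b)                   # context distribution predicted from b
p_sum = softmax(U @ (v_a + v_b))         # context distribution predicted from a + b

# Element-wise product of the two distributions, renormalized.
product = p_a * p_b
product /= product.sum()

same = np.allclose(p_sum, product)
```

Because the logits are linear in the input vector, adding vectors adds logits, and exponentiating turns that sum into a product of the two distributions. A context word gets high probability under the summed vector only if both words predict it.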
Results
Negative Sampling vs. Hierarchical Softmax (total accuracy on word analogies):
- Negative sampling (k=15, 10⁻⁵ subsampling): 61%
- Hierarchical softmax (Huffman tree): 47% without subsampling, 55% with 10⁻⁵ subsampling
- Training time with subsampling: 36 min for NEG-15 vs. 21 min for HS-Huffman, with NEG-15 reaching the higher accuracy
Phrase Analogy Task:
- Best model (hierarchical softmax + subsampling, 1000-dim vectors, 33B-word training corpus): 72% accuracy
- Reduced data (6B words): 66% accuracy, which shows how much training-set size matters
- NEG-15: 42% with subsampling vs. 27% without
Subsampling Impact (word analogies):
- Without subsampling: 38–97 min training, depending on the method
- With 10⁻⁵ subsampling: 14–36 min training
- Accuracy: total accuracy is unchanged or higher with subsampling (NEG-15 stays at 61%, HS-Huffman rises from 47% to 55%)
Compositional Structure Examples (element-wise addition, closest tokens to the sum):
- Czech + currency → koruna, Czech crown, Polish zolty
- Vietnam + capital → Hanoi, Ho Chi Minh City
- German + airlines → airline Lufthansa, carrier Lufthansa
- Russian + river → Moscow, Volga River, upriver
- French + actress → Juliette Binoche, Vanessa Paradis, Charlotte Gainsbourg
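Both the analogy accuracies and the additive examples above come down to a nearest-neighbour query around a combined vector. The sketch below assumes cosine similarity and excludes the query words from the candidates. The three-dimensional toy vectors are hand-made for illustration and are not trained embeddings.

```python
import numpy as np

def nearest(query, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `query`."""
    q = query / np.linalg.norm(query)
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = float(q @ (vec / np.linalg.norm(vec)))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    query = vectors[b] - vectors[a] + vectors[c]
    return nearest(query, vectors, exclude={a, b, c})

# Hand-made toy vectors: axes are roughly (Montreal, Toronto, hockey team).
toy = {
    "Montreal": np.array([1.0, 0.0, 0.0]),
    "Toronto": np.array([0.0, 1.0, 0.0]),
    "Montreal_Canadiens": np.array([1.0, 0.0, 1.0]),
    "Toronto_Maple_Leafs": np.array([0.0, 1.0, 1.0]),
    "Air_Canada": np.array([0.5, 0.5, -1.0]),
}
answer = solve_analogy("Montreal", "Montreal_Canadiens", "Toronto", toy)
```

An analogy question counts as correct only when the top-ranked candidate is the expected answer, so accuracy is the fraction of questions for which `solve_analogy` returns the right token.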
Connections
- Extends prior work on the Skip-gram model with phrase representations and improved training techniques
- Foundation for word embedding methods now standard in NLP pipelines
- Key reference for text embeddings used in downstream fake news detection systems
- Related to NLP methods that rely on distributed representations
- Precedes contextual embeddings (BERT, GPT) which address limitations of static embeddings but build on foundational concepts here
Notes
This paper extended the original Skip-gram model with practical training techniques (negative sampling, subsampling) and phrase handling that proved essential for real-world NLP systems. Negative sampling, in particular, became the standard training objective for embedding models due to its simplicity and superior accuracy-to-speed tradeoff. The phrase identification approach was simple but effective—treating multi-word expressions as atomic units dramatically improves representation quality compared to composing individual word vectors.
Key limitations: Static embeddings cannot handle polysemy (multiple senses of a word), since each token gets exactly one vector. The phrase detection method is data-driven but relatively naive compared to syntactic parsing. Later work on fastText (Bojanowski et al. 2017) extended these ideas with character n-gram (subword) information, which helps with morphology and rare words. Despite these limitations, word2vec remains widely deployed due to its computational efficiency and ease of implementation. The semantic-syntactic test set and phrase analogy task remain commonly used benchmarks for representation quality.