Word embedding
A word embedding is a learned representation of text in which words are mapped to dense, real-valued vectors in a continuous vector space, typically of 50–1024 dimensions. Words that appear in similar contexts are mapped to nearby points, capturing semantic and syntactic relationships in a form that neural networks can process. Word embeddings are a foundational component of modern natural language processing (NLP) and were a key stepping stone toward the transformer architecture and large language models.
The core insight behind word embeddings is the distributional hypothesis, articulated by linguist John Rupert Firth in 1957: "You shall know a word by the company it keeps."[1] Words that co-occur in similar contexts (e.g., "cat" and "dog" both appear near "pet," "fur," "veterinarian") receive similar vector representations, even though the model is never explicitly told their meanings.
History
Pre-neural representations
Before word embeddings, NLP systems represented words as one-hot vectors — binary vectors of dimension equal to the vocabulary size (typically 50,000–500,000), with a single 1 at the word's index and 0s elsewhere. This representation treats every pair of words as equally dissimilar (all one-hot vectors are orthogonal), discarding all information about semantic relationships.
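The orthogonality of one-hot vectors can be checked directly. The following is a minimal sketch, assuming a hypothetical four-word vocabulary; the helper names `one_hot` and `dot` are illustrative only.

```python
def one_hot(index, vocab_size):
    """Return a one-hot vector: a single 1 at `index`, 0 elsewhere."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy vocabulary; real vocabularies are far larger.
vocab = ["cat", "dog", "car", "bank"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Distinct words always have dot product 0, so "cat" is no closer
# to "dog" than it is to "car".
cat_dog = dot(vectors["cat"], vectors["dog"])
cat_car = dot(vectors["cat"], vectors["car"])
cat_cat = dot(vectors["cat"], vectors["cat"])
print(cat_dog, cat_car, cat_cat)
```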
Latent Semantic Analysis (LSA; Deerwester et al., 1990) was an early attempt to learn dense representations by applying singular value decomposition (SVD) to a term–document co-occurrence matrix, projecting words into a lower-dimensional space where semantically related terms cluster together.[2] However, LSA was computationally expensive and did not scale well to very large vocabularies or corpora.
Neural language model embeddings (2003)
The 2003 paper "A Neural Probabilistic Language Model" by Yoshua Bengio and colleagues proposed learning word representations as part of a neural network language model. Each word was assigned a learnable feature vector, and the model predicted the next word in a sequence given the concatenated vectors of the preceding words.[3] This demonstrated that useful word representations could emerge from language modelling, but training was slow and the approach saw limited adoption at the time.
Collobert and Weston (2008) showed that a single set of pre-trained word embeddings could improve performance across multiple NLP tasks (POS tagging, NER, chunking, semantic role labelling), anticipating the transfer learning paradigm that would later dominate the field.[4]
Word2Vec (2013)
The breakthrough came with Word2Vec, introduced by Tomáš Mikolov and colleagues at Google in two papers in 2013.[5][6] Word2Vec offered two architectures:
Continuous Bag-of-Words (CBOW)
CBOW predicts a target word from its surrounding context words. Given a window of context words (e.g., "the cat ___ on the"), the model averages their embedding vectors and feeds the average, which acts as a linear hidden layer with no non-linearity, to an output layer that scores every vocabulary word as a candidate for the missing word. CBOW is faster to train than skip-gram and works well for frequent words.
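A minimal sketch of the CBOW forward pass, assuming a toy five-word vocabulary, randomly initialised weights, and a full softmax output (practical implementations replace the full softmax with hierarchical softmax or negative sampling). The names `W_in`, `W_out`, and `cbow_forward` are illustrative and not taken from the original Word2Vec code.

```python
import math
import random

random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
dim = 4  # toy embedding size, chosen for illustration

# Input (context) embeddings and output (target) weights, randomly initialised.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

def cbow_forward(context_words):
    """Average the context embeddings, then score every vocabulary word."""
    ids = [word_to_id[w] for w in context_words]
    hidden = [sum(W_in[i][d] for i in ids) / len(ids) for d in range(dim)]
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in W_out]
    # Softmax turns the scores into a probability distribution over the vocabulary.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = cbow_forward(["the", "cat", "on", "the"])
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

Training would adjust `W_in` and `W_out` so that the probability assigned to the true missing word ("sat" in this example) increases.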
Skip-gram
Skip-gram inverts the task: given a target word, it predicts the surrounding context words. For each word in the corpus, the model generates training pairs of (target, context) within a sliding window. Skip-gram performs better on rare words and small datasets.
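The generation of (target, context) pairs can be sketched as follows. This is a simplified illustration that assumes a fixed window size and whitespace tokenisation; the function name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    print(target, "->", context)
```

Each pair becomes one training example in which the model is asked to predict the context word from the target word.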
Both architectures use a shallow neural network (single hidden layer) and are trained with either hierarchical softmax or negative sampling to make training tractable on large vocabularies. Negative sampling, which trains the model to distinguish true context pairs from randomly sampled noise pairs, became the standard approach due to its efficiency and strong empirical results.
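The negative sampling objective for a single training pair can be sketched as below. The vectors are hand-picked toy values, and the function name `negative_sampling_loss` is illustrative; a real implementation would sample the noise words from a frequency-based distribution and update the vectors by gradient descent.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(target_vec, context_vec, noise_vecs):
    """Loss for one true (target, context) pair plus a few sampled noise words."""
    # Reward a high score for the true pair ...
    loss = -math.log(sigmoid(dot(target_vec, context_vec)))
    # ... and a low score for each noise pair.
    for noise_vec in noise_vecs:
        loss -= math.log(sigmoid(-dot(target_vec, noise_vec)))
    return loss

# Hypothetical toy vectors, chosen by hand for illustration.
target = [1.0, 0.5]
true_context = [0.9, 0.6]
noise = [[-0.8, -0.4], [-0.5, -0.9]]

good = negative_sampling_loss(target, true_context, noise)
# Swapping a noise word into the "true" slot should give a larger loss.
bad = negative_sampling_loss(target, noise[0], [true_context, noise[1]])
print(good, bad)
```

Because the loss involves only the true context word and a handful of noise words, each update touches a few vectors rather than the whole vocabulary.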
Algebraic properties
Word2Vec's most celebrated result was the emergence of linear algebraic relationships between word vectors:
- <math>\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}</math>
These word analogies demonstrated that the learned vector space captured relational meaning — not just similarity but structured semantic relationships including gender (man/woman), tense (walking/walked), country/capital (France/Paris), and comparative forms (big/bigger). This property was not designed into the architecture but emerged from the training objective.
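Analogy queries of this kind are usually answered by vector arithmetic followed by a nearest-neighbour search under cosine similarity. The sketch below uses hand-made three-dimensional vectors that were constructed so the analogy holds; it illustrates the procedure and is not evidence about real embeddings. The function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Toy vectors (assumed axes: royalty, gender, food), invented for illustration.
vectors = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

# "man is to king as woman is to ?"
answer = analogy("man", "king", "woman", vectors)
print(answer)
```

Excluding the three query words from the candidates is common practice in analogy evaluation, since the nearest vector to the query is otherwise often one of the inputs.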
GloVe (2014)
GloVe (Global Vectors for Word Representation) was developed by Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford.[7] Unlike Word2Vec, which learns from local context windows, GloVe directly factorises the global word–word co-occurrence matrix.
The key insight is that ratios of co-occurrence probabilities encode meaning. If ice co-occurs frequently with solid but rarely with gas, while steam shows the opposite pattern, then the ratio P(solid | ice) / P(solid | steam) is large and the ratio P(gas | ice) / P(gas | steam) is small, which captures the semantic distinction between the two words. For a word related to both (water) or to neither (fashion), the ratio is close to 1. GloVe's objective function is designed to produce vectors whose dot products, together with per-word bias terms, approximate the logarithm of the co-occurrence counts.
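For reference, the weighted least-squares objective stated in the GloVe paper can be written as follows, where <math>X_{ij}</math> is the number of times word j occurs in the context of word i, <math>w_i</math> and <math>\tilde{w}_j</math> are the word and context vectors, <math>b_i</math> and <math>\tilde{b}_j</math> are bias terms, and V is the vocabulary size:
- <math>J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2</math>
- <math>f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}</math>
The weighting function f reduces the contribution of rare co-occurrences and caps the influence of very frequent ones; the paper reports using <math>x_{\max} = 100</math> and <math>\alpha = 3/4</math>.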
GloVe achieved results comparable to or slightly better than Word2Vec on analogy and similarity benchmarks, while making the relationship between the training objective and matrix factorisation explicit. Pre-trained GloVe vectors (trained on Common Crawl, 840 billion tokens, 2.2 million vocabulary) became a standard starting point for NLP systems from 2014 to 2018.
FastText (2016)
FastText, developed by Facebook AI Research (Bojanowski et al., 2017), extended Word2Vec by representing each word as a bag of character n-grams.[8] Each word is first wrapped in the boundary symbols < and >, which mark its beginning and end. With n = 3, the word "where" is represented as the sum of the embeddings for the character n-grams <wh, whe, her, ere, and re>, plus the embedding of the whole-word sequence <where>.
This approach offered two major advantages:
- Morphological awareness: Related word forms (run, running, runner) share character n-grams and therefore receive similar embeddings, even without explicit morphological analysis.
- Out-of-vocabulary handling: Unknown words (misspellings, neologisms, rare technical terms) can be represented by summing the embeddings of their constituent n-grams, rather than being mapped to a single "unknown" vector.
FastText proved particularly effective for morphologically rich languages (Turkish, Finnish, Arabic) where the vocabulary of distinct word forms is much larger than in English.
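The n-gram decomposition described above can be sketched in a few lines. This is a minimal illustration that defaults to n = 3 to match the "where" example (the FastText paper extracts n-grams for n from 3 to 6); the function name `char_ngrams` is hypothetical and is not part of the FastText library.

```python
def char_ngrams(word, min_n=3, max_n=3):
    """Return the character n-grams of `word`, wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

where_grams = char_ngrams("where")
print(where_grams)

# Related word forms share n-grams, which is why their embeddings end up similar.
shared = set(char_ngrams("running")) & set(char_ngrams("runner"))
print(sorted(shared))
```

An out-of-vocabulary word is handled the same way: it is split into n-grams, and the vectors of those n-grams that were seen during training are summed.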
Contextual embeddings (2018)
A fundamental limitation of Word2Vec, GloVe, and FastText is that each word receives a single, static embedding regardless of context. The word "bank" gets the same vector whether it means a financial institution or a river bank. This fails to capture polysemy — the fact that most common words have multiple meanings.
ELMo
ELMo (Embeddings from Language Models; Peters et al., 2018) addressed this by generating contextual word representations using a bidirectional LSTM language model.[9] Instead of looking up a fixed vector, ELMo runs the input sentence through a pre-trained bidirectional LSTM and produces a context-dependent embedding for each token by combining the hidden states from all layers.
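In the ELMo paper, the combination of layers is a task-specific weighted sum: the layer weights are softmax-normalised and the result is multiplied by a scalar scale factor. The sketch below assumes hand-written toy hidden states for a single token; the names `combine_layers`, `layer_weights`, and `gamma` are illustrative.

```python
import math

def softmax(weights):
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layers(layer_states, layer_weights, gamma=1.0):
    """Weighted sum of one token's hidden states, one state per layer."""
    s = softmax(layer_weights)  # normalised weight for each layer
    dim = len(layer_states[0])
    return [
        gamma * sum(s[j] * layer_states[j][d] for j in range(len(layer_states)))
        for d in range(dim)
    ]

# Hypothetical hidden states for one token from three layers (toy 2-d vectors).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal raw weights give each layer a weight of 1/3.
combined = combine_layers(states, layer_weights=[0.0, 0.0, 0.0])
print(combined)
```

In a downstream task, the layer weights and the scale factor would be learned together with the task model, so that each task can emphasise the layers most useful to it.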
ELMo achieved substantial improvements on six NLP benchmarks when added as input features to existing task-specific architectures, demonstrating the value of contextual representations.
Transformer-based embeddings
BERT (Devlin et al., 2018) and GPT (Radford et al., 2018) pushed contextual embeddings further by replacing LSTMs with the transformer architecture. BERT's bidirectional self-attention produces embeddings that are conditioned on the entire input sequence in both directions at every layer, capturing richer contextual information than ELMo's separate left-to-right and right-to-left LSTMs.
In modern large language models, the concept of a "word embedding" has evolved: the initial embedding layer maps tokens to vectors (similar in spirit to Word2Vec), but the transformer's successive layers produce increasingly context-dependent representations. The "embedding" of a token at the final layer is a function of the entire input sequence.
Technical properties
Dimensionality
Typical embedding dimensions range from 50 (lightweight GloVe) to 300 (standard Word2Vec/GloVe) to 768 (BERT-base) to 12,288 (the largest GPT-3 model). Higher dimensions can capture more information but require more data to train effectively and more memory at inference time.
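The memory cost of a static embedding table grows linearly with both vocabulary size and dimension. As a rough worked example, assuming 32-bit floats and a table with 2.2 million words at 300 dimensions (the function name is illustrative):

```python
def embedding_table_bytes(vocab_size, dim, bytes_per_value=4):
    """Memory for a dense vocab_size x dim table (default: 32-bit floats)."""
    return vocab_size * dim * bytes_per_value

# Assumed sizes: 2.2 million words, 300 dimensions.
glove_bytes = embedding_table_bytes(2_200_000, 300)
print(f"{glove_bytes / 1e9:.2f} GB")  # 2.64 GB

# The same vocabulary at 50 dimensions needs one sixth of the memory.
small_bytes = embedding_table_bytes(2_200_000, 50)
print(f"{small_bytes / 1e9:.2f} GB")  # 0.44 GB
```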
Training data and bias
Word embeddings reflect the statistical patterns of their training data, including human biases. Bolukbasi et al. (2016) demonstrated that Word2Vec embeddings trained on Google News encode gender stereotypes: the vectors complete the analogy "man is to computer programmer as woman is to ..." with "homemaker", and many occupation words lie markedly closer to "he" or to "she".[10] Debiasing methods (projecting out gender subspaces, data augmentation, contrastive training) have been developed but remain an active research area.
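The idea of projecting out a gender subspace can be illustrated with a single direction. The sketch below is a simplification: it estimates the direction from one hand-made word pair, whereas published methods estimate a subspace from several pairs. All vectors are invented toy values, and the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    """Scale `v` to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def remove_component(vec, direction):
    """Subtract from `vec` its projection onto `direction`."""
    unit = normalise(direction)
    scale = dot(vec, unit)
    return [x - scale * u for x, u in zip(vec, unit)]

# Toy vectors invented for illustration; the first axis carries gender.
he = [1.0, 0.2, 0.1]
she = [-1.0, 0.2, 0.1]
programmer = [0.4, 0.7, 0.3]

gender_direction = [h - s for h, s in zip(he, she)]
debiased = remove_component(programmer, gender_direction)

print(dot(programmer, gender_direction))  # non-zero before projection
print(dot(debiased, gender_direction))    # zero after projection
```

After the projection, the word vector has no component along the estimated gender direction, while its other components are unchanged.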
Evaluation
Word embeddings are evaluated on:
- Intrinsic tasks: Word analogy (Google analogy dataset), word similarity (SimLex-999, WordSim-353), and categorisation tasks.
- Extrinsic tasks: Performance when used as input features for downstream NLP tasks (NER, sentiment analysis, parsing).
The gap between intrinsic and extrinsic performance (embeddings that score well on analogies do not always help on downstream tasks) has led the field to focus increasingly on extrinsic evaluation and task-specific fine-tuning.
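Word similarity benchmarks are typically scored by comparing the model's cosine similarities with human ratings using a rank correlation. The following is a minimal sketch with invented toy vectors and made-up ratings; it is not data from SimLex-999 or WordSim-353, and the rank function assumes no tied values.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def ranks(values):
    """Rank positions (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical toy embeddings and made-up human ratings on a 0-10 scale.
vectors = {
    "cat": [0.9, 0.1], "dog": [0.8, 0.3],
    "car": [0.1, 0.9], "truck": [0.2, 0.8],
}
pairs = [("cat", "dog"), ("car", "truck"), ("cat", "car")]
human_scores = [8.5, 8.0, 1.5]

model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]
rho = spearman(model_scores, human_scores)
print(rho)
```

A correlation close to 1 means the embedding ranks word pairs in nearly the same order as the human annotators.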
Legacy and significance
Word embeddings were a pivotal development in the history of NLP and deep learning:
- They demonstrated that unsupervised pre-training on large text corpora could produce useful representations — the same principle later scaled up by BERT, GPT, and modern LLMs.
- They showed that continuous vector representations can outperform discrete symbolic representations on many language tasks, establishing the representational foundation for neural NLP.
- They popularised the transfer learning paradigm in NLP: train representations once on a large corpus, then reuse them across many tasks.
- The analogy results (king − man + woman ≈ queen) captured public imagination and helped communicate the power of neural approaches beyond the research community.
While static word embeddings have been largely superseded by contextual representations from transformers in state-of-the-art systems, they remain widely used in resource-constrained settings, as input features for non-transformer models, and as a teaching tool for understanding distributed representations.
See also
- Natural language processing
- Transformer (machine learning)
- BERT
- Attention (machine learning)
- Deep learning
- Large language model
- Recurrent neural network
References
- [1] Firth, J. R. (1957). "A synopsis of linguistic theory, 1930–1955." Studies in Linguistic Analysis, 1–32.
- [2] Deerwester, S., et al. (1990). "Indexing by latent semantic analysis." Journal of the American Society for Information Science, 41(6), 391–407.
- [3] Bengio, Y., et al. (2003). "A Neural Probabilistic Language Model." Journal of Machine Learning Research, 3, 1137–1155.
- [4] Collobert, R. and Weston, J. (2008). "A Unified Architecture for Natural Language Processing." ICML 2008.
- [5] Mikolov, T., et al. (2013). "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781.
- [6] Mikolov, T., et al. (2013). "Distributed Representations of Words and Phrases and their Compositionality." NeurIPS 2013.
- [7] Pennington, J., Socher, R., and Manning, C. D. (2014). "GloVe: Global Vectors for Word Representation." EMNLP 2014.
- [8] Bojanowski, P., et al. (2017). "Enriching Word Vectors with Subword Information." Transactions of the ACL, 5, 135–146.
- [9] Peters, M. E., et al. (2018). "Deep contextualized word representations." NAACL 2018.
- [10] Bolukbasi, T., et al. (2016). "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings." NeurIPS 2016.