Tokenization (natural language processing)
Tokenization in natural language processing (NLP) is the process of converting raw text into a sequence of discrete units — tokens — that serve as the input to a machine learning model. The choice of tokenization method profoundly affects a model's vocabulary size, its ability to handle rare and out-of-vocabulary words, its performance on multilingual text, and even the effective cost of using an LLM API, since most providers price by token count.
Modern large language models almost universally use subword tokenization algorithms — primarily Byte Pair Encoding (BPE), WordPiece, and Unigram — which split text into units that fall between individual characters and whole words. This article covers the history, algorithms, implementation, and practical implications of tokenization in the context of neural language modelling.
History
Early approaches
The earliest NLP systems (1950s–1990s) used simple rule-based tokenization: split on whitespace and punctuation, possibly applying stemming or lemmatization. These word-level tokenizers create a fixed vocabulary of whole words. The approach suffers from a fundamental tension: a small vocabulary produces many out-of-vocabulary (OOV) tokens, while a large vocabulary inflates the embedding matrix and leaves rare words with too few training examples to be learned well.
Character-level tokenization, in which each character is a token, eliminates the OOV problem entirely but produces very long sequences, making it difficult for models to learn long-range dependencies. The character-level convolutional networks of Zhang et al. (2015) showed competitive results on text classification,[1] but character-level models in general struggled with generation tasks.
The subword revolution
The modern era of tokenization began with Sennrich, Haddow, and Birch's 2016 paper "Neural Machine Translation of Rare Words with Subword Units", which applied Byte Pair Encoding (BPE) — originally a data compression algorithm invented by Philip Gage in 1994 — to NLP.[2] The key insight was that common words should be kept whole, while rare words should be decomposed into meaningful subword units (e.g., "unhappiness" → "un" + "happi" + "ness"), allowing the model to generalise morphologically without needing every word form in its vocabulary.
Google's WordPiece algorithm (Schuster and Nakajima, 2012), originally developed for Japanese and Korean speech recognition and later adopted for BERT, uses a similar greedy merge strategy but selects merges that maximise the likelihood of the training data rather than simply choosing the most frequent pair.[3]
Kudo's Unigram model (2018), implemented in the SentencePiece library, takes the opposite approach: it starts with a large vocabulary and iteratively removes the tokens whose removal reduces the overall likelihood the least, re-estimating token probabilities with an expectation-maximisation (EM) algorithm.[4]
Algorithms
Byte Pair Encoding (BPE)
BPE is the most widely used tokenization algorithm in modern LLMs, used by the GPT family, LLaMA, and many others.
Training procedure:
- Start with a base vocabulary of individual characters (or bytes).
- Count the frequency of every adjacent pair of tokens in the training corpus.
- Merge the most frequent pair into a single new token.
- Repeat the counting and merging steps for a predetermined number of merges (typically 30,000–100,000).
The result is an ordered list of merge rules. At inference time, text is first split into characters, then the merge rules are applied greedily in the learned order.
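The training loop and the inference step described above can be sketched in a few lines of Python. This is a minimal illustration on a toy word list, not a production implementation: the function names `train_bpe`, `merge_pair`, and `encode` are hypothetical, words are treated in isolation, and ties between equally frequent pairs are broken by first occurrence, which is an assumption of this sketch.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn an ordered list of merge rules from a list of words."""
    # Each distinct word is stored as a tuple of characters with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties go to the pair seen first (sketch assumption).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Split a word into characters, then apply the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(words, num_merges=4)
print(merges)                    # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(encode("lowest", merges))  # ['low', 'est']
```

The word "lowest" never appears in the toy corpus, yet it is encoded as two learned subwords, which is the behaviour that motivates subword tokenization.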
Byte-level BPE: OpenAI's GPT-2 introduced a variant that operates on raw UTF-8 bytes rather than Unicode characters, ensuring that any text — regardless of language or encoding — can be tokenized without any OOV tokens.[5] This byte-level approach has become the standard for most modern LLMs.
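To make the byte-level idea concrete, the sketch below shows only the base layer: mapping text to UTF-8 byte values, so that the initial vocabulary has exactly 256 symbols. The helper names are hypothetical, and the merges that a real byte-level BPE tokenizer would learn on top of these bytes are omitted.

```python
def to_byte_tokens(text):
    """Map text to its UTF-8 byte values, the base symbols of byte-level BPE."""
    return list(text.encode("utf-8"))


def from_byte_tokens(tokens):
    """Invert to_byte_tokens; any valid UTF-8 byte sequence round-trips."""
    return bytes(tokens).decode("utf-8")


print(to_byte_tokens("cat"))          # [99, 97, 116]
print(to_byte_tokens("\u00e9"))       # [195, 169]  (e with acute accent)
print(to_byte_tokens("\U0001F642"))   # [240, 159, 153, 130]  (an emoji)
```

Every value lies in the range 0–255, so no input can fall outside the base vocabulary, at the price of several tokens per character for text outside the ASCII range.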
WordPiece
WordPiece is used by BERT and related encoder models. It differs from BPE in the merge selection criterion:
- BPE merges the pair with the highest frequency.
- WordPiece merges the pair that maximises the likelihood of the training data — effectively choosing the merge that most reduces the perplexity of a unigram language model over the vocabulary.
WordPiece marks sub-tokens that continue a word with the prefix "##" (e.g., "playing" → "play" + "##ing"), distinguishing word-internal from word-initial tokens.
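The difference in merge selection can be illustrated on a toy corpus. The sketch below assumes the commonly described scoring rule, the pair count divided by the product of the counts of its two parts; this formula is an assumption of the example, not something stated above, and the function names and corpus are hypothetical.

```python
from collections import Counter


def pair_statistics(corpus):
    """corpus maps a tuple of symbols to the frequency of that word."""
    symbol_counts, pair_counts = Counter(), Counter()
    for symbols, freq in corpus.items():
        for symbol in symbols:
            symbol_counts[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    return symbol_counts, pair_counts


def best_bpe_pair(corpus):
    """BPE criterion: the most frequent adjacent pair."""
    _, pair_counts = pair_statistics(corpus)
    return max(pair_counts, key=pair_counts.get)


def best_wordpiece_pair(corpus):
    """Assumed WordPiece-style criterion: count(ab) / (count(a) * count(b))."""
    symbol_counts, pair_counts = pair_statistics(corpus)

    def score(pair):
        a, b = pair
        return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

    return max(pair_counts, key=score)


# Toy corpus; "##" marks symbols that continue a word.
toy = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}
print(best_bpe_pair(toy))        # ('##u', '##g')
print(best_wordpiece_pair(toy))  # ('##g', '##s')
```

The frequent pair "##u" + "##g" wins under the frequency criterion, while the normalised score prefers "##g" + "##s" because "##s" almost never occurs without "##g" in front of it.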
Unigram (SentencePiece)
The Unigram model works top-down:
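- Start with a large seed vocabulary (e.g., all substrings up to a given length that appear in the training data).
- Estimate a probability for each token with the EM algorithm, then compute, for each token, the decrease in corpus log-likelihood if that token were removed.
- Remove a fraction of the tokens with the smallest impact (for example the lowest-scoring 10–20%), always keeping single characters so that any word can still be segmented.
- Repeat the estimation and pruning steps until the vocabulary reaches the target size.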
At inference time, the Unigram model considers all possible segmentations of a word and selects the one with the highest probability. This probabilistic formulation allows subword regularization — sampling different segmentations during training as a form of data augmentation, which has been shown to improve robustness.[4]
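Searching over all segmentations does not require enumerating them: a dynamic-programming (Viterbi-style) pass finds the best one in time quadratic in the word length. The sketch below assumes a toy vocabulary with made-up probabilities; `viterbi_segment` and `toy_probs` are hypothetical names, and the EM training that would produce real probabilities is not shown.

```python
import math


def viterbi_segment(word, log_probs):
    """Return (tokens, log-probability) of the most likely segmentation.

    log_probs maps each vocabulary token to its log-probability.
    """
    n = len(word)
    # best[i] = (score of the best segmentation of word[:i], start of last token)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("word cannot be segmented with this vocabulary")
    # Walk back through the stored split points to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], best[n][0]


# Hypothetical token probabilities for illustration only.
toy_probs = {"h": 0.1, "u": 0.1, "g": 0.1, "s": 0.1, "hu": 0.15, "ug": 0.2, "hug": 0.25}
toy_log_probs = {token: math.log(p) for token, p in toy_probs.items()}

tokens, score = viterbi_segment("hugs", toy_log_probs)
print(tokens)  # ['hug', 's']
```

Subword regularization would replace the final arg-max with sampling from the distribution over segmentations, which this sketch does not implement.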
SentencePiece (Kudo and Richardson, 2018) is a library that implements both BPE and Unigram in a language-agnostic way, treating the input as a raw stream of Unicode characters, with whitespace encoded as an ordinary symbol, and requiring no language-specific pre-tokenization. It is used by T5, ALBERT, the original LLaMA models, and many multilingual models.[6]
Practical considerations
Vocabulary size
Vocabulary size represents a fundamental tradeoff:
| Vocabulary size | Pros | Cons |
|---|---|---|
| Small (e.g., 8,000) | Smaller embedding matrix, better learning of rare tokens | Longer sequences, slower inference, loss of semantic chunking |
| Medium (e.g., 32,000–50,000) | Balance of sequence length and coverage; a common choice for many LLMs | Shares the drawbacks of both extremes to a lesser degree |
| Large (e.g., 100,000–256,000) | Shorter sequences, more whole-word tokens | Larger embedding matrix, rare tokens may be poorly learned |
Typical vocabulary sizes: GPT-2 used 50,257 tokens; GPT-4 uses roughly 100,000 (cl100k_base: 100,256 ordinary tokens plus a few special tokens); LLaMA 1 used 32,000; LLaMA 3 uses 128,256.
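The cost side of this tradeoff can be estimated directly, since an embedding matrix holds one vector per token. The sketch below multiplies three of the vocabulary sizes quoted above by a hidden size of 4,096. That hidden size is a hypothetical value chosen only to make the rows comparable, not the actual width of each model, and `embedding_parameters` with its `tied` flag is an illustrative helper.

```python
def embedding_parameters(vocab_size, d_model, tied=True):
    """Parameters in the input embedding, plus the output projection if untied."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * d_model


D_MODEL = 4096  # hypothetical hidden size, identical for every row

for name, vocab_size in [("GPT-2", 50_257), ("LLaMA 1", 32_000), ("LLaMA 3", 128_256)]:
    print(f"{name:8s} {embedding_parameters(vocab_size, D_MODEL):>12,d}")
# GPT-2     205,852,672
# LLaMA 1   131,072,000
# LLaMA 3   525,336,576
```

Under this assumption, moving from a 32,000-token to a 128,256-token vocabulary roughly quadruples the embedding parameters, and doubles them again if the input and output matrices are not shared.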
Multilingual challenges
Subword tokenizers trained predominantly on English text exhibit a fertility gap: the same content can require 2–5× more tokens in another language than in English. For example, a sentence that takes 10 tokens in English might take 30–40 tokens in Burmese or Khmer. This has direct cost and performance implications:
- Users of token-priced APIs pay more for non-English text.
- Models have a smaller effective context window for non-English languages.
- Text in non-Latin scripts tends to be fragmented to near-character level, losing the semantic benefits of subword chunking.
Multilingual models like mT5 and BLOOM addressed this with larger vocabularies (250,000+) and more balanced training corpora, but the problem remains an active area of research.
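One source of the gap can be seen without any trained tokenizer. When a byte-level tokenizer has learned few merges for a script, it falls back towards one token per UTF-8 byte, and scripts such as Burmese need three bytes per character. The sketch below measures that worst case; `utf8_fallback_tokens` is a hypothetical stand-in for a real tokenizer, so the figures are an upper bound rather than measured token counts.

```python
def utf8_fallback_tokens(text):
    """Worst case for byte-level BPE: no merges apply, one token per byte."""
    return list(text.encode("utf-8"))


def tokens_per_character(tokenize, text):
    """Average number of tokens that `tokenize` produces per character."""
    return len(tokenize(text)) / len(text)


english = "Myanmar"
burmese = "\u1019\u103c\u1014\u103a\u1019\u102c"  # the country's name in Burmese script

print(tokens_per_character(utf8_fallback_tokens, english))  # 1.0
print(tokens_per_character(utf8_fallback_tokens, burmese))  # 3.0
```

Any callable that maps a string to a list of tokens can be passed as `tokenize`, so the same helper could be used to compare real tokenizers on parallel text.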
Tokenization artefacts
Tokenization can introduce subtle artefacts that affect model behaviour:
- Leading whitespace: many tokenizers include the leading space as part of a token (" the" vs "the"), so the same word has different token representations depending on whether it follows a space, for example at the start of a text versus mid-sentence.
- Number representation: integers are often split inconsistently (e.g., "1000" might become "100" + "0" or "1" + "000"), which partly explains why LLMs struggle with arithmetic.
- Code and markup: programming languages and structured formats often tokenize inefficiently, with common syntax patterns split across multiple tokens.
Andrej Karpathy's 2024 video "Let's build the GPT Tokenizer" demonstrated many of these issues and became a widely viewed introduction to the practical implications of tokenization.[7]
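The first two artefacts arise before any merge is applied, in the pre-tokenization step that cuts text into chunks. The sketch below uses a simplified, ASCII-only regular expression written for this example as a stand-in for a GPT-2-style pattern; real patterns cover all Unicode letters and digits. `chunk_digits` is a hypothetical helper showing how capping digit runs at a fixed length produces groups that ignore place value.

```python
import re

# Simplified ASCII-only stand-in for a GPT-2-style pre-tokenization pattern:
# an optional leading space is attached to the word, number, or symbol run
# that follows it.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pretokenize(text):
    """Split text into chunks; merges are then learned within each chunk."""
    return PRETOKENIZE.findall(text)


def chunk_digits(number, max_len=3):
    """Split a digit string left to right into groups of at most max_len."""
    return re.findall(r"[0-9]{1,%d}" % max_len, number)


print(pretokenize("the cat saw the dog"))  # ['the', ' cat', ' saw', ' the', ' dog']
print(pretokenize("x = 1000"))             # ['x', ' =', ' 1000']
print(chunk_digits("1234567"))             # ['123', '456', '7']
```

The two occurrences of "the" land in different chunks ("the" and " the"), and the left-to-right digit groups of "1234567" do not line up with thousands separators.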
tiktoken and modern implementations
OpenAI's tiktoken library (2022) is a high-performance BPE implementation with a Rust core and Python bindings. Its documentation reports encoding speeds approximately 3–6× faster than a comparable open-source tokenizer. It defines the encoding schemes used by OpenAI's models:
- gpt2: 50,257 tokens, used by GPT-2
- r50k_base: 50,257 tokens, used by early GPT-3 models
- p50k_base: 50,281 tokens, used by Codex models
- cl100k_base: 100,277 tokens including special tokens (100,256 ordinary BPE tokens), used by GPT-3.5-turbo and GPT-4
- o200k_base: 200,019 tokens, used by GPT-4o
Byte-level approaches and tokenization-free models
A growing line of research explores bypassing learned tokenization entirely:
- ByT5 (Xue et al., 2022) operates directly on UTF-8 bytes, eliminating the tokenization step entirely. It achieves competitive results on many tasks despite the longer sequence lengths.
- MegaByte (Yu et al., 2023) splits the byte sequence into fixed-size patches and uses a hierarchical architecture: a large global model operates over patch representations, while a smaller local model predicts the bytes within each patch.
- Byte Latent Transformer (Meta, 2024) dynamically groups bytes into variable-length patches based on entropy, allocating more compute to complex regions of text.
These approaches promise to eliminate tokenization artefacts and improve multilingual fairness, but as of 2025 they have not yet displaced BPE in production frontier models.
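The simplest form of patching, grouping bytes into fixed-size blocks, can be sketched as follows. This shows only the input grouping step, under the assumption of a constant patch size; the models that consume the patches are not represented, and entropy-based grouping is omitted because it depends on a trained byte-level model. The function name is hypothetical.

```python
def fixed_size_patches(text, patch_size):
    """Group the UTF-8 bytes of `text` into consecutive fixed-size patches."""
    if patch_size < 1:
        raise ValueError("patch_size must be positive")
    data = text.encode("utf-8")
    # The final patch may be shorter when the length is not a multiple.
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]


print(fixed_size_patches("tokenization", 4))  # [b'toke', b'niza', b'tion']
print(fixed_size_patches("tokens", 4))        # [b'toke', b'ns']
```

With a patch size of 4, a sequence of N bytes yields about N/4 patches, which is how patch-based models keep the sequence seen by the larger component short.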
Impact on LLM behaviour
Tokenization is often called the "ugly stepchild" of NLP because the tokenizer is trained separately from the model and then frozen, making it a fixed preprocessing step in an otherwise end-to-end learned pipeline. Yet it has outsized effects:
- Arithmetic ability — the inconsistent tokenization of digits is a major contributor to LLMs' difficulty with multi-digit arithmetic.
- Spelling and character-level tasks — because the model never "sees" individual characters for common words, tasks like counting letters or reversing strings are difficult.
- Context efficiency — a model with a 128,000-token context window has a variable effective context in terms of text length, depending on the density of the language and script being used.
- Prompt engineering — understanding how text maps to tokens is essential for optimising prompts for cost and performance.
See also
- Natural language processing
- Large language model
- BERT
- Word embedding
- Attention (machine learning)
- Andrej Karpathy
References
- [1] Zhang, Xiang; Zhao, Junbo; LeCun, Yann (2015). "Character-level Convolutional Networks for Text Classification." Advances in Neural Information Processing Systems 28.
- [2] Sennrich, Rico; Haddow, Barry; Birch, Alexandra (2016). "Neural Machine Translation of Rare Words with Subword Units." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1715–1725.
- [3] Schuster, Mike; Nakajima, Kaisuke (2012). "Japanese and Korean voice search." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- [4] Kudo, Taku (2018). "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 66–75.
- [5] Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya (2019). "Language Models are Unsupervised Multitask Learners." OpenAI technical report.
- [6] Kudo, Taku; Richardson, John (2018). "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing." Proceedings of EMNLP 2018, pp. 66–71.
- [7] Karpathy, Andrej (2024). "Let's build the GPT Tokenizer." YouTube.