Tokenization
Overview
Tokenization is the foundational process in natural language processing that converts raw text into discrete units (tokens) that machine learning models can process. This seemingly simple transformation profoundly impacts model architecture, performance, efficiency, and capabilities. Tokenization determines the “atoms” of meaning that models learn to compose, directly influencing vocabulary size, sequence length, memory footprint, computational requirements, and cross-lingual effectiveness.
The field has evolved from naive character-level and word-level approaches to sophisticated subword algorithms that balance vocabulary size, semantic granularity, and compositional understanding. Modern large language models (GPT, BERT, T5) rely on subword tokenization—particularly Byte Pair Encoding (BPE), WordPiece, and Unigram—enabling them to handle rare words, morphological variations, and multilingual text efficiently.
Key technical aspects covered:
- Evolution from character to word to subword tokenization
- BPE algorithm mechanics and implementation
- WordPiece, Unigram, and SentencePiece comparisons
- Impact on model performance and efficiency
- Special tokens, padding, and masking
- Multilingual considerations and fairness
- Out-of-vocabulary handling strategies
- Implementation optimizations (tiktoken)
Evolution of Tokenization Approaches
Tokenization strategies have progressed through distinct paradigms, each addressing limitations of predecessors:
Character-Level Tokenization treats each character as atomic unit:
"hello" → ['h', 'e', 'l', 'l', 'o'] = 5 tokens
"你好" → ['你', '好'] = 2 tokens
Advantages:
- Tiny vocabulary (~256 for ASCII, ~150K for Unicode)
- Handles any text including typos and rare words
- Language-agnostic by design
- No out-of-vocabulary (OOV) problem
Disadvantages:
- Very long sequences (5-10x longer than word-level)
- Difficult to learn semantic relationships
- High computational cost (more positions to process)
- Character patterns don’t align with linguistic units
Word-Level Tokenization splits text at word boundaries:
"The quick brown fox" → ['The', 'quick', 'brown', 'fox'] = 4 tokens
"don't" → ['don', "'", 't'] or ['do', "n't"] = 2-3 tokens
Advantages:
- Semantic units preserved
- Shorter sequences than character-level
- Aligns with linguistic intuition
- Straightforward implementation
Disadvantages:
- Massive vocabulary (100K-1M+ words)
- Cannot handle OOV words (typos, names, compounds)
- No morphological understanding (“happy” vs “unhappy” are unrelated)
- Language-specific (requires language-dependent word boundary rules)
- Inflectional languages explode vocabulary
Subword Tokenization decomposes words into meaningful fragments:
"unhappiness" → ['un', 'happiness'] = 2 tokens
"ChatGPT" → ['Chat', 'G', 'PT'] = 3 tokens
"hippopotamus" → ['hip', 'pop', 'otamus'] = 3 tokens
Advantages:
- Balanced vocabulary (30K-200K tokens)
- Handles rare/OOV words compositionally
- Captures morphology (prefixes, suffixes, roots)
- Language-agnostic (works on byte sequences)
- Efficient sequence lengths
Disadvantages:
- Requires training on corpus
- Segmentation not always linguistically meaningful
- Ambiguity in segmentation choices
The shift to subword tokenization emerged from recognition that the optimal “atom” of meaning sits between characters and full words, enabling models to generalize while maintaining efficiency.
Byte Pair Encoding (BPE) Algorithm
BPE adapted from data compression (Philip Gage, 1994) to NLP tokenization represents the dominant approach in modern LLMs:
Algorithm mechanics:
Phase 1: Training (vocabulary construction)
- Initialization: Start with base vocabulary of all unique bytes (characters)
corpus = ["low", "lower", "newest", "widest"]
vocab = {'l', 'o', 'w', 'e', 'r', 'n', 's', 't'}- Iterative merging:
Iteration 1: Count adjacent pairs
"lo" appears 2 times (low, lower)
"ow" appears 2 times (low, lower)
"we" appears 2 times (newer, widest)
Most frequent: "lo" → merge into "lo"
Iteration 2: Update corpus, repeat
"low" becomes "lo w"
"lower" becomes "lo wer"
Count new pairs...
- Termination: Stop after N merges (vocabulary size = base + N)
Phase 2: Encoding (applying learned vocabulary)
def encode_bpe(text, merges, vocab):
# Split into characters
tokens = list(text)
# Apply merges in learned order
for merge in merges:
pair = merge[0], merge[1]
new_token = ''.join(pair)
# Find and replace all occurrences
i = 0
while i < len(tokens) - 1:
if (tokens[i], tokens[i+1]) == pair:
tokens[i:i+2] = [new_token]
else:
i += 1
return tokensKey insight: BPE discovers frequent subword patterns from data without linguistic rules. Algorithm has no concept of “morphemes” or “roots” but rediscovers them through frequency analysis.
Implementation considerations:
Efficient data structures: Use dictionaries/hash maps for O(1) pair lookups. Priority queues for O(log n) merge selection.
Special token handling: Add special tokens (<|endoftext|>, <|pad|>) to initial vocabulary, protecting from merging during training.
Word boundaries: Mark word boundaries (e.g., “Ġhello” where Ġ represents space) ensuring “hello” in “hello world” tokenizes differently than “unhello”.
Merge rules storage: Store ordered list of merge operations applied during encoding. Critical for reproducibility—different merge orders produce different tokenizations.
Example BPE training output:
Base vocab: {a, b, c, ..., z}
Merge 1: "th" (freq: 10,532)
Merge 2: "e_" (word-final e, freq: 8,921)
Merge 3: "in" (freq: 8,103)
...
Merge 50,000: "undivorced" (freq: 2)
Final vocab size: 26 + 50,000 = 50,026 tokens
WordPiece, Unigram, and Algorithm Comparison
WordPiece (Google, 2016) modifies BPE merge criterion:
Merge criterion: Instead of frequency, maximizes likelihood of training data:
score(pair) = P(merged_token) / (P(token1) * P(token2))
Select merge that increases corpus probability most. This likelihood-based approach leads to different vocabulary than frequency-based BPE.
Special marker: Uses ## prefix for subwords not starting words:
"unhappiness" → ['un', '##hap', '##pi', '##ness']
This enables unambiguous reconstruction: remove ## markers and concatenate.
Use cases: BERT, DistilBERT, Electra use WordPiece for its likelihood optimization benefiting masked language modeling.
Unigram Language Model (Kudo, 2018) inverts BPE approach:
Training process:
- Initialize: Start with large vocabulary (all words + common substrings)
- Probabilistic model: Assign probability to each token based on corpus frequency
- Pruning: Iteratively remove tokens minimally affecting overall likelihood
- Termination: Stop at target vocabulary size
Encoding process: Unlike BPE’s greedy longest-match, Unigram finds tokenization with highest probability:
def encode_unigram(text, token_probs):
# Dynamic programming to find most probable segmentation
best_segmentations = [None] * (len(text) + 1)
best_segmentations[0] = ([], 0.0) # (tokens, log_prob)
for i in range(len(text)):
for j in range(i + 1, len(text) + 1):
subword = text[i:j]
if subword in token_probs:
prob = best_segmentations[i][1] + token_probs[subword]
if best_segmentations[j] is None or prob > best_segmentations[j][1]:
best_segmentations[j] = (
best_segmentations[i][0] + [subword],
prob
)
return best_segmentations[len(text)][0]Multiple valid segmentations: Enables sampling during training for regularization (BPE-dropout equivalent).
Comparison summary:
| Aspect | BPE | WordPiece | Unigram |
|---|---|---|---|
| Training direction | Bottom-up (merge) | Bottom-up (merge) | Top-down (prune) |
| Merge criterion | Frequency | Likelihood | Marginal likelihood |
| Encoding | Greedy longest-match | Greedy longest-match | Probabilistic best-path |
| Determinism | Deterministic | Deterministic | Can sample multiple paths |
| Special markers | None (or custom) | ## for non-initial | None |
| Complexity | O(n²) training | O(n²) training | O(n³) encoding (DP) |
| Used by | GPT-2, GPT-3, RoBERTa | BERT, DistilBERT | XLNet, ALBERT, T5 |
SentencePiece (Google, 2018) is not a tokenization algorithm but a unified framework:
Key innovations:
- Treats text as raw byte stream including spaces as regular characters
- Language-agnostic: No pre-tokenization (no word boundaries)
- Reversible: Lossless encoding/decoding
- Supports BPE or Unigram as underlying algorithm
Unicode normalization: Applies NFKC by default ensuring semantic equivalents normalize consistently:
'︰' (U+FE30) → '..' (U+002E U+002E)
'fi' (U+FB01 ligature) → 'fi' (U+0066 U+0069)
Custom normalization via TSV files enables domain-specific preprocessing.
Use cases: T5, XLNet, ALBERT, multilingual models where language-agnostic tokenization critical.
Impact on Model Performance and Efficiency
Tokenization directly determines fundamental model characteristics:
Vocabulary size trade-offs:
Small vocabulary (10K-30K):
- ✅ Fewer embedding parameters (10K × 768 = 7.7M parameters)
- ✅ Faster embedding lookups
- ✅ Less overfitting on rare tokens
- ❌ Longer sequences (more tokens per text)
- ❌ Less efficient for common phrases
Large vocabulary (100K-200K):
- ✅ Shorter sequences (fewer tokens per text)
- ✅ More efficient for frequent words/phrases
- ❌ More embedding parameters (200K × 768 = 153.6M parameters)
- ❌ Risk of overfitting on rare tokens
- ❌ Slower training convergence
Empirical sweet spot: 30K-100K for most models. GPT-2: 50K, GPT-3: 50K, GPT-4: ~100K, GPT-4o: ~200K.
Sequence length impact:
Context window economics:
1000-word document:
- Character-level: ~5000 tokens
- Word-level: ~1000 tokens
- BPE: ~1300 tokens
With 8K token context limit:
- Character: fits ~1.6K words
- Word: fits 8K words
- BPE: fits ~6K words
Efficient tokenization maximizes effective context usage. Critical for long-document understanding tasks.
Computational complexity: Transformer attention is O(n²) in sequence length. Reducing tokens from 5000 to 1300 = 14.8x speedup in attention computation.
Multilingual efficiency:
Problem: Word-level tokenization biased toward morphologically simple languages (English). Languages with rich morphology (Turkish, Finnish) or logographic scripts (Chinese, Japanese) suffer:
English "running" = 1 word-level token
Turkish "koşturuluyormuş" (was being made to run) = 1 word-level token
BUT requires massive vocabulary for all inflections
BPE solution:
English "running" = ['run', 'ning']
Turkish "koşturuluyormuş" = ['koş', 'tur', 'ul', 'uyor', 'muş']
Subword tokenization provides more equitable treatment across languages by decomposing morphologically complex forms.
Fairness metric (Rust et al., 2020): Token-per-word ratio across languages:
BPE optimized for English:
- English: 1.3 tokens/word
- Hindi: 3.2 tokens/word
- Chinese: 2.5 tokens/word
Multilingual BPE:
- English: 1.5 tokens/word
- Hindi: 1.8 tokens/word
- Chinese: 1.6 tokens/word
Training tokenizer on balanced multilingual data critical for fairness.
Performance on specific tasks:
Named Entity Recognition: Word-level superior (preserves entity boundaries). Subword segmentation can split entities incorrectly.
Machine Translation: Subword excels (handles rare words through composition).
Sentiment Analysis: Word-level slightly better (sentiment often word-level).
Code Generation: BPE with code-specific training optimal (learns def, class, import as units).
Special Tokens, Padding, and Masking
Special tokens provide metadata and structural information:
Common special tokens:
[PAD]/<|pad|>: Padding for uniform sequence length (typically ID: 0)[CLS]/<|startoftext|>: Sequence start, aggregation point for classification[SEP]/<|endoftext|>: Separator between segments or sequence end[MASK]: Masked token for training (BERT masked language modeling)[UNK]: Unknown/out-of-vocabulary token<|bos|>: Beginning of sequence<|eos|>: End of sequence
Implementation:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Hello world"
tokens = tokenizer.tokenize(text)
# ['hello', 'world']
# Add special tokens
tokens_with_special = ['[CLS]'] + tokens + ['[SEP]']
# ['[CLS]', 'hello', 'world', '[SEP]']
# Convert to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens_with_special)
# [101, 7592, 2088, 102]Padding strategies:
Fixed-length padding (pad to max length):
sequences = [
[101, 7592, 2088, 102], # "hello world"
[101, 2023, 2003, 1037, 3231, 102] # "this is a test"
]
# Pad to max_length=10
padded = [
[101, 7592, 2088, 102, 0, 0, 0, 0, 0, 0],
[101, 2023, 2003, 1037, 3231, 102, 0, 0, 0, 0]
]Dynamic padding (pad to longest in batch):
# Batch 1: max_length=6
batch1 = [
[101, 7592, 2088, 102, 0, 0],
[101, 2023, 2003, 1037, 3231, 102]
]
# Batch 2: max_length=4 (shorter batch)
batch2 = [
[101, 7592, 102, 0],
[101, 2023, 102, 0]
]Dynamic padding reduces wasted computation on padding tokens.
Attention masking:
Purpose: Prevent model from attending to padding tokens:
input_ids = [101, 7592, 2088, 102, 0, 0, 0, 0]
attention_mask = [1, 1, 1, 1, 0, 0, 0, 0]
# ↑ ↑ ↑ ↑ ↑ Padding tokens maskedIn transformer attention computation:
attention_scores = Q @ K.T / sqrt(d_k)
attention_scores = attention_scores.masked_fill(attention_mask == 0, -inf)
attention_probs = softmax(attention_scores)Setting padded positions to -inf ensures they receive zero attention weight after softmax.
Masked language modeling (BERT training):
Masking strategy:
- Select 15% of tokens randomly
- Of selected tokens:
- 80% replaced with
[MASK] - 10% replaced with random token
- 10% left unchanged
- 80% replaced with
# Original: "The quick brown fox"
# Masked: "The [MASK] brown [MASK]"
# [101, 1996, 103, 2829, 103, 102]
# Model predicts original tokens at masked positionsThis forces model to learn bidirectional context representations.
Critical implementation detail: Never allow model to attend to future positions during autoregressive generation. Causal masking prevents information leakage:
# Causal mask (lower triangular)
[[1, 0, 0, 0],
[1, 1, 0, 0],
[1, 1, 1, 0],
[1, 1, 1, 1]]Out-of-Vocabulary Handling and Challenges
OOV problem manifestation:
Word-level tokenization:
Vocabulary: {"the", "cat", "sat", "on", "mat"}
Input: "The cat sat on the rug"
Tokens: ["the", "cat", "sat", "on", "the", "[UNK]"]
↑ OOV
Model learns nothing about “rug” except that it’s unknown. Context lost.
Subword solution:
Input: "The cat sat on the rug"
BPE tokens: ["The", " cat", " sat", " on", " the", " r", "ug"]
Model processes “rug” as composition of “r” + “ug”, leveraging learned subword representations.
Remaining challenges:
Domain shift: Tokenizer trained on general text, deployed on specialized domain (medical, legal):
General corpus: "coronary" → ['cor', 'onary'] (suboptimal)
Medical corpus: "coronary" → ['coronary'] (single token)
Solution: Domain-adaptive tokenization—retrain or fine-tune tokenizer on target domain.
Named entities: Proper nouns often segmented awkwardly:
"Schwarzenegger" → ['Sch', 'war', 'zen', 'eg', 'ger'] (5 tokens)
Each component learned separately, no holistic entity representation.
Solution: Add frequent entities to vocabulary or use character-level fallback for names.
Typographical errors: BPE helps but not perfectly:
"receive" (correct) → ['rece', 'ive']
"recieve" (typo) → ['rec', 'ieve'] or ['reci', 'eve']
Different segmentation = different representation. Model must learn error patterns.
Solution: BPE-dropout (subword regularization)—sample multiple segmentations during training, improving robustness.
Contractions and compounds:
"don't" → ['don', "'", 't'] or ['do', "n't"]
"New York" → ['New', ' York'] (should be single entity?)
"mother-in-law" → ['mother', '-', 'in', '-', 'law']
Tokenization destroys semantic units. Multi-word expressions lost.
Solution: Pre-tokenization rules, phrase mining, or explicit entity recognition before tokenization.
Unicode and emoji:
"😀" → ['<0xF0>', '<0x9F>', '<0x98>', '<0x80>'] (4 bytes)
"❤️" → ['<0xE2>', '<0x9D>', '<0xA4>', '<0xEF>', '<0xB8>', '<0x8F>']
Emoji decomposed to bytes, losing semantic meaning.
Solution: Expand vocabulary with common emoji or normalize Unicode before tokenization.
Implementation Optimizations: tiktoken
tiktoken (OpenAI) achieves 3-6x speedup over pure Python tokenizers through Rust implementation:
Core architecture:
Rust core (tiktoken-rs):
- BPE merge algorithm in Rust
- HashMap-based efficient pair lookups
- Minimal memory allocations
- Thread-safe concurrent processing
Python bindings (PyO3):
# Python interface
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("Hello world") # Calls RustPyO3 provides zero-copy interop between Python and Rust.
Performance optimizations:
Thread-local regex compilation: Each thread maintains own compiled regex instance, eliminating lock contention:
thread_local! {
static REGEX: Regex = Regex::new(pattern).unwrap();
}Possessive quantifiers: Prevent regex backtracking:
// Instead of: r"(\w+)"
// Use: r"(\w++)" // Possessive - no backtrackingBPE cache: Memoize frequent subword merges:
let mut cache: HashMap<(Vec<u8>, Vec<u8>), Vec<u8>> = HashMap::new();Byte-level processing: Operate on raw bytes avoiding string allocations:
fn encode_bytes(text: &[u8]) -> Vec<usize> {
// Process bytes directly, no UTF-8 validation overhead
}Benchmark results (OpenAI internal):
tiktoken (Rust): 120,000 tokens/sec
pure Python BPE: 20,000 tokens/sec
Speedup: 6x
Educational tools: tiktoken includes transparent Python implementation for learning:
from tiktoken._educational import *
# Step-through visualization of BPE
show_bpe_merges("Hello world", vocab)Rust performance for production, Python clarity for education.
Encoding variants:
| Encoding | Vocab Size | Models | Special Tokens |
|---|---|---|---|
p50k_base | ~50K | GPT-3, Codex | <|endoftext|> |
cl100k_base | ~100K | GPT-4, GPT-3.5-turbo | <|endoftext|>, <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|> |
o200k_base | ~200K | GPT-4o, GPT-4o-mini | <|endoftext|>, <|endofprompt|> |
Larger vocabularies = more efficient tokenization (fewer tokens per text) but more embedding parameters.
Practical usage:
import tiktoken
# Load encoding
enc = tiktoken.get_encoding("cl100k_base")
# Encode
text = "tiktoken is fast!"
tokens = enc.encode(text) # [83, 1609, 5963, 374, 5043, 0]
# Decode
reconstructed = enc.decode(tokens) # "tiktoken is fast!"
# Count tokens (for API cost estimation)
num_tokens = len(enc.encode(long_text))
cost = (num_tokens / 1000) * price_per_1k_tokenstiktoken serves as infrastructure enabling cost-efficient API usage and context management across LangChain, LlamaIndex, and all OpenAI-based applications.