Tokenization

Overview

Tokenization is the foundational process in natural language processing that converts raw text into discrete units (tokens) that machine learning models can process. This seemingly simple transformation profoundly impacts model architecture, performance, efficiency, and capabilities. Tokenization determines the “atoms” of meaning that models learn to compose, directly influencing vocabulary size, sequence length, memory footprint, computational requirements, and cross-lingual effectiveness.

The field has evolved from naive character-level and word-level approaches to sophisticated subword algorithms that balance vocabulary size, semantic granularity, and compositional understanding. Modern large language models (GPT, BERT, T5) rely on subword tokenization—particularly Byte Pair Encoding (BPE), WordPiece, and Unigram—enabling them to handle rare words, morphological variations, and multilingual text efficiently.

Key technical aspects covered:

Evolution from character to word to subword tokenization
BPE algorithm mechanics and implementation
WordPiece, Unigram, and SentencePiece comparisons
Impact on model performance and efficiency
Special tokens, padding, and masking
Multilingual considerations and fairness
Out-of-vocabulary handling strategies
Implementation optimizations (tiktoken)

Evolution of Tokenization Approaches

Tokenization strategies have progressed through distinct paradigms, each addressing limitations of predecessors:

Character-Level Tokenization treats each character as atomic unit:

"hello" → ['h', 'e', 'l', 'l', 'o'] = 5 tokens
"你好" → ['你', '好'] = 2 tokens

Advantages:

Tiny vocabulary (~256 for ASCII, ~150K for Unicode)
Handles any text including typos and rare words
Language-agnostic by design
No out-of-vocabulary (OOV) problem

Disadvantages:

Very long sequences (5-10x longer than word-level)
Difficult to learn semantic relationships
High computational cost (more positions to process)
Character patterns don’t align with linguistic units

Word-Level Tokenization splits text at word boundaries:

"The quick brown fox" → ['The', 'quick', 'brown', 'fox'] = 4 tokens
"don't" → ['don', "'", 't'] or ['do', "n't"] = 2-3 tokens

Advantages:

Semantic units preserved
Shorter sequences than character-level
Aligns with linguistic intuition
Straightforward implementation

Disadvantages:

Massive vocabulary (100K-1M+ words)
Cannot handle OOV words (typos, names, compounds)
No morphological understanding (“happy” vs “unhappy” are unrelated)
Language-specific (requires language-dependent word boundary rules)
Inflectional languages explode vocabulary

Subword Tokenization decomposes words into meaningful fragments:

"unhappiness" → ['un', 'happiness'] = 2 tokens
"ChatGPT" → ['Chat', 'G', 'PT'] = 3 tokens
"hippopotamus" → ['hip', 'pop', 'otamus'] = 3 tokens

Advantages:

Balanced vocabulary (30K-200K tokens)
Handles rare/OOV words compositionally
Captures morphology (prefixes, suffixes, roots)
Language-agnostic (works on byte sequences)
Efficient sequence lengths

Disadvantages:

Requires training on corpus
Segmentation not always linguistically meaningful
Ambiguity in segmentation choices

The shift to subword tokenization emerged from recognition that the optimal “atom” of meaning sits between characters and full words, enabling models to generalize while maintaining efficiency.

Byte Pair Encoding (BPE) Algorithm

BPE adapted from data compression (Philip Gage, 1994) to NLP tokenization represents the dominant approach in modern LLMs:

Algorithm mechanics:

Phase 1: Training (vocabulary construction)

Initialization: Start with base vocabulary of all unique bytes (characters)

corpus = ["low", "lower", "newest", "widest"]
vocab = {'l', 'o', 'w', 'e', 'r', 'n', 's', 't'}

Iterative merging:

Iteration 1: Count adjacent pairs
  "lo" appears 2 times (low, lower)
  "ow" appears 2 times (low, lower)
  "we" appears 2 times (newer, widest)
  Most frequent: "lo" → merge into "lo"

Iteration 2: Update corpus, repeat
  "low" becomes "lo w"
  "lower" becomes "lo wer"
  Count new pairs...

Termination: Stop after N merges (vocabulary size = base + N)

Phase 2: Encoding (applying learned vocabulary)

def encode_bpe(text, merges, vocab):
    # Split into characters
    tokens = list(text)
    
    # Apply merges in learned order
    for merge in merges:
        pair = merge[0], merge[1]
        new_token = ''.join(pair)
        
        # Find and replace all occurrences
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i+1]) == pair:
                tokens[i:i+2] = [new_token]
            else:
                i += 1
    
    return tokens

Key insight: BPE discovers frequent subword patterns from data without linguistic rules. Algorithm has no concept of “morphemes” or “roots” but rediscovers them through frequency analysis.

Implementation considerations:

Efficient data structures: Use dictionaries/hash maps for O(1) pair lookups. Priority queues for O(log n) merge selection.

Special token handling: Add special tokens (<|endoftext|>, <|pad|>) to initial vocabulary, protecting from merging during training.

Word boundaries: Mark word boundaries (e.g., “Ġhello” where Ġ represents space) ensuring “hello” in “hello world” tokenizes differently than “unhello”.

Merge rules storage: Store ordered list of merge operations applied during encoding. Critical for reproducibility—different merge orders produce different tokenizations.

Example BPE training output:

Base vocab: {a, b, c, ..., z}
Merge 1: "th" (freq: 10,532)
Merge 2: "e_" (word-final e, freq: 8,921)  
Merge 3: "in" (freq: 8,103)
...
Merge 50,000: "undivorced" (freq: 2)
Final vocab size: 26 + 50,000 = 50,026 tokens

WordPiece, Unigram, and Algorithm Comparison

WordPiece (Google, 2016) modifies BPE merge criterion:

Merge criterion: Instead of frequency, maximizes likelihood of training data:

score(pair) = P(merged_token) / (P(token1) * P(token2))

Select merge that increases corpus probability most. This likelihood-based approach leads to different vocabulary than frequency-based BPE.

Special marker: Uses ## prefix for subwords not starting words:

"unhappiness" → ['un', '##hap', '##pi', '##ness']

This enables unambiguous reconstruction: remove ## markers and concatenate.

Use cases: BERT, DistilBERT, Electra use WordPiece for its likelihood optimization benefiting masked language modeling.

Unigram Language Model (Kudo, 2018) inverts BPE approach:

Training process:

Initialize: Start with large vocabulary (all words + common substrings)
Probabilistic model: Assign probability to each token based on corpus frequency
Pruning: Iteratively remove tokens minimally affecting overall likelihood
Termination: Stop at target vocabulary size

Encoding process: Unlike BPE’s greedy longest-match, Unigram finds tokenization with highest probability:

def encode_unigram(text, token_probs):
    # Dynamic programming to find most probable segmentation
    best_segmentations = [None] * (len(text) + 1)
    best_segmentations[0] = ([], 0.0)  # (tokens, log_prob)
    
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            subword = text[i:j]
            if subword in token_probs:
                prob = best_segmentations[i][1] + token_probs[subword]
                if best_segmentations[j] is None or prob > best_segmentations[j][1]:
                    best_segmentations[j] = (
                        best_segmentations[i][0] + [subword],
                        prob
                    )
    
    return best_segmentations[len(text)][0]

Multiple valid segmentations: Enables sampling during training for regularization (BPE-dropout equivalent).

Comparison summary:

Aspect	BPE	WordPiece	Unigram
Training direction	Bottom-up (merge)	Bottom-up (merge)	Top-down (prune)
Merge criterion	Frequency	Likelihood	Marginal likelihood
Encoding	Greedy longest-match	Greedy longest-match	Probabilistic best-path
Determinism	Deterministic	Deterministic	Can sample multiple paths
Special markers	None (or custom)	`##` for non-initial	None
Complexity	O(n²) training	O(n²) training	O(n³) encoding (DP)
Used by	GPT-2, GPT-3, RoBERTa	BERT, DistilBERT	XLNet, ALBERT, T5

SentencePiece (Google, 2018) is not a tokenization algorithm but a unified framework:

Key innovations:

Treats text as raw byte stream including spaces as regular characters
Language-agnostic: No pre-tokenization (no word boundaries)
Reversible: Lossless encoding/decoding
Supports BPE or Unigram as underlying algorithm

Unicode normalization: Applies NFKC by default ensuring semantic equivalents normalize consistently:

'︰' (U+FE30) → '..' (U+002E U+002E)
'ﬁ' (U+FB01 ligature) → 'fi' (U+0066 U+0069)

Custom normalization via TSV files enables domain-specific preprocessing.

Use cases: T5, XLNet, ALBERT, multilingual models where language-agnostic tokenization critical.

Impact on Model Performance and Efficiency

Tokenization directly determines fundamental model characteristics:

Vocabulary size trade-offs:

Small vocabulary (10K-30K):

✅ Fewer embedding parameters (10K × 768 = 7.7M parameters)
✅ Faster embedding lookups
✅ Less overfitting on rare tokens
❌ Longer sequences (more tokens per text)
❌ Less efficient for common phrases

Large vocabulary (100K-200K):

✅ Shorter sequences (fewer tokens per text)
✅ More efficient for frequent words/phrases
❌ More embedding parameters (200K × 768 = 153.6M parameters)
❌ Risk of overfitting on rare tokens
❌ Slower training convergence

Empirical sweet spot: 30K-100K for most models. GPT-2: 50K, GPT-3: 50K, GPT-4: ~100K, GPT-4o: ~200K.

Sequence length impact:

Context window economics:

1000-word document:
- Character-level: ~5000 tokens
- Word-level: ~1000 tokens  
- BPE: ~1300 tokens

With 8K token context limit:
- Character: fits ~1.6K words
- Word: fits 8K words
- BPE: fits ~6K words

Efficient tokenization maximizes effective context usage. Critical for long-document understanding tasks.

Computational complexity: Transformer attention is O(n²) in sequence length. Reducing tokens from 5000 to 1300 = 14.8x speedup in attention computation.

Multilingual efficiency:

Problem: Word-level tokenization biased toward morphologically simple languages (English). Languages with rich morphology (Turkish, Finnish) or logographic scripts (Chinese, Japanese) suffer:

English "running" = 1 word-level token
Turkish "koşturuluyormuş" (was being made to run) = 1 word-level token
  BUT requires massive vocabulary for all inflections

BPE solution:
English "running" = ['run', 'ning']
Turkish "koşturuluyormuş" = ['koş', 'tur', 'ul', 'uyor', 'muş']

Subword tokenization provides more equitable treatment across languages by decomposing morphologically complex forms.

Fairness metric (Rust et al., 2020): Token-per-word ratio across languages:

BPE optimized for English:
- English: 1.3 tokens/word
- Hindi: 3.2 tokens/word  
- Chinese: 2.5 tokens/word

Multilingual BPE:
- English: 1.5 tokens/word
- Hindi: 1.8 tokens/word
- Chinese: 1.6 tokens/word

Training tokenizer on balanced multilingual data critical for fairness.

Performance on specific tasks:

Named Entity Recognition: Word-level superior (preserves entity boundaries). Subword segmentation can split entities incorrectly.

Machine Translation: Subword excels (handles rare words through composition).

Sentiment Analysis: Word-level slightly better (sentiment often word-level).

Code Generation: BPE with code-specific training optimal (learns def, class, import as units).

Special Tokens, Padding, and Masking

Special tokens provide metadata and structural information:

Common special tokens:

[PAD] / <|pad|>: Padding for uniform sequence length (typically ID: 0)
[CLS] / <|startoftext|>: Sequence start, aggregation point for classification
[SEP] / <|endoftext|>: Separator between segments or sequence end
[MASK]: Masked token for training (BERT masked language modeling)
[UNK]: Unknown/out-of-vocabulary token
<|bos|>: Beginning of sequence
<|eos|>: End of sequence

Implementation:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
 
text = "Hello world"
tokens = tokenizer.tokenize(text)
# ['hello', 'world']
 
# Add special tokens
tokens_with_special = ['[CLS]'] + tokens + ['[SEP]']
# ['[CLS]', 'hello', 'world', '[SEP]']
 
# Convert to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens_with_special)
# [101, 7592, 2088, 102]

Padding strategies:

Fixed-length padding (pad to max length):

sequences = [
    [101, 7592, 2088, 102],           # "hello world"
    [101, 2023, 2003, 1037, 3231, 102]  # "this is a test"
]
 
# Pad to max_length=10
padded = [
    [101, 7592, 2088, 102, 0, 0, 0, 0, 0, 0],
    [101, 2023, 2003, 1037, 3231, 102, 0, 0, 0, 0]
]

Dynamic padding (pad to longest in batch):

# Batch 1: max_length=6
batch1 = [
    [101, 7592, 2088, 102, 0, 0],
    [101, 2023, 2003, 1037, 3231, 102]
]
 
# Batch 2: max_length=4 (shorter batch)
batch2 = [
    [101, 7592, 102, 0],
    [101, 2023, 102, 0]
]

Dynamic padding reduces wasted computation on padding tokens.

Attention masking:

Purpose: Prevent model from attending to padding tokens:

input_ids = [101, 7592, 2088, 102, 0, 0, 0, 0]
attention_mask = [1, 1, 1, 1, 0, 0, 0, 0]
#                 ↑  ↑  ↑  ↑  ↑ Padding tokens masked

In transformer attention computation:

attention_scores = Q @ K.T / sqrt(d_k)
attention_scores = attention_scores.masked_fill(attention_mask == 0, -inf)
attention_probs = softmax(attention_scores)

Setting padded positions to -inf ensures they receive zero attention weight after softmax.

Masked language modeling (BERT training):

Masking strategy:

Select 15% of tokens randomly
Of selected tokens:
- 80% replaced with [MASK]
- 10% replaced with random token
- 10% left unchanged

# Original: "The quick brown fox"
# Masked:   "The [MASK] brown [MASK]"
#           [101, 1996, 103, 2829, 103, 102]
 
# Model predicts original tokens at masked positions

This forces model to learn bidirectional context representations.

Critical implementation detail: Never allow model to attend to future positions during autoregressive generation. Causal masking prevents information leakage:

# Causal mask (lower triangular)
[[1, 0, 0, 0],
 [1, 1, 0, 0],
 [1, 1, 1, 0],
 [1, 1, 1, 1]]

Out-of-Vocabulary Handling and Challenges

OOV problem manifestation:

Word-level tokenization:

Vocabulary: {"the", "cat", "sat", "on", "mat"}
Input: "The cat sat on the rug"
Tokens: ["the", "cat", "sat", "on", "the", "[UNK]"]
                                            ↑ OOV

Model learns nothing about “rug” except that it’s unknown. Context lost.

Subword solution:

Input: "The cat sat on the rug"  
BPE tokens: ["The", " cat", " sat", " on", " the", " r", "ug"]

Model processes “rug” as composition of “r” + “ug”, leveraging learned subword representations.

Remaining challenges:

Domain shift: Tokenizer trained on general text, deployed on specialized domain (medical, legal):

General corpus: "coronary" → ['cor', 'onary'] (suboptimal)
Medical corpus: "coronary" → ['coronary'] (single token)

Solution: Domain-adaptive tokenization—retrain or fine-tune tokenizer on target domain.

Named entities: Proper nouns often segmented awkwardly:

"Schwarzenegger" → ['Sch', 'war', 'zen', 'eg', 'ger'] (5 tokens)

Each component learned separately, no holistic entity representation.

Solution: Add frequent entities to vocabulary or use character-level fallback for names.

Typographical errors: BPE helps but not perfectly:

"receive" (correct) → ['rece', 'ive']
"recieve" (typo) → ['rec', 'ieve'] or ['reci', 'eve']

Different segmentation = different representation. Model must learn error patterns.

Solution: BPE-dropout (subword regularization)—sample multiple segmentations during training, improving robustness.

Contractions and compounds:

"don't" → ['don', "'", 't'] or ['do', "n't"]
"New York" → ['New', ' York'] (should be single entity?)
"mother-in-law" → ['mother', '-', 'in', '-', 'law']

Tokenization destroys semantic units. Multi-word expressions lost.

Solution: Pre-tokenization rules, phrase mining, or explicit entity recognition before tokenization.

Unicode and emoji:

"😀" → ['<0xF0>', '<0x9F>', '<0x98>', '<0x80>'] (4 bytes)
"❤️" → ['<0xE2>', '<0x9D>', '<0xA4>', '<0xEF>', '<0xB8>', '<0x8F>']

Emoji decomposed to bytes, losing semantic meaning.

Solution: Expand vocabulary with common emoji or normalize Unicode before tokenization.

Implementation Optimizations: tiktoken

tiktoken (OpenAI) achieves 3-6x speedup over pure Python tokenizers through Rust implementation:

Core architecture:

Rust core (tiktoken-rs):

BPE merge algorithm in Rust
HashMap-based efficient pair lookups
Minimal memory allocations
Thread-safe concurrent processing

Python bindings (PyO3):

# Python interface
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("Hello world")  # Calls Rust

PyO3 provides zero-copy interop between Python and Rust.

Performance optimizations:

Thread-local regex compilation: Each thread maintains own compiled regex instance, eliminating lock contention:

thread_local! {
    static REGEX: Regex = Regex::new(pattern).unwrap();
}

Possessive quantifiers: Prevent regex backtracking:

// Instead of: r"(\w+)"
// Use: r"(\w++)"  // Possessive - no backtracking

BPE cache: Memoize frequent subword merges:

let mut cache: HashMap<(Vec<u8>, Vec<u8>), Vec<u8>> = HashMap::new();

Byte-level processing: Operate on raw bytes avoiding string allocations:

fn encode_bytes(text: &[u8]) -> Vec<usize> {
    // Process bytes directly, no UTF-8 validation overhead
}

Benchmark results (OpenAI internal):

tiktoken (Rust):     120,000 tokens/sec
pure Python BPE:      20,000 tokens/sec
Speedup:              6x

Educational tools: tiktoken includes transparent Python implementation for learning:

from tiktoken._educational import *
 
# Step-through visualization of BPE
show_bpe_merges("Hello world", vocab)

Rust performance for production, Python clarity for education.

Encoding variants:

Encoding	Vocab Size	Models	Special Tokens
`p50k_base`	~50K	GPT-3, Codex	`<\|endoftext\|>`
`cl100k_base`	~100K	GPT-4, GPT-3.5-turbo	`<\|endoftext\|>`, `<\|fim_prefix\|>`, `<\|fim_middle\|>`, `<\|fim_suffix\|>`
`o200k_base`	~200K	GPT-4o, GPT-4o-mini	`<\|endoftext\|>`, `<\|endofprompt\|>`

Larger vocabularies = more efficient tokenization (fewer tokens per text) but more embedding parameters.

Practical usage:

import tiktoken
 
# Load encoding
enc = tiktoken.get_encoding("cl100k_base")
 
# Encode
text = "tiktoken is fast!"
tokens = enc.encode(text)  # [83, 1609, 5963, 374, 5043, 0]
 
# Decode
reconstructed = enc.decode(tokens)  # "tiktoken is fast!"
 
# Count tokens (for API cost estimation)
num_tokens = len(enc.encode(long_text))
cost = (num_tokens / 1000) * price_per_1k_tokens

tiktoken serves as infrastructure enabling cost-efficient API usage and context management across LangChain, LlamaIndex, and all OpenAI-based applications.

Edmondo's Vault

Explorer

Tokenization

Tokenization

Overview

Evolution of Tokenization Approaches

Byte Pair Encoding (BPE) Algorithm

WordPiece, Unigram, and Algorithm Comparison

Impact on Model Performance and Efficiency

Special Tokens, Padding, and Masking

Out-of-Vocabulary Handling and Challenges

Implementation Optimizations: tiktoken

Graph View

Table of Contents

Backlinks