Subword tokenization is a text segmentation method that decomposes words into smaller, frequently occurring linguistic units called subwords or wordpieces. This approach enables a language model to handle a vast and open vocabulary—including complex compounds, morphological variants, and out-of-vocabulary words—using a fixed, manageable set of tokens. It is fundamental to modern transformer-based models like BERT and GPT, providing a balance between the granularity of characters and the semantic coherence of whole words.
Subword Tokenization

What is Subword Tokenization?
Subword tokenization is a core natural language processing technique that bridges the gap between word-level and character-level text representation.
Common algorithms include Byte-Pair Encoding (BPE), WordPiece, and the Unigram Language Model. Each learns a vocabulary from a corpus: BPE and WordPiece build it up by iteratively merging symbol pairs, while Unigram starts from a large candidate set and prunes it. For Tiny Language Models deployed on microcontrollers, subword tokenization is critical. It lets a compact model represent language with a small embedding table, directly reducing memory footprint, a key constraint in TinyML deployment. The model can also process novel terms without requiring a prohibitively large vocabulary.
Core Subword Tokenization Algorithms
Subword tokenization bridges the gap between word-level and character-level processing. These core algorithms determine how text is decomposed into a finite, manageable set of tokens for language models.
Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE) is a data compression algorithm adapted for tokenization. It starts with a base vocabulary of individual characters and iteratively merges the most frequent adjacent symbol pairs into new subword units.
- Process: Begins with characters, counts symbol pair frequencies, and merges the most common pair. Repeats for a set number of merges.
- Key Feature: Creates a vocabulary of variable-length subwords. Common words become single tokens (e.g., "the"), while rare words are split (e.g., "tokenization" → "token", "ization").
- Use Case: The original algorithm behind OpenAI's GPT models and many other early large language models.
WordPiece
WordPiece is a subword tokenization algorithm used by models like BERT. It is similar to BPE but uses a different, likelihood-based merging criterion.
- Process: Starts with a base vocabulary of characters. Instead of merging the most frequent pair, it merges the pair that maximizes the likelihood of the training data when added to the vocabulary.
- Key Feature: Tends to produce a more linguistically plausible segmentation than pure frequency-based BPE. It uses a `##` prefix to denote subwords that are not at the beginning of a word.
- Use Case: The standard tokenizer for the BERT family of models and their derivatives.
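To make the `##` convention concrete, here is a sketch of the greedy longest-match-first segmentation that BERT-style WordPiece tokenizers apply once a vocabulary exists. It covers only the tokenization step, not the likelihood-based training. The function name `wordpiece_tokenize` and the tiny vocabulary are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest match."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical vocabulary for illustration only.
toy_vocab = {"run", "##ning", "un", "##happi", "##ness"}
print(wordpiece_tokenize("running", toy_vocab))
print(wordpiece_tokenize("unhappiness", toy_vocab))
```

Because matching is greedy, the result depends on which pieces are in the vocabulary, not on any probability computed at tokenization time.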
Unigram Language Model
The Unigram Language Model tokenization algorithm starts with a large seed vocabulary and iteratively prunes it down, optimizing for overall corpus likelihood.
- Process: Assumes each subword occurs independently of the others (the unigram assumption), so the probability of a segmentation is the product of its subword probabilities. Begins with a large overcomplete vocabulary (e.g., all frequent substrings) and uses the Expectation-Maximization algorithm to estimate subword probabilities. The vocabulary is then shrunk step by step by removing the subwords whose removal reduces corpus likelihood the least.
- Key Feature: Can output multiple possible segmentations with probabilities for a single word, allowing for sampling-based segmentation. It is inherently probabilistic.
- Use Case: Used by the SentencePiece tool and models like ALBERT and T5.
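Given subword probabilities, the most probable segmentation of a word can be found with a Viterbi-style dynamic program. The sketch below assumes the probabilities have already been estimated. The function name `viterbi_segment` and the probability values are made up for illustration, and the EM training and pruning steps are not shown.

```python
import math


def viterbi_segment(word, log_probs):
    """Return (pieces, log_prob) for the most probable segmentation of `word`."""
    n = len(word)
    # best[i] = (best log-probability of word[:i], start index of its last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None, -math.inf  # the word cannot be built from the vocabulary
    # Follow the back-pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1], best[n][0]


# Hypothetical subword probabilities, stored as log-probabilities.
toy_probs = {"un": 0.1, "happiness": 0.0005, "happi": 0.02, "ness": 0.05,
             "u": 0.01, "n": 0.01}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}
pieces, score = viterbi_segment("unhappiness", toy_log_probs)
print(pieces, score)
```

Sampling-based segmentation uses the same probabilities but draws from the set of candidate segmentations instead of always returning the best one.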
Byte-Level BPE (BBPE)
Byte-Level BPE (BBPE) is a variant of BPE that operates directly on UTF-8 encoded bytes rather than Unicode characters. This gives a universal base vocabulary of only 256 symbols that can encode any text.
- Process: Text is encoded into a sequence of bytes (256 possible values). The standard BPE merging process is applied to these byte sequences.
- Key Feature: The vocabulary size is 256 base byte tokens plus one token per learned merge (plus any special tokens), so the base is small and fixed. It can represent any text without an `<UNK>` token, as out-of-vocabulary words are simply decomposed into more, shorter tokens, down to single bytes if necessary.
- Trade-off: The compact base vocabulary guarantees coverage, but non-Latin scripts, emoji, and rare words can expand into many tokens, which increases context window usage.
- Use Case: Used by OpenAI's GPT-2 and by RoBERTa to handle diverse text and emojis robustly.
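The byte-level starting point is easy to inspect with the standard library. The sketch below shows only the base representation before any merges are applied. The helper names `to_byte_tokens` and `from_byte_tokens` are hypothetical.

```python
def to_byte_tokens(text):
    """Encode text as a list of UTF-8 byte values, each in the range 0..255."""
    return list(text.encode("utf-8"))


def from_byte_tokens(byte_tokens):
    """Invert to_byte_tokens: rebuild the original string from byte values."""
    return bytes(byte_tokens).decode("utf-8")


# ASCII letters, an accented letter, and an emoji (written as an escape).
for sample in ["hello", "\u00e9", "\U0001F642"]:
    tokens = to_byte_tokens(sample)
    print(repr(sample), len(sample), "character(s) ->", len(tokens), "byte token(s)", tokens)
```

ASCII characters cost one byte each, while the accented letter takes two and the emoji four. This is the source of the longer sequences noted in the trade-off above: until merges cover them, such characters expand into several tokens.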
Algorithm Comparison & Trade-offs
Choosing a subword algorithm involves balancing vocabulary size, token sequence length, and linguistic coherence.
- Vocabulary Size vs. Sequence Length: A larger vocabulary leads to shorter, more efficient token sequences but enlarges the embedding table and risks overfitting to the training corpus. A very small vocabulary (such as raw bytes, or byte-level BPE with few merges) guarantees coverage but produces long sequences.
- Linguistic Coherence: BPE/WordPiece merges are greedy and can produce non-intuitive splits. The Unigram model's probabilistic approach can be more flexible.
- TinyML Consideration: For microcontroller deployment, a small, fixed vocabulary is critical. Techniques like vocabulary pruning (removing rare tokens) or using BBPE are essential to shrink the embedding matrix, which is often a model's largest layer.
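A quick back-of-the-envelope calculation shows why vocabulary size matters so much on a microcontroller. The configurations below are hypothetical and chosen only to illustrate the arithmetic. `embedding_bytes` is an illustrative helper, not part of any library.

```python
def embedding_bytes(vocab_size, embed_dim, bytes_per_weight):
    """Memory needed to store a vocab_size x embed_dim embedding table."""
    return vocab_size * embed_dim * bytes_per_weight


# Hypothetical configurations: (label, vocabulary size, embedding dim, bytes per weight)
configs = [
    ("30k vocab, dim 128, float32", 30_000, 128, 4),
    ("4k vocab, dim 64, int8", 4_000, 64, 1),
]
for label, vocab_size, embed_dim, width in configs:
    size = embedding_bytes(vocab_size, embed_dim, width)
    print(f"{label}: {size:,} bytes ({size / 1024:.0f} KiB)")
```

Under these assumed numbers the first table needs about 15 MB, while the second fits in 250 KiB, which is the difference between impossible and plausible on many microcontroller-class devices.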
How Subword Tokenization Works
Subword tokenization is a core text processing method for language models, especially critical for deployment on resource-constrained hardware.
Subword tokenization is a text segmentation method that breaks words into smaller, frequently occurring units called subwords or wordpieces. This approach enables a model to handle a vast vocabulary and out-of-vocabulary words with a fixed, manageable set of tokens, which is essential for memory-constrained tiny language models. Instead of a word-level vocabulary that grows unbounded, algorithms like Byte-Pair Encoding (BPE) or Unigram Language Model statistically learn a subword vocabulary from a training corpus, balancing granularity and token count.
For tiny machine learning deployment, this method provides significant efficiency gains. A compact subword vocabulary drastically reduces the size of the model's embedding layer, a major memory consumer. It also improves generalization by allowing the model to construct and understand novel words from known subword components. Libraries like SentencePiece implement these algorithms in a language-agnostic way, processing raw text as a sequence of Unicode characters to support multilingual models on edge devices.
Subword vs. Word vs. Character Tokenization
A technical comparison of three fundamental text tokenization strategies, highlighting their trade-offs in vocabulary size, out-of-vocabulary handling, sequence length, and suitability for Tiny Language Models (TinyLMs) and microcontroller deployment.
| Feature / Metric | Word Tokenization | Subword Tokenization | Character Tokenization |
|---|---|---|---|
| Core Unit | Entire words (e.g., 'running') | Frequent subword units (e.g., 'run', '##ning') | Individual characters (e.g., 'r', 'u', 'n') |
| Vocabulary Size | Very Large (50k-200k+) | Fixed & Manageable (e.g., 30k) | Tiny (< 1k) |
| Out-of-Vocabulary (OOV) Handling | ❌ Poor (requires fallback) | ✅ Excellent (via subword composition) | ✅ Perfect (no OOV) |
| Typical Sequence Length | Shortest | Moderate | Longest (5-10x word count) |
| Semantic Richness per Token | Highest | High | Lowest |
| Model Embedding Layer Size | Largest (major memory cost) | Controlled | Smallest |
| Suitability for Tiny Language Models | ❌ Poor (memory prohibitive) | ✅ Excellent (balanced efficiency) | ⚠️ Possible (but long sequences) |
| Common Algorithms / Examples | Whitespace splitting, spaCy | Byte-Pair Encoding (BPE), WordPiece, SentencePiece | Direct character mapping, UTF-8 encoding |
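The sequence-length row can be checked on any sentence by counting tokens at the two extremes. The sketch below uses plain whitespace splitting for words and one token per character (spaces included). The sample sentence and the helper `sequence_lengths` are arbitrary choices for illustration.

```python
def sequence_lengths(text):
    """Return (word-level token count, character-level token count) for `text`."""
    word_tokens = text.split()   # whitespace word tokenization
    char_tokens = list(text)     # one token per character, spaces included
    return len(word_tokens), len(char_tokens)


sample = "tokenizers split running text"
n_words, n_chars = sequence_lengths(sample)
print(f"{n_words} word tokens, {n_chars} character tokens "
      f"({n_chars / n_words:.2f}x longer)")
```

A subword tokenizer would land between these two counts, with the exact figure depending on the learned vocabulary.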
Frequently Asked Questions
Subword tokenization is a core text processing technique for modern language models, especially critical for deployment on memory-constrained devices. It balances the efficiency of word-level tokenization with the flexibility of character-level processing.
Subword tokenization is a text segmentation method that breaks words into smaller, frequently occurring units called subwords or wordpieces. It works by first training a tokenizer on a large corpus to learn a fixed-size vocabulary of common character sequences (subwords). At tokenization time, every input word is segmented into pieces from this vocabulary: BPE replays its learned merges, WordPiece greedily takes the longest matching piece, and Unigram picks the most probable segmentation. This lets the model handle a vast and open vocabulary with a limited token set. For example, the word "unhappiness" might be tokenized as ["un", "happ", "iness"] if those subwords are in the learned vocabulary.
This process sits between word-level tokenization (which suffers from large vocabularies and out-of-vocabulary words) and character-level tokenization (which produces long sequences and carries little meaning per token). Popular algorithms include Byte-Pair Encoding (BPE), WordPiece, and Unigram Language Model tokenization, each with slight variations in how the vocabulary is constructed and segmentation is performed.
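As an illustration of how a trained tokenizer segments a word it has never seen, the sketch below applies an ordered list of BPE merges to new input. The merge list is hypothetical, and replaying merges one at a time in learned order is a simplification of what optimized implementations do.

```python
def bpe_encode(word, merges):
    """Segment `word` by replaying BPE merges in the order they were learned."""
    symbols = list(word)  # start from individual characters
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)  # apply this merge here
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge list, in learned order.
toy_merges = [("s", "t"), ("e", "st"), ("l", "o"), ("lo", "w")]
print(bpe_encode("lowest", toy_merges))
print(bpe_encode("slot", toy_merges))
```

Here "lowest" collapses into two known subwords, while "slot" falls back to mostly single characters because few merges apply to it. Neither case needs an unknown-word token.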
Related Terms
Subword tokenization is a foundational component of modern NLP pipelines, enabling efficient vocabulary management. These related concepts detail the specific algorithms, libraries, and compression techniques that interact with or build upon subword methods, especially in resource-constrained TinyML environments.
Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE) is a data compression algorithm adapted for subword tokenization. It starts with a base vocabulary of individual characters and iteratively merges the most frequent adjacent symbol pairs into new subword units.
- Process: Begins with character-level tokens, then performs merge operations based on frequency in the training corpus.
- Key Feature: Creates a vocabulary of variable-length subwords, balancing character-level flexibility with word-level efficiency.
- Usage: The original algorithm behind OpenAI's GPT model tokenizers and a core option in the SentencePiece library.
WordPiece
WordPiece is a subword tokenization algorithm similar to BPE, used prominently in Google's BERT family of models. Instead of merging the most frequent pairs, it merges pairs that maximize the likelihood of the training data when added to the vocabulary.
- Process: Employs a greedy likelihood maximization strategy for merges.
- Distinction: Often produces subwords that are more linguistically plausible prefixes, stems, and suffixes compared to pure frequency-based BPE.
- Handling Unknowns: Uses a `[UNK]` token for out-of-vocabulary sequences that cannot be broken down.
Unigram Language Model Tokenization
Unigram Language Model Tokenization is a probabilistic subword method that starts with a large vocabulary and iteratively removes the least valuable subwords to shrink to a target size.
- Process: Begins with all possible substrings (or a large seed vocabulary) and uses a unigram language model to score the likelihood of the corpus under different vocabularies.
- Key Feature: Provides a probability for each subword, allowing for multiple possible tokenizations of a single word, from which the most probable can be selected.
- Advantage: Offers a principled probabilistic framework for vocabulary optimization, as implemented in SentencePiece.
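The probabilistic framework can be written down compactly. The notation below is introduced here for illustration: a segmentation $\mathbf{x} = (x_1, \dots, x_M)$ of an input $X$, a vocabulary $\mathcal{V}$ with subword probabilities $p(x)$, and the set $S(X)$ of candidate segmentations of $X$.

```latex
\begin{align}
  P(\mathbf{x}) &= \prod_{i=1}^{M} p(x_i),
    \qquad \sum_{x \in \mathcal{V}} p(x) = 1 \\
  \mathbf{x}^{*} &= \operatorname*{arg\,max}_{\mathbf{x} \in S(X)} P(\mathbf{x}) \\
  \mathcal{L} &= \sum_{s=1}^{|D|} \log \sum_{\mathbf{x} \in S(X^{(s)})} P(\mathbf{x})
\end{align}
```

The first line is the unigram assumption, the second selects the most probable tokenization, and the third is the corpus log-likelihood over training sentences $X^{(1)}, \dots, X^{(|D|)}$ that is maximized when estimating $p(x)$ and that guides which subwords to remove.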
Vocabulary Pruning
Vocabulary Pruning is a model compression technique that reduces the size of a language model's embedding layer by removing rarely used tokens from its vocabulary.
- Mechanism: Tokens with the lowest frequency in a target domain or task dataset are identified and eliminated.
- Impact: Directly shrinks the model's parameter count (the embedding matrix is often one of the largest components) and memory footprint.
- TinyML Synergy: Often applied after subword tokenization to create an ultra-compact, domain-specific vocabulary for microcontroller deployment, trading broad linguistic coverage for extreme efficiency.
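The mechanism can be sketched as a frequency count followed by an ID remapping. This is a simplified illustration: the function names, the plain-list embedding table, and the choice to fold every removed token into a single unknown ID are all assumptions.

```python
from collections import Counter


def prune_vocab(token_id_sequences, embeddings, keep, unk_id=0):
    """Keep the `keep` most frequent token IDs (always including `unk_id`)."""
    counts = Counter(t for seq in token_id_sequences for t in seq)
    # Most frequent IDs in the target-domain data, with the unknown ID reserved first.
    frequent = [t for t, _ in counts.most_common() if t != unk_id]
    kept = [unk_id] + frequent[: keep - 1]
    old_to_new = {old: new for new, old in enumerate(kept)}
    # Copy only the surviving rows of the embedding table, in the new ID order.
    new_embeddings = [embeddings[old] for old in kept]
    return old_to_new, new_embeddings


def remap(seq, old_to_new, unk_id=0):
    """Translate old token IDs to new ones; removed tokens become the unknown ID."""
    return [old_to_new.get(t, old_to_new[unk_id]) for t in seq]


# Hypothetical 5-token vocabulary with 1-dimensional embeddings.
toy_embeddings = [[0.0], [1.0], [2.0], [3.0], [4.0]]
domain_data = [[1, 2, 2, 3], [2, 3, 3, 3, 4]]
mapping, pruned = prune_vocab(domain_data, toy_embeddings, keep=3)
print(mapping, pruned)
print(remap([1, 2, 3, 4], mapping))
```

With a byte-level or character-complete base vocabulary, removed tokens could instead be re-segmented into smaller surviving pieces rather than mapped to an unknown ID.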
Token Embeddings
Token Embeddings are dense, continuous vector representations of tokens (words or subwords) learned by a language model. They map discrete token IDs from the vocabulary into a high-dimensional semantic space.
- Function: Form the first layer of a transformer model, converting tokenized input into a numerical format the network can process.
- Subword Connection: Subword tokenization creates a manageable set of tokens, each with its own learned embedding. This allows the model to construct representations for unseen words by combining the embeddings of their constituent subwords.
- Compression Target: The embedding lookup table is a primary target for quantization and pruning in TinyML to reduce its memory and computational cost.
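One common way to compress the lookup table is symmetric int8 quantization with one scale per row. The sketch below uses plain Python lists to stay dependency-free. The function names and the toy table are hypothetical, and real deployments would typically rely on a framework's quantization tooling.

```python
def quantize_row(row):
    """Quantize one embedding row to int8 values plus a float scale."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    q = [round(v / scale) for v in row]            # integers in [-127, 127]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float values from a quantized row."""
    return [v * scale for v in q]


def lookup(token_ids, quantized_table):
    """Embedding lookup: fetch and dequantize the row for each token ID."""
    return [dequantize_row(*quantized_table[t]) for t in token_ids]


# Hypothetical 3-token table with 3-dimensional embeddings.
toy_table = [[0.5, -1.0, 0.25], [0.0, 0.0, 0.0], [0.1, 0.2, -0.3]]
quantized_table = [quantize_row(row) for row in toy_table]
print(lookup([0, 2], quantized_table))
```

Each weight shrinks from four bytes (float32) to one, at the cost of a rounding error of at most half a scale step per value.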

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.