Glossary

Subword Tokenization

Subword tokenization is a text processing method that decomposes words into smaller, frequently occurring units to handle large vocabularies and out-of-vocabulary words efficiently.

NLP PREPROCESSING

What is Subword Tokenization?

Subword tokenization is a core natural language processing technique that bridges the gap between word-level and character-level text representation.

Subword tokenization is a text segmentation method that decomposes words into smaller, frequently occurring linguistic units called subwords or wordpieces. This approach enables a language model to handle a vast and open vocabulary—including complex compounds, morphological variants, and out-of-vocabulary words—using a fixed, manageable set of tokens. It is fundamental to modern transformer-based models like BERT and GPT, providing a balance between the granularity of characters and the semantic coherence of whole words.

Common algorithms include Byte-Pair Encoding (BPE), WordPiece, and the Unigram Language Model. BPE and WordPiece learn a vocabulary from a corpus by iteratively merging character sequences, while Unigram starts from a large candidate set and prunes it. For Tiny Language Models deployed on microcontrollers, subword tokenization is critical: it lets a compact model represent language with a small embedding table, directly reducing memory footprint, a key constraint in TinyML deployment. The model can then process novel terms without requiring a prohibitively large vocabulary.

FOUNDATIONAL METHODS

Core Subword Tokenization Algorithms

Subword tokenization bridges the gap between word-level and character-level processing. These core algorithms determine how text is decomposed into a finite, manageable set of tokens for language models.

Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE) is a data compression algorithm adapted for tokenization. It starts with a base vocabulary of individual characters and iteratively merges the most frequent adjacent symbol pairs into new subword units.

Process: Begins with characters, counts symbol pair frequencies, and merges the most common pair. Repeats for a set number of merges.
Key Feature: Creates a vocabulary of variable-length subwords. Common words become single tokens (e.g., "the"), while rare words are split (e.g., "tokenization" → "token", "ization").
Use Case: The tokenizer family behind OpenAI's GPT models (GPT-2 onward use the byte-level variant) and many other large language models.
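
The merge loop described above can be sketched in a few lines of Python. This is a toy illustration rather than a production tokenizer: the function name learn_bpe and the {word: count} input format are assumptions made for this example, and real implementations add pre-tokenization, end-of-word markers, and far faster pair counting.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merges from a {word: count} dict (toy sketch)."""
    # Every word starts out as a tuple of single characters.
    corpus = {tuple(word): count for word, count in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += count
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus


# Hypothetical toy corpus: word -> frequency.
merges, corpus = learn_bpe({"hug": 10, "pug": 5, "hugs": 5}, num_merges=2)
print(merges)  # [('u', 'g'), ('h', 'ug')]
```

With these made-up counts, "ug" is merged first because the pair (u, g) occurs 20 times, and "hug" becomes a single token after the second merge while "pug" stays split as "p" + "ug".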

WordPiece

WordPiece is a subword tokenization algorithm used by models like BERT. It is similar to BPE but uses a different, likelihood-based merging criterion.

Process: Starts with a base vocabulary of characters. Instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data when added to the vocabulary.
Key Feature: In its commonly described form, the criterion normalizes a pair's frequency by the frequencies of its parts, so it favors pairs that rarely occur apart. This can yield more linguistically plausible segments than pure frequency-based BPE. It uses a ## prefix to mark subwords that do not begin a word.
Use Case: The standard tokenizer for the BERT family of models and their derivatives.
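
To make the difference from BPE concrete, the sketch below scores candidate pairs with the formula count(ab) / (count(a) * count(b)), which is a commonly cited approximation of the WordPiece criterion. That formula, the function name wordpiece_pair_scores, and the toy counts are assumptions for illustration; the original training code is not public, so this should not be read as the exact procedure.

```python
from collections import Counter


def wordpiece_pair_scores(word_freqs):
    """Score adjacent pairs as count(ab) / (count(a) * count(b)) (assumed formula)."""
    symbol_counts = Counter()
    pair_counts = Counter()
    for word, count in word_freqs.items():
        # Non-initial characters carry the ## continuation prefix.
        symbols = [word[0]] + ["##" + ch for ch in word[1:]]
        for symbol in symbols:
            symbol_counts[symbol] += count
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += count
    scores = {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }
    return scores, pair_counts


# Hypothetical toy corpus: word -> frequency.
toy = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
scores, pair_counts = wordpiece_pair_scores(toy)
print(max(pair_counts, key=pair_counts.get))  # ('##u', '##g'): what BPE would merge
print(max(scores, key=scores.get))            # ('##g', '##s'): what this score prefers
```

On this toy data the most frequent pair is (##u, ##g), but the normalized score prefers (##g, ##s), because ##s never appears without a preceding ##g.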

Unigram Language Model

The Unigram Language Model tokenization algorithm starts with a large seed vocabulary and iteratively prunes it down, optimizing for overall corpus likelihood.

Process: Assumes each subword in a segmentation occurs independently (a unigram assumption), so the probability of a segmentation is the product of its subword probabilities. Begins with a large overcomplete vocabulary (e.g., all frequent substrings) and uses the Expectation-Maximization algorithm to estimate subword probabilities. The vocabulary is then shrunk by removing the subwords whose removal reduces corpus likelihood the least.
Key Feature: Can output multiple possible segmentations with probabilities for a single word, allowing for sampling-based segmentation. It is inherently probabilistic.
Use Case: Used by the SentencePiece tool and models like ALBERT and T5.
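
Given a trained set of subword probabilities, the most probable segmentation of a word can be found with a Viterbi-style dynamic program. The sketch below covers only that decoding step, not EM training or pruning; the function name viterbi_segment and the probabilities are made up for illustration.

```python
import math


def viterbi_segment(word, log_probs):
    """Return the most probable segmentation of word and its log-probability."""
    n = len(word)
    # best_score[i] is the best log-probability of any segmentation of word[:i].
    best_score = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError(f"{word!r} cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1], best_score[n]


# Hypothetical subword probabilities; single characters act as a fallback.
probs = {"un": 0.1, "happi": 0.05, "ness": 0.08}
probs.update({ch: 0.01 for ch in "unhapies"})
log_probs = {piece: math.log(p) for piece, p in probs.items()}

pieces, score = viterbi_segment("unhappiness", log_probs)
print(pieces)  # ['un', 'happi', 'ness']
```

Sampling-based segmentation, mentioned above, would draw from the distribution over all valid paths instead of always taking the best one.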

SentencePiece

SentencePiece is not an algorithm itself but a language-agnostic, open-source implementation and toolkit for subword tokenization. It provides a unified API for BPE and Unigram algorithms.

Key Feature: Treats input text as a raw sequence of Unicode characters, eliminating the need for pre-tokenization (whitespace splitting) or language-specific logic. This makes it ideal for languages without clear word boundaries (e.g., Chinese, Japanese).
Process: Directly trains on raw sentences, applying the chosen algorithm (Unigram is default) to the character sequence.
Use Case: A widely used tool for training custom tokenizers for multilingual models and research, used by models such as T5, Llama 2, and Mistral 7B.
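
SentencePiece can skip pre-tokenization because it treats whitespace as an ordinary symbol, written as the meta character "▁" (U+2581), which also makes detokenization lossless. The sketch below mimics that idea in plain Python. It does not call the sentencepiece library, and the helper names to_symbols and detokenize are hypothetical.

```python
WS = "\u2581"  # the "▁" meta symbol that stands in for a space


def to_symbols(text):
    """Turn raw text into the symbol sequence a subword learner would see."""
    # A marker is also prepended so that word-initial pieces look the same
    # at the start of a sentence as in the middle of one.
    return list(WS + text.replace(" ", WS))


def detokenize(pieces):
    """Invert the mapping: join the pieces, restore spaces, drop the leading marker."""
    return "".join(pieces).replace(WS, " ")[1:]


print(to_symbols("Hi you"))  # ['▁', 'H', 'i', '▁', 'y', 'o', 'u']
print(detokenize(["\u2581Hello", "\u2581wor", "ld"]))  # Hello world
```

Because spaces are just symbols, the same code path handles text with no spaces at all, such as Japanese or Chinese sentences.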

Byte-Level BPE (BBPE)

Byte-Level BPE (BBPE) is a variant of BPE that operates directly on UTF-8 encoded bytes rather than Unicode characters. This yields a small, universal base vocabulary.

Process: Text is encoded into a sequence of bytes (256 possible values). The standard BPE merging process is applied to these byte sequences.
Key Feature: The base vocabulary is fixed at 256 byte values, and the final vocabulary size is 256 plus the number of merges (plus any special tokens). It can represent any text without an <UNK> token, because unseen words and characters simply decompose into more byte-level tokens.
Trade-off: The small base guarantees coverage, but text that the learned merges cover poorly (non-Latin scripts, emoji, rare words) expands into longer sequences, which consumes more of the context window.
Use Case: Used by OpenAI's GPT-2 and RoBERTa to handle diverse text and emojis robustly.
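
The coverage guarantee follows directly from UTF-8: every string maps to a sequence of integers in the range 0 to 255, so the 256 base tokens are always enough. A minimal sketch of that base layer, before any merges are applied (the helper names are hypothetical):

```python
def to_byte_tokens(text):
    """Encode text as a list of byte values (0-255), the BBPE base tokens."""
    return list(text.encode("utf-8"))


def from_byte_tokens(ids):
    """Decode a list of byte values back into text."""
    return bytes(ids).decode("utf-8")


for sample in ["hi", "é", "🙂"]:
    ids = to_byte_tokens(sample)
    # ASCII letters take 1 byte each, "é" takes 2, and the emoji takes 4.
    print(sample, len(ids), ids)
    assert from_byte_tokens(ids) == sample
```

The output also shows the trade-off noted above: a single emoji costs four base tokens unless the learned merges happen to combine them.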

Algorithm Comparison & Trade-offs

Choosing a subword algorithm involves balancing vocabulary size, token sequence length, and linguistic coherence.

Vocabulary Size vs. Sequence Length: A larger vocabulary yields shorter token sequences but enlarges the embedding table and leaves rare tokens with little training signal. A minimal vocabulary (such as the 256-byte base of BBPE with few merges) guarantees coverage but produces long sequences.
Linguistic Coherence: BPE and WordPiece build their vocabularies greedily, one merge at a time, which can produce splits that do not match morpheme boundaries. The Unigram model scores whole segmentations probabilistically, which can be more flexible.
TinyML Consideration: For microcontroller deployment, a small, fixed vocabulary is critical. Techniques like vocabulary pruning (removing rare tokens) or using BBPE help shrink the embedding matrix, which in small models is often the largest single block of parameters.
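
The memory impact is simple arithmetic: the embedding table stores one vector per vocabulary entry. The sizes below are hypothetical and chosen only to show the scale of the effect.

```python
def embedding_bytes(vocab_size, embed_dim, bytes_per_weight):
    """Memory needed for a vocab_size x embed_dim embedding table."""
    return vocab_size * embed_dim * bytes_per_weight


# Hypothetical configurations, not measurements of any specific model.
large = embedding_bytes(vocab_size=30_000, embed_dim=128, bytes_per_weight=4)  # float32
small = embedding_bytes(vocab_size=4_000, embed_dim=128, bytes_per_weight=1)   # int8

print(large)  # 15360000 bytes, about 15.4 MB
print(small)  # 512000 bytes, about 0.5 MB
```

Under these assumptions, shrinking the vocabulary and quantizing the weights cuts the table by a factor of 30, which can decide whether a model fits in a microcontroller's flash at all.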

TINY LANGUAGE MODELS

How Subword Tokenization Works

Subword tokenization is a core text processing method for language models, especially critical for deployment on resource-constrained hardware.

Subword tokenization is a text segmentation method that breaks words into smaller, frequently occurring units called subwords or wordpieces. This approach enables a model to handle a vast vocabulary and out-of-vocabulary words with a fixed, manageable set of tokens, which is essential for memory-constrained tiny language models. Instead of a word-level vocabulary that grows unbounded, algorithms like Byte-Pair Encoding (BPE) or Unigram Language Model statistically learn a subword vocabulary from a training corpus, balancing granularity and token count.

For tiny machine learning deployment, this method provides significant efficiency gains. A compact subword vocabulary drastically reduces the size of the model's embedding layer, a major memory consumer. It also improves generalization by allowing the model to construct and understand novel words from known subword components. Libraries like SentencePiece implement these algorithms in a language-agnostic way, processing raw text as a sequence of Unicode characters to support multilingual models on edge devices.

COMPARISON

Subword vs. Word vs. Character Tokenization

A technical comparison of three fundamental text tokenization strategies, highlighting their trade-offs in vocabulary size, out-of-vocabulary handling, sequence length, and suitability for Tiny Language Models (TinyLMs) and microcontroller deployment.

Feature / Metric	Word Tokenization	Subword Tokenization	Character Tokenization
Core Unit	Entire words (e.g., 'running')	Frequent subword units (e.g., 'run', '##ning')	Individual characters (e.g., 'r', 'u', 'n')
Vocabulary Size	Very Large (50k-200k+)	Fixed & Manageable (e.g., 30k)	Tiny (< 1k)
Out-of-Vocabulary (OOV) Handling	❌ Poor (requires fallback)	✅ Excellent (via subword composition)	✅ Perfect (no OOV)
Typical Sequence Length	Shortest	Moderate	Longest (5-10x word count)
Semantic Richness per Token	Highest	High	Lowest
Model Embedding Layer Size	Largest (major memory cost)	Controlled	Smallest
Suitability for Tiny Language Models	❌ Poor (memory prohibitive)	✅ Excellent (balanced efficiency)	⚠️ Possible (but long sequences)
Common Algorithms / Examples	Whitespace splitting, spaCy	Byte-Pair Encoding (BPE), WordPiece, Unigram	Direct character or Unicode code point mapping
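
The sequence-length row can be checked on any sentence by counting tokens at the two extremes. A minimal sketch with a made-up sentence, using whitespace splitting as a stand-in for word tokenization:

```python
text = "the runner was running"  # hypothetical sample sentence

word_tokens = text.split()  # word-level: split on whitespace
char_tokens = list(text)    # character-level: one token per character, spaces included

print(len(word_tokens))  # 4
print(len(char_tokens))  # 22
print(len(char_tokens) / len(word_tokens))  # 5.5
```

A subword tokenizer lands between these two counts, with the exact figure depending on the learned vocabulary.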

SUBWORD TOKENIZATION

Frequently Asked Questions

Subword tokenization is a core text processing technique for modern language models, especially critical for deployment on memory-constrained devices. It balances the efficiency of word-level tokenization with the flexibility of character-level processing.

Subword tokenization is a text segmentation method that breaks words into smaller, frequently occurring units called subwords or wordpieces. It works by first training a tokenizer on a large corpus to learn a fixed-size vocabulary of common character sequences (subwords). At inference time, the tokenizer decomposes each word into known subwords from this vocabulary: BPE replays its learned merges, WordPiece uses greedy longest-match, and Unigram searches for the most probable segmentation. This lets the model handle an open vocabulary with a limited token set. For example, the word "unhappiness" might be tokenized as ["un", "happ", "iness"] if those subwords are in the learned vocabulary.

This process sits between word-level tokenization (which suffers from large vocabularies and out-of-vocabulary words) and character-level tokenization (which produces long sequences of tokens that carry little meaning on their own). Popular algorithms include Byte-Pair Encoding (BPE), WordPiece, and Unigram Language Model tokenization, which differ in how the vocabulary is constructed and how segmentation is performed.
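
The "unhappiness" example can be reproduced with a greedy longest-match segmenter, the strategy used by WordPiece-style encoders (BPE and Unigram segment differently). This is a simplified sketch: the function name, the toy vocabulary, and the "<unk>" fallback are assumptions, and the ## continuation prefix is omitted for brevity.

```python
def greedy_segment(word, vocab, unk="<unk>"):
    """Split word by repeatedly taking the longest prefix found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is a known subword.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return [unk]  # no known subword starts here
        pieces.append(word[start:end])
        start = end
    return pieces


vocab = {"un", "happ", "iness", "u", "n", "h"}  # hypothetical learned vocabulary
print(greedy_segment("unhappiness", vocab))  # ['un', 'happ', 'iness']
print(greedy_segment("xyz", vocab))          # ['<unk>']
```

Including every single character (or byte) in the vocabulary removes the "<unk>" case entirely, which is how byte-level tokenizers avoid unknown tokens.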

TOKENIZATION & COMPRESSION

Related Terms

Subword tokenization is a foundational component of modern NLP pipelines, enabling efficient vocabulary management. These related concepts detail the specific algorithms, libraries, and compression techniques that interact with or build upon subword methods, especially in resource-constrained TinyML environments.