Subword Tokenization

Subword tokenization is a text processing method that decomposes words into smaller, frequently occurring units to handle large vocabularies and out-of-vocabulary words efficiently.
NLP PREPROCESSING

What is Subword Tokenization?

Subword tokenization is a core natural language processing technique that bridges the gap between word-level and character-level text representation.

Subword tokenization is a text segmentation method that decomposes words into smaller, frequently occurring linguistic units called subwords or wordpieces. This approach enables a language model to handle a vast and open vocabulary—including complex compounds, morphological variants, and out-of-vocabulary words—using a fixed, manageable set of tokens. It is fundamental to modern transformer-based models like BERT and GPT, providing a balance between the granularity of characters and the semantic coherence of whole words.

Common algorithms include Byte-Pair Encoding (BPE), WordPiece, and the Unigram Language Model. Each learns a vocabulary from a corpus: BPE and WordPiece build it up by iteratively merging character sequences, while Unigram starts from a large candidate set and prunes it down. For Tiny Language Models deployed on microcontrollers, subword tokenization is critical. It lets a compact model represent language with a small embedding table, which directly reduces memory footprint, a key constraint in TinyML deployment. The model can then process novel terms without requiring a prohibitively large vocabulary.

FOUNDATIONAL METHODS

Core Subword Tokenization Algorithms

Subword tokenization bridges the gap between word-level and character-level processing. These core algorithms determine how text is decomposed into a finite, manageable set of tokens for language models.

01

Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE) is a data compression algorithm adapted for tokenization. It starts with a base vocabulary of individual characters and iteratively merges the most frequent adjacent symbol pairs into new subword units.

  • Process: Begins with characters, counts symbol pair frequencies, and merges the most common pair. Repeats for a set number of merges.
  • Key Feature: Creates a vocabulary of variable-length subwords. Common words become single tokens (e.g., "the"), while rare words are split (e.g., "tokenization" → "token", "ization").
  • Use Case: The original algorithm behind OpenAI's GPT models and many other early large language models.
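
The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than any library's implementation: the toy corpus, the function names (learn_bpe, merge_pair, encode), and the tie-breaking rule (lexicographically smallest pair wins) are all assumptions made for the example, and the end-of-word marker used by many real BPE implementations is omitted.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: count} dictionary."""
    # Every word starts as a sequence of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties go to the lexicographically smallest pair
        # (an arbitrary choice made here to keep the result deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = {merge_pair(s, best): f for s, f in corpus.items()}
    return merges


def encode(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus of word counts.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(toy_corpus, num_merges=4)
print(merges)
print(encode("lowest", merges))  # a word that never appeared in the corpus
```

With this toy corpus, the unseen word "lowest" is assembled from pieces learned from "low", "newest", and "widest", which is the behavior that lets BPE cover out-of-vocabulary words.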
02

WordPiece

WordPiece is a subword tokenization algorithm used by models like BERT. It is similar to BPE but uses a different, likelihood-based merging criterion.

  • Process: Starts with a base vocabulary of characters. Instead of merging the most frequent pair, it merges the pair that maximizes the likelihood of the training data when added to the vocabulary.
  • Key Feature: Tends to produce a more linguistically plausible segmentation than pure frequency-based BPE. It uses a ## prefix to denote subwords that are not at the beginning of a word.
  • Use Case: The standard tokenizer for the BERT family of models and their derivatives.
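
The ## convention is easiest to see at encoding time. The sketch below shows greedy longest-match-first segmentation of a single word, which is how BERT-style WordPiece tokenizers are commonly described. The vocabulary is a hypothetical hand-written set, not one learned with the likelihood criterion, and wordpiece_encode is an illustrative name rather than a library function.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece fits here: the whole word becomes unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical vocabulary, for illustration only.
vocab = {"token", "##ization", "##ize", "un", "##happi", "##ness"}
print(wordpiece_encode("tokenization", vocab))
print(wordpiece_encode("unhappiness", vocab))
print(wordpiece_encode("xyz", vocab))
```

Because a word-initial piece and a continuation piece are distinct vocabulary entries, "token" and "##token" would be stored and embedded separately.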
03

Unigram Language Model

The Unigram Language Model tokenization algorithm starts with a large seed vocabulary and iteratively prunes it down, optimizing for overall corpus likelihood.

  • Process: Assumes each subword in a segmentation occurs independently (the unigram assumption), so the probability of a segmentation is the product of its subword probabilities. Begins with a large, overcomplete vocabulary (e.g., all frequent substrings) and uses the Expectation-Maximization algorithm to estimate subword probabilities. The vocabulary is then shrunk by removing the subwords whose removal reduces corpus likelihood the least.
  • Key Feature: Can output multiple possible segmentations with probabilities for a single word, allowing for sampling-based segmentation. It is inherently probabilistic.
  • Use Case: Used by the SentencePiece tool and models like ALBERT and T5.
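
Given subword probabilities, the most likely segmentation of a word can be found with dynamic programming (Viterbi). The sketch below covers only that decoding step, not the EM training or pruning. The probabilities are made-up values chosen for illustration, and viterbi_segment is a hypothetical helper name.

```python
import math


def viterbi_segment(word, log_probs):
    """Return the highest-probability segmentation and its log-probability."""
    n = len(word)
    # best_score[i] is the best log-probability of segmenting word[:i].
    best_score = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        return None, -math.inf  # the word cannot be covered by the vocabulary
    # Follow the back-pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1], best_score[n]


# Made-up subword probabilities, for illustration only.
probs = {"un": 0.1, "happi": 0.05, "ness": 0.08, "happiness": 0.001}
probs.update({ch: 0.01 for ch in "unhapies"})  # single-character fallbacks
log_probs = {piece: math.log(p) for piece, p in probs.items()}

pieces, score = viterbi_segment("unhappiness", log_probs)
print(pieces, math.exp(score))
```

Sampling-based segmentation replaces the "keep only the best score" step with sampling among candidate segmentations in proportion to their probability.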
05

Byte-Level BPE (BBPE)

Byte-Level BPE (BBPE) is a variant of BPE that operates directly on UTF-8 encoded bytes rather than Unicode characters. This gives a small, universal base alphabet that covers any input text.

  • Process: Text is encoded into a sequence of bytes (256 possible values). The standard BPE merging process is applied to these byte sequences.
  • Key Feature: The final vocabulary size is 256 byte tokens plus one token per merge (and any special tokens), so the base alphabet is small and fixed. It can represent any text without an <UNK> token, because out-of-vocabulary words simply fall back to sequences of byte tokens.
  • Trade-off: The 256-symbol base guarantees coverage, but non-ASCII characters and rare words can expand into many tokens, which consumes more of the context window.
  • Use Case: Used by OpenAI's GPT-2 and RoBERTa to handle diverse text and emojis robustly.
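
The byte-level starting point can be illustrated with Python's built-in UTF-8 encoding. This sketch shows only the base representation before any merges are applied; the function names are hypothetical.

```python
def to_byte_tokens(text):
    """Map text to base token IDs in the range 0..255 (its UTF-8 bytes)."""
    return list(text.encode("utf-8"))


def from_byte_tokens(ids):
    """Invert the mapping; lossless for any valid UTF-8 byte sequence."""
    return bytes(ids).decode("utf-8")


text = "na\u00efve \U0001F642"  # "naive" with a diaeresis, plus an emoji
ids = to_byte_tokens(text)
print(len(text), "characters ->", len(ids), "byte tokens")  # 7 characters -> 11 byte tokens
print(from_byte_tokens(ids) == text)  # True
```

The accented letter takes two bytes and the emoji takes four, which is the sequence-length cost noted in the trade-off above. Learned merges then recombine frequent byte sequences into longer tokens.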
06

Algorithm Comparison & Trade-offs

Choosing a subword algorithm involves balancing vocabulary size, token sequence length, and linguistic coherence.

  • Vocabulary Size vs. Sequence Length: A larger vocabulary leads to shorter token sequences, but it enlarges the embedding table and leaves rare tokens with little training signal. A very small vocabulary (for example, byte-level BPE with few merges) guarantees coverage but produces long sequences.
  • Linguistic Coherence: BPE/WordPiece merges are greedy and can produce non-intuitive splits. The Unigram model's probabilistic approach can be more flexible.
  • TinyML Consideration: For microcontroller deployment, a small, fixed vocabulary is critical. Techniques like vocabulary pruning (removing rare tokens) or byte-level BPE with a limited number of merges help shrink the embedding matrix, which is often the largest layer in a small model.
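
The embedding-matrix cost is simple arithmetic: vocabulary size times embedding dimension times bytes per parameter. The figures below are illustrative assumptions (a 128-dimensional embedding stored as int8), not measurements from any particular model.

```python
def embedding_bytes(vocab_size, embed_dim, bytes_per_param):
    """Memory needed to store a vocab_size x embed_dim embedding table."""
    return vocab_size * embed_dim * bytes_per_param


EMBED_DIM = 128      # assumed embedding width
BYTES_PER_PARAM = 1  # assumed int8 quantization

for vocab_size in (30_000, 4_000, 256):
    size = embedding_bytes(vocab_size, EMBED_DIM, BYTES_PER_PARAM)
    print(f"vocab={vocab_size:>6}: {size / 1024:8.1f} KiB")
```

Under these assumptions a 30,000-token vocabulary needs about 3.7 MiB for embeddings alone, while a 4,000-token vocabulary needs 500 KiB, which is why vocabulary size is a first-order design decision on devices with limited flash and RAM.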
TINY LANGUAGE MODELS

How Subword Tokenization Works

Subword tokenization is a core text processing method for language models, especially critical for deployment on resource-constrained hardware.

Subword tokenization is a text segmentation method that breaks words into smaller, frequently occurring units called subwords or wordpieces. This approach enables a model to handle a vast vocabulary and out-of-vocabulary words with a fixed, manageable set of tokens, which is essential for memory-constrained tiny language models. Instead of a word-level vocabulary that grows unbounded, algorithms like Byte-Pair Encoding (BPE) or Unigram Language Model statistically learn a subword vocabulary from a training corpus, balancing granularity and token count.

For tiny machine learning deployment, this method provides significant efficiency gains. A compact subword vocabulary drastically reduces the size of the model's embedding layer, a major memory consumer. It also improves generalization by allowing the model to construct and understand novel words from known subword components. Libraries like SentencePiece implement these algorithms in a language-agnostic way, processing raw text as a sequence of Unicode characters to support multilingual models on edge devices.

COMPARISON

Subword vs. Word vs. Character Tokenization

A technical comparison of three fundamental text tokenization strategies, highlighting their trade-offs in vocabulary size, out-of-vocabulary handling, sequence length, and suitability for Tiny Language Models (TinyLMs) and microcontroller deployment.

| Feature / Metric | Word Tokenization | Subword Tokenization | Character Tokenization |
| --- | --- | --- | --- |
| Core Unit | Entire words (e.g., 'running') | Frequent subword units (e.g., 'run', '##ning') | Individual characters (e.g., 'r', 'u', 'n') |
| Vocabulary Size | Very Large (50k-200k+) | Fixed & Manageable (e.g., 30k) | Tiny (< 1k) |
| Out-of-Vocabulary (OOV) Handling | ❌ Poor (requires fallback) | ✅ Excellent (via subword composition) | ✅ Perfect (no OOV) |
| Typical Sequence Length | Shortest | Moderate | Longest (5-10x word count) |
| Semantic Richness per Token | Highest | High | Lowest |
| Model Embedding Layer Size | Largest (major memory cost) | Controlled | Smallest |
| Suitability for Tiny Language Models | ❌ Poor (memory prohibitive) | ✅ Excellent (balanced efficiency) | ⚠️ Possible (but long sequences) |
| Common Algorithms / Examples | Whitespace splitting, spaCy | Byte-Pair Encoding (BPE), WordPiece, SentencePiece | Direct character mapping, UTF-8 encoding |
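
The two extremes in the table can be contrasted on a single sentence. This is a toy sketch: the word vocabulary is a hypothetical three-word set, whitespace splitting stands in for word tokenization, and the function names are illustrative.

```python
def word_tokenize(text, vocab, unk="<unk>"):
    """Whitespace word tokenization with a closed vocabulary."""
    return [word if word in vocab else unk for word in text.split()]


def char_tokenize(text):
    """Character tokenization: one token per character, spaces included."""
    return list(text)


# Hypothetical closed word vocabulary.
word_vocab = {"the", "tokenizer", "handles"}
sentence = "the tokenizer handles retokenization"

words = word_tokenize(sentence, word_vocab)
chars = char_tokenize(sentence)
print(len(words), words)  # short sequence, but the unseen word is lost
print(len(chars))         # nothing is lost, but the sequence is much longer
```

A subword tokenizer sits between the two: it would keep the frequent words as single tokens and split only the unseen word into a few known pieces.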

SUBWORD TOKENIZATION

Frequently Asked Questions

Subword tokenization is a core text processing technique for modern language models, especially critical for deployment on memory-constrained devices. It balances the efficiency of word-level tokenization with the flexibility of character-level processing.

Subword tokenization is a text segmentation method that breaks words into smaller, frequently occurring units called subwords or wordpieces. It works by first training a tokenizer on a large corpus to learn a fixed-size vocabulary of common character sequences (subwords). At inference time, the tokenizer segments every input word using that vocabulary: frequent words usually map to a single token, while rare or unseen words are decomposed into several known subwords. The exact procedure depends on the algorithm (BPE replays its learned merges, WordPiece matches the longest known piece first, and Unigram picks the most probable segmentation). This lets the model handle an open vocabulary with a limited token set. For example, the word "unhappiness" might be tokenized as ["un", "happ", "iness"] if those subwords are in the learned vocabulary.

This process sits between word-level tokenization (which suffers from large vocabularies and out-of-vocabulary words) and character-level tokenization (which produces long sequences whose tokens carry little meaning on their own). Popular algorithms include Byte-Pair Encoding (BPE), WordPiece, and Unigram Language Model tokenization, which differ in how the vocabulary is constructed and how segmentation is performed.

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.