Inferensys

Tokenization

Tokenization is the foundational process in natural language processing that splits raw text into smaller units called tokens, which are the basic input for language models and retrieval systems.
DOCUMENT CHUNKING STRATEGIES

What is Tokenization?

The foundational process of segmenting raw text into smaller units for machine processing.

Tokenization is the computational process of breaking raw text into smaller, meaningful units called tokens, which serve as the fundamental input for language models and retrieval systems. In natural language processing (NLP), a token typically corresponds to a word, subword, or character, depending on the algorithm. This segmentation is the critical first step in document chunking strategies, as it determines the atomic units from which semantic chunks are built for efficient indexing and retrieval. Common algorithms include Byte-Pair Encoding (BPE) and WordPiece, which handle out-of-vocabulary words by using subword units.

The choice of tokenizer directly impacts chunk granularity and a system's ability to respect a model's maximum context length. For Retrieval-Augmented Generation (RAG) architectures, the indexing and query phases must use the same tokenizer, normally the one paired with the embedding model: if documents and queries are split into different token sequences, their embeddings are no longer comparable and relevant chunks can be missed. Furthermore, a tokenizer defines a model's vocabulary, which influences how well the model represents domain-specific terminology and how many tokens it needs to encode the proprietary text that enterprise data connectors feed into the system.

FOUNDATIONAL NLP

Types of Tokenization

Tokenization strategies differ in the unit they split text into, from whole sentences down to single bytes. The chosen strategy is a foundational architectural decision that directly impacts model performance, vocabulary size, and handling of out-of-vocabulary terms.

01

Word Tokenization

Word tokenization splits text into discrete words based on whitespace and punctuation. It is the most intuitive method, creating a vocabulary where each unique word is a token.

  • Example: "Don't split this!" becomes ["Don't", "split", "this", "!"].
  • Pros: Simple, human-interpretable.
  • Cons: Leads to large vocabularies, poor handling of out-of-vocabulary (OOV) words (e.g., "tokenization" → <UNK>), and inconsistent treatment of contractions and compound words.
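
The behavior described above can be sketched with Python's standard `re` module. This is a minimal illustration, not a production tokenizer: the regular expression and the names `simple_word_tokenize` and `map_to_vocab` are assumptions made for this example. The pattern keeps internal apostrophes so contractions stay whole, and the second helper shows how a fixed word vocabulary turns unseen words into <UNK>.

```python
import re

# One token is either a word (allowing internal apostrophes) or a single
# punctuation character. This pattern is an illustrative assumption.
_TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def simple_word_tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens."""
    return _TOKEN_RE.findall(text)


def map_to_vocab(tokens: list[str], vocab: set[str]) -> list[str]:
    """Replace any token missing from a fixed vocabulary with <UNK>."""
    return [tok if tok in vocab else "<UNK>" for tok in tokens]


print(simple_word_tokenize("Don't split this!"))
# ["Don't", 'split', 'this', '!']
print(map_to_vocab(["tokenization", "split"], {"split", "this"}))
# ['<UNK>', 'split']
```
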
02

Subword Tokenization

Subword tokenization splits text into units smaller than words but larger than characters, such as prefixes, suffixes, and common character sequences. This is the dominant method in modern LLMs (e.g., GPT, LLaMA).

  • Key Algorithms: Byte-Pair Encoding (BPE), WordPiece, Unigram Language Model.
  • Example: "tokenization" might be split into ["token", "ization"].
  • Pros: Balances vocabulary size and semantic meaning, effectively handles OOV words by breaking them into known subwords, and improves model efficiency.
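
To make the subword idea concrete, here is a toy sketch of Byte-Pair Encoding on a tiny word list. It is a simplified teaching version under stated assumptions: words start as individual characters, ties between equally frequent pairs go to the pair seen first, and the function names (`learn_bpe_merges`, `merge_pair`, `apply_bpe`) are hypothetical. Real tokenizers add pre-tokenization, byte handling, and special tokens.

```python
from collections import Counter


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent pair."""
    corpus = Counter(tuple(word) for word in words)  # symbol sequence -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Segment a new word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["low", "low", "lower"], num_merges=2)
print(merges)                       # [('l', 'o'), ('lo', 'w')]
print(apply_bpe("lowest", merges))  # ['low', 'e', 's', 't']
```

Note that the unseen word "lowest" is still representable: the known subword "low" is reused and the remainder falls back to single characters.
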
03

Character Tokenization

Character tokenization treats each individual character (including letters, digits, and punctuation) as a token.

  • Example: "cat" becomes ["c", "a", "t"].
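  • Pros: Very small vocabulary (often under 1,000 tokens for alphabetic scripts, though scripts such as Chinese need thousands of characters) and almost no OOV problems, since only characters never seen in training are unknown.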
  • Cons: Produces very long sequences for the model to process, obscuring semantic and morphological relationships. It is computationally inefficient for standard language modeling but can be useful for specialized tasks like spelling correction or generative work on rare scripts.
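
A minimal sketch of character tokenization follows. The helper names `build_char_vocab` and `char_encode` are assumptions for illustration; the point is that the vocabulary is simply the set of distinct characters in the corpus, and every character becomes one token.

```python
def build_char_vocab(corpus: str) -> dict[str, int]:
    """Assign an integer id to each distinct character, in sorted order."""
    return {ch: idx for idx, ch in enumerate(sorted(set(corpus)))}


def char_encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Encode text as one id per character (raises KeyError on unseen characters)."""
    return [vocab[ch] for ch in text]


vocab = build_char_vocab("cat")
print(list("cat"))                # ['c', 'a', 't']
print(char_encode("cat", vocab))  # [1, 0, 2]
```
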
04

Sentence Tokenization

Sentence tokenization (or sentence segmentation) splits a document into its constituent sentences. This is a higher-level, often rule-based, preprocessing step that typically precedes word or subword tokenization.

  • Example: "Hello world. How are you?" becomes ["Hello world.", "How are you?"].
  • Key Challenge: Sentence boundary disambiguation (e.g., distinguishing periods that end sentences from those in abbreviations like "Dr." or decimals).
  • Use Case: Critical for semantic chunking and sentence window retrieval in RAG systems, where preserving full sentence integrity maintains context.
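
The boundary problem above can be illustrated with a small rule-based splitter. This is a sketch under simplifying assumptions: the abbreviation list is deliberately tiny and hypothetical, runs of whitespace are collapsed, and the function name `split_sentences` is made up for this example. Production systems usually rely on trained or more elaborate rule-based segmenters.

```python
# Illustrative, non-exhaustive list of abbreviations that end with a period.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "e.g.", "i.e.", "etc."}


def split_sentences(text: str) -> list[str]:
    """Split on ., ! or ? at the end of a word, unless the word is an abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        ends_sentence = word[-1] in ".!?" and word.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Hello world. How are you?"))
# ['Hello world.', 'How are you?']
print(split_sentences("Dr. Smith paid 3.5 dollars. He left."))
# ['Dr. Smith paid 3.5 dollars.', 'He left.']
```

Because the text is split on whitespace first, a decimal such as "3.5" stays inside one word and its period is never treated as a boundary.
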
05

Byte-Level Tokenization

Byte-level tokenization operates on the raw byte representation of text. A prominent example is the Byte-level BPE (BBPE) used in models like GPT-2.

  • Mechanism: It applies BPE merges directly to UTF-8 byte sequences.
  • Pros: No unknown tokens. The base vocabulary is just the 256 possible byte values, so any text in any language can be represented without predefining a character set. This makes it highly flexible for multilingual and code corpora.
  • Cons: Can produce longer sequences than character-level tokenization for non-Latin scripts, as a single character may be encoded as multiple UTF-8 bytes when no learned merge covers it.
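
The sequence-length trade-off is easy to see by comparing character counts with UTF-8 byte counts. The sketch below shows only the raw byte step before any BPE merges are applied; the helper name `byte_tokens` is an assumption for illustration.

```python
def byte_tokens(text: str) -> list[int]:
    """Return the UTF-8 bytes of the text as integers in the range 0-255."""
    return list(text.encode("utf-8"))


for sample in ["cat", "\u00e9", "\u732b"]:  # "cat", "é", "猫"
    print(sample, "chars:", len(sample), "bytes:", len(byte_tokens(sample)))
# cat chars: 3 bytes: 3
# é chars: 1 bytes: 2
# 猫 chars: 1 bytes: 3

# Any byte sequence produced this way decodes back to the original text.
print(bytes(byte_tokens("\u732b")).decode("utf-8") == "\u732b")  # True
```
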
06

Whitespace Tokenization

Whitespace tokenization is a simplistic form of word tokenization that splits text only on whitespace characters (spaces, tabs, newlines) and leaves punctuation attached to the adjacent word.

  • Example: "Hello, world! How's it going?" becomes ["Hello,", "world!", "How's", "it", "going?"].
  • Pros: Fast and trivial to implement.
  • Cons: Punctuation remains attached to words, creating redundant vocabulary entries (e.g., "world" and "world!" are separate tokens). It is rarely used in production NLP systems but serves as a basic baseline.
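
In Python, this baseline is a single call to `str.split()` with no arguments, which splits on any run of whitespace. The wrapper name `whitespace_tokenize` is an assumption for illustration.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split on runs of spaces, tabs, and newlines; punctuation stays attached."""
    return text.split()


print(whitespace_tokenize("Hello, world! How's it going?"))
# ['Hello,', 'world!', "How's", 'it', 'going?']

# "world" and "world!" end up as two different vocabulary entries.
vocab = set(whitespace_tokenize("world world!"))
print(len(vocab))  # 2
```
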
FOUNDATIONAL PROCESS

How Tokenization Works in RAG & Document Chunking

Tokenization is the critical first step in preparing text for both language models and retrieval systems, directly impacting chunking strategies and overall system performance.

In Retrieval-Augmented Generation (RAG) and document chunking, tokens are the atomic inputs for both language models and embedding models. Because these models enforce maximum context lengths measured in tokens, not characters or words, the tokenizer determines how large a text segment really is and where it can safely be cut. Subword algorithms such as Byte-Pair Encoding (BPE) and WordPiece are the most common choice because they represent out-of-vocabulary terms as sequences of known subwords.

Effective document chunking must be planned around the target model's tokenizer so that chunks neither exceed context limits nor break in the middle of a word, which corrupts meaning. Strategies like recursive character text splitting typically measure size in characters by default and can be configured to count tokens instead. Furthermore, the choice of tokenizer affects a chunk's embedding: the same text processed by different tokenizers yields different token sequences and, consequently, different vector representations for retrieval.
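
A minimal sketch of token-budgeted chunking is shown below. It assumes a whitespace tokenizer purely as a stand-in so the example runs without dependencies; in a real pipeline the `tokenize` and `detokenize` callables would come from the tokenizer of the target embedding or language model. The function name `chunk_by_tokens` and its parameters are hypothetical.

```python
from typing import Callable


def chunk_by_tokens(
    text: str,
    max_tokens: int,
    overlap: int = 0,
    tokenize: Callable[[str], list[str]] = str.split,       # stand-in tokenizer
    detokenize: Callable[[list[str]], str] = " ".join,      # stand-in inverse
) -> list[str]:
    """Cut text into windows of at most `max_tokens` tokens, sharing `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    tokens = tokenize(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(detokenize(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_by_tokens("a b c d e f g", max_tokens=3, overlap=1))
# ['a b c', 'c d e', 'e f g']
```

Cutting on token boundaries guarantees that no chunk exceeds the budget as the model counts it, which a character-based limit cannot promise.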

SUBWORD TOKENIZATION

Tokenization Algorithm Comparison

A technical comparison of core algorithms used to split text into tokens, the foundational step for document chunking and model input.

| Feature | Byte-Pair Encoding (BPE) | WordPiece | Unigram Language Model | SentencePiece |
|---|---|---|---|---|
| Core Methodology | Iteratively merges most frequent character pairs | Maximizes likelihood of training data; merges based on likelihood | Starts with a large vocabulary and iteratively removes least important units | Language-agnostic framework implementing BPE, Unigram, etc. |
| Vocabulary Construction | Data-driven from corpus | Data-driven from corpus | Data-driven from corpus | Data-driven from raw text (no pre-tokenization) |
| Handles Unknown Words | | | | |
| Preserves Word Boundaries | | | | |
| Common Use Cases | GPT family models (GPT-2, GPT-3) | BERT, DistilBERT | XLNet, ALBERT | T5, many multilingual models |
| Language Agnostic | | | | |
| Requires Pre-tokenization | | | | |
| Deterministic Encoding | | | | Depends on underlying model |

TOKENIZATION

Frequently Asked Questions

Tokenization is the foundational process that converts raw text into discrete units for machine processing. These questions address its core mechanics, variations, and critical role in modern AI systems like Retrieval-Augmented Generation (RAG).

Tokenization is the process of breaking raw text into smaller units called tokens, which serve as the basic input for language models. A subword tokenizer, built with an algorithm such as Byte-Pair Encoding (BPE) or WordPiece or with a framework such as SentencePiece, works in two phases. During training, it learns a vocabulary of common character sequences (subwords) from a text corpus. At encoding time, it segments new text against that vocabulary: WordPiece greedily matches the longest subword available at each position, while BPE replays its learned merges in order. For example, a WordPiece-style tokenizer might split "unhappily" into ["un", "##happ", "##ily"], where ## marks a piece that continues the preceding one. This approach balances vocabulary size against the ability to represent rare and out-of-vocabulary words.
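
The longest-match segmentation in the "unhappily" example can be sketched as follows. This is a simplified, WordPiece-style illustration: the toy vocabulary, the `[UNK]` fallback string, and the function name `wordpiece_tokenize` are assumptions for this example, and real implementations add details such as a maximum word length and text normalization.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate and try again
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"un", "happy", "##happ", "##ily", "##y"}
print(wordpiece_tokenize("unhappily", toy_vocab))  # ['un', '##happ', '##ily']
print(wordpiece_tokenize("unhappy", toy_vocab))    # ['un', '##happ', '##y']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```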

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. For more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.