Nucleus Sampling (Top-p)

Nucleus sampling (top-p) is a text generation decoding method that dynamically truncates the vocabulary at each step, considering only the smallest set of top tokens whose cumulative probability mass exceeds a threshold p, balancing diversity and quality.

DECODING METHOD

What is Nucleus Sampling (Top-p)?

Nucleus sampling (top-p sampling) is a probabilistic text generation decoding method that dynamically selects from a truncated vocabulary at each generation step.

Nucleus sampling (top-p sampling) is a text generation decoding method that, at each step, dynamically truncates the vocabulary by considering only the smallest set of top tokens whose cumulative probability mass exceeds a predefined threshold p (e.g., 0.9). This creates a dynamic probability distribution from which the next token is sampled, filtering out low-probability long-tail tokens to improve output coherence while retaining diversity. It contrasts with top-k sampling, which uses a fixed number of tokens regardless of their probability mass.
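The definition can be written down more formally. The following is a sketch in notation that is an assumption of this example rather than the source's: V is the vocabulary, P(x | x_{<t}) is the model's next-token distribution at step t, V^{(p)} is the nucleus, and P' is the renormalized distribution that is actually sampled. Whether the boundary condition uses "at least p" or "strictly more than p" varies between descriptions and implementations.

```latex
\[
V^{(p)} = \operatorname*{arg\,min}_{V' \subseteq V} \; |V'|
\quad \text{subject to} \quad
\sum_{x \in V'} P(x \mid x_{<t}) \ge p
\]

\[
P'(x \mid x_{<t}) =
\begin{cases}
\dfrac{P(x \mid x_{<t})}{\sum_{x' \in V^{(p)}} P(x' \mid x_{<t})} & \text{if } x \in V^{(p)} \\[1ex]
0 & \text{otherwise}
\end{cases}
\]
```

Because the minimizing set always consists of the highest-probability tokens, it can be found by sorting tokens by probability and taking a prefix.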

The method directly influences confidence scoring for outputs by pruning low-likelihood tokens, making the model's effective probability distribution sharper and more focused. This reduces the chance of generating nonsensical or highly uncertain text, providing a more reliable signal for downstream confidence metrics. It is a key technique in Recursive Error Correction systems, where high-quality, diverse candidate outputs are needed for iterative refinement and self-evaluation loops.

Key Characteristics of Nucleus Sampling

Nucleus sampling (top-p sampling) is a probabilistic text generation method that dynamically selects tokens from a truncated vocabulary, balancing creativity and coherence by filtering out low-probability tails.

Dynamic Vocabulary Truncation

Unlike top-k sampling, which uses a fixed number of tokens, nucleus sampling dynamically adjusts the candidate set at each generation step. It calculates the cumulative probability distribution of the vocabulary and includes only the smallest set of top tokens whose combined probability mass exceeds the threshold p (e.g., 0.9). This means the size of the candidate pool can vary significantly from one step to the next based on the shape of the probability distribution.

Probability Mass Threshold (p)

The core parameter p (typically between 0.7 and 0.95) defines the minimum cumulative probability mass from which to sample.

A high p (e.g., 0.95) includes a broader set of tokens, increasing diversity but also the risk of incoherence.
A low p (e.g., 0.7) creates a narrower, high-probability set, leading to more deterministic and conservative outputs.

In both cases the method cuts off the long tail of improbable tokens, which reduces the chance of generating nonsensical or highly erratic text.
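To make the effect of p concrete, here is a minimal sketch that counts how many tokens end up in the nucleus for several thresholds. The helper name `nucleus_size` and the probability values are hypothetical and chosen only for illustration.

```python
def nucleus_size(probs, p):
    """Return how many top-ranked tokens are needed for cumulative mass to reach p."""
    cumulative = 0.0
    for size, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return size
    return len(probs)


# Hypothetical next-token distribution (illustrative values, sums to 1).
probs = [0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.02, 0.01]

sizes = {p: nucleus_size(probs, p) for p in (0.5, 0.7, 0.9, 0.95)}
print(sizes)  # {0.5: 2, 0.7: 3, 0.9: 5, 0.95: 6}
```

On this distribution, raising p from 0.7 to 0.95 doubles the candidate pool from three tokens to six.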

Controlled Randomness & Coherence

Nucleus sampling introduces controlled randomness by redistributing the probability mass among the selected nucleus of tokens and sampling from this renormalized distribution. This achieves a key balance:

Avoids pure greediness (always choosing the top-1 token), which leads to repetitive and bland text.
Mitigates the flaws of top-k, where a fixed k can sometimes include very low-probability tokens (if the distribution is peaked) or exclude plausible ones (if it is flat).

The result is text that is both coherent and creatively varied, making it a preferred method for creative writing, dialogue, and other open-ended generation tasks.

Contrast with Top-k Sampling

Top-k sampling and nucleus sampling are both stochastic decoding methods, but they differ fundamentally in candidate selection:

Top-k: Always selects the k highest-probability tokens, regardless of their actual probability values. This can be problematic if the distribution is very sharp (the fixed k pulls in many poor tokens) or very flat (the fixed k excludes good ones).
Nucleus (Top-p): Selects tokens based on a probability mass threshold, making the candidate set adaptive to the distribution's shape at each step.

In practice, nucleus sampling often outperforms top-k in human evaluations of fluency and diversity, though they can be combined (e.g., top-p after top-k filtering) for additional control.
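The difference in candidate selection can be sketched on two toy distributions. The function names `top_k_set` and `top_p_set` and both distributions are hypothetical. The threshold 0.85 is used instead of 0.9 only to keep the example away from floating-point ties at the boundary.

```python
def top_k_set(probs, k):
    """Indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


def top_p_set(probs, p):
    """Indices of the smallest top-ranked set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept


# Two hypothetical distributions over a 10-token vocabulary.
peaked = [0.90, 0.04, 0.02, 0.01, 0.01, 0.005, 0.005, 0.004, 0.003, 0.003]
flat = [0.1] * 10

for name, probs in (("peaked", peaked), ("flat", flat)):
    k_set = top_k_set(probs, k=5)
    p_set = top_p_set(probs, p=0.85)
    k_mass = sum(probs[i] for i in k_set)
    print(f"{name}: top-k keeps {len(k_set)} tokens (mass {k_mass:.2f}), "
          f"top-p keeps {len(p_set)} tokens")
```

With k = 5, top-k keeps five tokens in both cases: on the peaked distribution that includes tokens with probability 0.01, and on the flat one it discards half of the probability mass. Top-p keeps one token on the peaked distribution and nine on the flat one.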

Role in Confidence & Uncertainty

Within the context of confidence scoring, the nucleus sampling process itself provides a weak signal. The size and total probability mass of the selected nucleus at a given step can be indicative of the model's local uncertainty.

A small, high-mass nucleus suggests the model is very confident about the next token.
A large, diffuse nucleus indicates higher uncertainty, with many plausible continuations.

While not a formal confidence score, monitoring the nucleus composition can be part of a broader uncertainty quantification strategy for text generation systems.
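As a rough illustration of this signal, the sketch below reports the nucleus size and covered mass for one step, alongside the entropy of the full distribution for comparison. The function `nucleus_stats`, its return format, and the example distributions are assumptions made for this example, not part of any library.

```python
import math


def nucleus_stats(probs, p=0.9):
    """Return nucleus size, nucleus mass, and full-distribution entropy (nats)."""
    size, mass = 0, 0.0
    for prob in sorted(probs, reverse=True):
        size += 1
        mass += prob
        if mass >= p:
            break
    entropy = -sum(q * math.log(q) for q in probs if q > 0)
    return {"size": size, "mass": mass, "entropy": entropy}


# Hypothetical steps: one where the model is confident, one where it is not.
confident = nucleus_stats([0.97, 0.01, 0.01, 0.01])
uncertain = nucleus_stats([0.25, 0.25, 0.25, 0.25])

print(confident["size"], uncertain["size"])  # 1 4
```

A monitoring pipeline could log these values per step and flag spans where the nucleus stays large, though any threshold for "large" would need to be calibrated for the model and task.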

Implementation & Typical Use

Nucleus sampling is implemented in most major transformer libraries (e.g., Hugging Face's transformers). A standard implementation involves:

Sorting the token probabilities in descending order.
Calculating the cumulative sum.
Selecting the smallest prefix of tokens whose cumulative sum reaches p, including the token that crosses the threshold.
Renormalizing the probabilities of this subset.
Sampling the next token from this new distribution.

It is a widely used decoding choice for creative and conversational applications of models such as GPT-3, because it often balances quality and diversity better than greedy decoding or beam search.
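The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code of any particular library: the function name `nucleus_sample` and the example probabilities are hypothetical, and the input is assumed to be a list of probabilities that already sums to 1.

```python
import random


def nucleus_sample(probs, p=0.9, rng=random):
    """Sample a token index from `probs` using top-p (nucleus) filtering."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Step 1: sort token indices by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Steps 2-3: accumulate mass and keep tokens up to and including
    # the one that makes the cumulative sum reach p.
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Step 4: renormalize the kept probabilities so they sum to 1.
    weights = [probs[i] / cumulative for i in nucleus]
    # Step 5: sample one token index from the renormalized distribution.
    return rng.choices(nucleus, weights=weights, k=1)[0]


# Hypothetical distribution: with p = 0.9 the nucleus is tokens 0, 1, and 2.
probs = [0.5, 0.3, 0.15, 0.04, 0.01]
rng = random.Random(0)
samples = [nucleus_sample(probs, p=0.9, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

Including the token that crosses the threshold guarantees that the nucleus is never empty, even when the single most likely token already has probability above p.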

DECODING METHOD COMPARISON

Nucleus Sampling vs. Other Decoding Methods

A feature comparison of nucleus sampling (top-p) against other common text generation strategies, highlighting trade-offs in output diversity, quality, and control.

Feature / Metric	Nucleus Sampling (Top-p)	Greedy Decoding	Top-k Sampling	Temperature Scaling	Beam Search
Core Mechanism	Dynamically selects the smallest set of tokens whose cumulative probability exceeds p (e.g., 0.9).	Always selects the single token with the highest probability.	Selects from the k tokens with the highest probabilities.	Scales logits with a temperature parameter T before applying softmax.	Maintains multiple candidate sequences (beams), selecting the highest overall probability sequence.
Primary Control Parameter	Probability threshold p (0.0 to 1.0).	None (deterministic).	Integer k (1 to vocabulary size).	Temperature T (>0).	Number of beams b (integer).
Output Diversity	High, adapts to distribution shape; avoids tail.	None (fully deterministic).	Medium to High, fixed tail cutoff.	Continuously tunable from deterministic (T→0) to random (T→∞).	Low, explores high-probability sequences.
Coherence & Quality	High, avoids low-probability nonsense tokens.	High, but can be repetitive.	Variable; a low k can make output bland or repetitive, a high k can admit low-probability tokens and reduce coherence.	Variable; high T reduces coherence.	Very High, optimizes for sequence likelihood.
Runtime & Compute Cost	Low to Medium (requires sorting probabilities and a cumulative sum).	Lowest (single argmax).	Low (requires finding top k).	Low (simple scalar operation).	High (roughly b times the per-step cost of greedy decoding).
Handles "Flat" Distributions
Prone to Repetition
Common Use Case	Creative writing, chat applications.	Debugging, deterministic outputs.	Older language models (GPT-2).	Tuning diversity in combination with other methods.	Machine translation, formal text generation.
Typical Parameter Range	p = 0.7 to 0.95	N/A	k = 10 to 100	T = 0.5 to 1.5	b = 4 to 10

DECODING ALGORITHMS

Implementation in Platforms & Frameworks

Nucleus sampling (top-p sampling) is implemented as a core decoding parameter in major AI platforms and generative AI frameworks. It is a standard alternative to greedy decoding and top-k sampling for controlling output diversity.

Hugging Face Transformers

The Hugging Face transformers library implements nucleus sampling via the top_p parameter in its text generation pipelines and model .generate() methods.

Primary Parameter: top_p (float, default 1.0). A value of 0.9 is common.
Usage: output = model.generate(input_ids, top_p=0.92, do_sample=True)
Combination: Often used with temperature to further control randomness. The library dynamically constructs the nucleus (smallest set of tokens whose cumulative probability ≥ top_p) at each generation step.
Framework: The library is a de facto standard for open-source LLM experimentation and deployment.

OpenAI API & Chat Completions

The OpenAI API offers nucleus sampling as a primary parameter for its gpt-4, gpt-3.5-turbo, and other chat completion models.

API Parameter: top_p (number, optional). Defaults to 1. The API recommends using either top_p or temperature, but not both.
Example Request: {"model": "gpt-4", "messages": [...], "top_p": 0.95}
Effect: It restricts the sampling distribution to the top-p nucleus before a token is selected. Lower values make output more focused and predictable; higher values allow more varied wording, which matters for chatbot and content generation applications.

Anthropic Claude API

Anthropic's Claude API provides nucleus sampling (exposed as top_p) as a sampling parameter alongside temperature.

Parameter: top_p (float, range 0.0 to 1.0). Default is 1.0.
Implementation Detail: The API applies nucleus sampling to the model's next-token probability distribution. A lower value (e.g., 0.1) makes outputs more deterministic and focused.
Best Practice: Anthropic's documentation notes that top_p and temperature both affect randomness and suggests adjusting one or the other rather than both. It describes top_p as a parameter for advanced use cases, so temperature is usually the first setting to tune.

vLLM & High-Performance Serving

vLLM, a high-throughput LLM serving engine, implements nucleus sampling efficiently for production inference.

Engine Optimization: Sampling parameters such as top_p and temperature are applied by vLLM's sampler to the logits of each request in a batch. PagedAttention, which manages KV-cache memory, is independent of the sampling settings.
Serving API: Its OpenAI-compatible API server exposes the top_p parameter directly.
Performance: The implementation ensures that the dynamic vocabulary truncation required for top-p does not become a bottleneck during continuous batching, which is essential for cost-effective, high-volume enterprise deployment.

TensorFlow & PyTorch Native

While high-level APIs abstract it, nucleus sampling can be implemented directly in deep learning frameworks for custom sampling logic.

PyTorch Logic: Involves converting the logits to probabilities with torch.softmax(), using torch.sort() and torch.cumsum() on those probabilities to find the nucleus cutoff, then applying torch.multinomial() to the filtered, renormalized distribution.
TensorFlow Logic: Similar process using tf.math.top_k, tf.cumsum, and tf.random.categorical.
Custom Use Case: This low-level control is used in research for novel decoding algorithms or when integrating sampling into custom autoregressive model architectures beyond standard transformers.

LangChain & LLM Application Frameworks

LangChain and similar agent frameworks expose top_p as a configuration parameter for the underlying LLM providers they abstract.

Chain Configuration: When initializing an LLM wrapper (e.g., ChatOpenAI, ChatAnthropic), top_p is passed as a model keyword argument: ChatOpenAI(model="gpt-4", top_p=0.9, temperature=0.7). With temperature=0, decoding is effectively greedy, so top_p has little or no effect.
Agent Tuning: This allows developers to fix the decoding strategy for an entire agentic chain, ensuring consistent output characteristics for steps like planning, tool execution, and response generation. It's a key lever for balancing creativity and determinism in autonomous systems.

NUCLEUS SAMPLING

Frequently Asked Questions

Nucleus sampling (top-p sampling) is a core text generation technique that balances creativity and coherence. These FAQs address its mechanics, trade-offs, and practical applications for developers and ML engineers.

Nucleus sampling (top-p sampling) is a probabilistic text generation decoding method that dynamically truncates the model's vocabulary at each generation step. It works by sorting the predicted tokens by descending probability, then selecting the smallest set of top tokens whose cumulative probability mass exceeds a predefined threshold p (e.g., 0.9). The probabilities within this dynamically sized "nucleus" are renormalized, and the next token is sampled at random from it. This contrasts with top-k sampling, which uses a fixed number of tokens, and greedy decoding, which always picks the single most likely token.

DECODING & SAMPLING

Related Terms

Nucleus sampling is one of several techniques used to control the output of autoregressive language models. These related methods govern the trade-off between creativity, coherence, and determinism in text generation.

Temperature Scaling

A hyperparameter that controls the randomness of predictions by scaling the logits before applying the softmax function.

Lower temperature (<1.0) sharpens the probability distribution, making the model more confident and deterministic, favoring high-probability tokens.
Higher temperature (>1.0) flattens the distribution, increasing randomness and diversity in the output.
Often used in conjunction with nucleus sampling, where temperature adjusts the distribution before the top-p truncation is applied.
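The interaction between temperature and top-p can be illustrated with a short sketch: the logits are divided by the temperature, passed through softmax, and only then truncated. The helper names `softmax_with_temperature` and `nucleus_size` and the logit values are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_size(probs, p):
    """Number of top-ranked tokens needed for cumulative mass to reach p."""
    cumulative = 0.0
    for size, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return size
    return len(probs)


# Hypothetical logits for a 5-token vocabulary.
logits = [4.0, 3.0, 2.0, 1.0, 0.0]

sizes = {
    t: nucleus_size(softmax_with_temperature(logits, t), p=0.9)
    for t in (0.5, 1.0, 2.0)
}
print(sizes)  # {0.5: 2, 1.0: 3, 2.0: 4}
```

For the same p, a lower temperature sharpens the distribution and shrinks the nucleus, while a higher temperature flattens it and lets more tokens in.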

Top-k Sampling

A decoding method that restricts random sampling at each step to the k tokens with the highest probabilities.

Unlike nucleus sampling's dynamic vocabulary, top-k uses a fixed-size shortlist.
Can be problematic if the probability distribution is sharp (the fixed k pulls in many low-probability candidates) or flat (plausible candidates fall outside the fixed k).
Nucleus sampling (top-p) was developed to address these limitations by adapting the candidate set size based on the distribution's shape.

Greedy Decoding

The simplest decoding strategy which, at each step, selects the token with the single highest probability.

Leads to deterministic outputs but often results in repetitive, dull, and sometimes nonsensical text due to the lack of exploration.
Contrast with beam search, which is a heuristic search algorithm that explores multiple high-probability sequences in parallel but can still suffer from blandness and repetition.

Beam Search

A heuristic search algorithm that explores the most promising sequences by maintaining multiple hypotheses (beams) at each generation step.

Expands the beam_width number of most likely sequences at each step, pruning the rest.
Aims to find a high-probability sequence overall, not just make optimal local choices (like greedy decoding).
Tends to produce more fluent and grammatically correct text than greedy decoding but can be computationally heavier and still lacks the creativity of stochastic methods like nucleus sampling.
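A toy sketch can show why keeping several hypotheses matters. The `beam_search` function and the `toy_step` model below are hypothetical and deliberately simplified: there is no end-of-sequence handling or length normalization, and the "model" is a hand-written lookup rather than a neural network.

```python
import math


def beam_search(step_fn, beam_width, max_len):
    """Return (tokens, log_prob) pairs, best first, after max_len steps."""
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            # Expand every surviving hypothesis by every possible next token.
            for token, prob in step_fn(tokens).items():
                candidates.append((tokens + (token,), score + math.log(prob)))
        # Keep only the beam_width most probable sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_step(prefix):
    """Hypothetical model: 'a' looks best first, but 'b' has a better follow-up."""
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix[-1] == "a":
        return {"x": 0.55, "y": 0.45}
    return {"x": 0.9, "y": 0.1}


greedy_best = beam_search(toy_step, beam_width=1, max_len=2)[0]
beam_best = beam_search(toy_step, beam_width=2, max_len=2)[0]
print(greedy_best[0], beam_best[0])  # ('a', 'x') ('b', 'x')
```

With a beam width of 1 the search reduces to greedy decoding and commits to "a", ending with sequence probability 0.33. With a width of 2 it keeps "b" alive and finds the sequence with probability 0.36.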

Typical Sampling

An alternative to nucleus sampling that keeps the tokens whose information content (surprisal) is closest to the entropy of the model's next-token distribution.

Instead of ranking tokens by probability (as top-p does), it ranks them by how close their negative log probability (surprisal) is to the distribution's entropy, then keeps the smallest such set whose cumulative probability reaches the threshold.
Aims to match the information content of human-generated text more closely, potentially reducing the likelihood of generating bland or overly eccentric text.
Defined by the hyperparameter tau (the typical mass), analogous to p in nucleus sampling.
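As a sketch of one common formulation of locally typical filtering (an assumption of this example, not a reference implementation): tokens are ranked by the absolute difference between their surprisal and the distribution's entropy, and the ranked list is cut once the cumulative probability reaches tau. The function name `typical_filter` and the probabilities are hypothetical.

```python
import math


def typical_filter(probs, tau=0.9):
    """Indices kept by locally typical filtering with mass threshold tau."""
    entropy = -sum(q * math.log(q) for q in probs if q > 0)
    # Rank tokens by how close their surprisal (-log q) is to the entropy.
    ranked = sorted(
        (i for i, q in enumerate(probs) if q > 0),
        key=lambda i: abs(-math.log(probs[i]) - entropy),
    )
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= tau:
            break
    return kept


# Hypothetical distribution: token 1, not the most probable token 0,
# has the surprisal closest to the entropy and is ranked first.
probs = [0.6, 0.2, 0.1, 0.05, 0.05]
kept = typical_filter(probs, tau=0.7)
print(kept)  # [1, 0]
```

Unlike top-p, the most probable token is not guaranteed to be ranked first, which is how the method can steer away from overly predictable continuations.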

Eta Sampling (η)

A sampling method that truncates the vocabulary with an entropy-dependent probability threshold, aiming for more consistent output quality across different contexts.

Removes tokens whose probability falls below η = min(ε, √ε · exp(−H)) at that step, where ε is a hyperparameter and H is the entropy of the next-token distribution, so the cutoff loosens when the model is uncertain.
Designed to be more robust across diverse prompts and model sizes compared to fixed-parameter methods.
Represents another approach to the core problem nucleus sampling solves: dynamically determining which tokens are plausible candidates for the next step.
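The following sketch assumes the eta-sampling formulation in which the cutoff is η = min(ε, √ε · exp(−H)), with H the entropy of the next-token distribution. That formula, the function name `eta_filter`, the value ε = 0.01, and the example distributions are all assumptions made for illustration.

```python
import math


def eta_filter(probs, epsilon=0.01):
    """Return (kept indices, eta) for an entropy-dependent probability cutoff."""
    entropy = -sum(q * math.log(q) for q in probs if q > 0)
    eta = min(epsilon, math.sqrt(epsilon) * math.exp(-entropy))
    kept = [i for i, q in enumerate(probs) if q >= eta]
    if not kept:
        # Safeguard: always keep at least the most probable token.
        kept = [max(range(len(probs)), key=lambda i: probs[i])]
    return kept, eta


# Hypothetical peaked distribution: the cutoff stays at epsilon.
kept_peaked, eta_peaked = eta_filter([0.97, 0.02, 0.005, 0.005])

# Hypothetical flat distribution over 200 tokens (each 0.005 < epsilon):
# the entropy term lowers the cutoff, so no token is removed.
kept_flat, eta_flat = eta_filter([1 / 200] * 200)

print(len(kept_peaked), len(kept_flat))  # 2 200
```

A fixed probability cutoff of 0.01 would have rejected every token in the flat case, which is the situation the entropy term is meant to handle.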

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
