Nucleus sampling (top-p sampling) is a text generation decoding method that, at each step, dynamically truncates the vocabulary by considering only the smallest set of top tokens whose cumulative probability mass exceeds a predefined threshold p (e.g., 0.9). This creates a dynamic probability distribution from which the next token is sampled, filtering out low-probability long-tail tokens to improve output coherence while retaining diversity. It contrasts with top-k sampling, which uses a fixed number of tokens regardless of their probability mass.
Nucleus Sampling (Top-p)

What is Nucleus Sampling (Top-p)?
Nucleus sampling (top-p sampling) is a probabilistic text generation decoding method that dynamically selects from a truncated vocabulary at each generation step.
The method directly influences confidence scoring for outputs by pruning low-likelihood tokens, making the model's effective probability distribution sharper and more focused. This reduces the chance of generating nonsensical or highly uncertain text, providing a more reliable signal for downstream confidence metrics. It is a key technique in Recursive Error Correction systems, where high-quality, diverse candidate outputs are needed for iterative refinement and self-evaluation loops.
Key Characteristics of Nucleus Sampling
Nucleus sampling (top-p sampling) is a probabilistic text generation method that dynamically selects tokens from a truncated vocabulary, balancing creativity and coherence by filtering out low-probability tails.
Dynamic Vocabulary Truncation
Unlike top-k sampling, which uses a fixed number of tokens, nucleus sampling dynamically adjusts the candidate set at each generation step. It calculates the cumulative probability distribution of the vocabulary and includes only the smallest set of top tokens whose combined probability mass exceeds the threshold p (e.g., 0.9). This means the size of the candidate pool can vary significantly from one step to the next based on the shape of the probability distribution.
Probability Mass Threshold (p)
The core parameter p (typically between 0.7 and 0.95) defines the minimum cumulative probability mass from which to sample.
- A high p (e.g., 0.95) includes a broader set of tokens, increasing diversity but also the risk of incoherence.
- A low p (e.g., 0.7) creates a very narrow, high-probability set, leading to more deterministic and conservative outputs.
The method effectively cuts off the long tail of improbable tokens, preventing the generation of nonsensical or highly erratic text.
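To make the effect of the distribution's shape and of p concrete, here is a minimal NumPy sketch that counts how many tokens end up in the nucleus. The helper name `nucleus_size` and the two toy distributions are illustrative assumptions, not part of any library.

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    """Count the top tokens needed for the cumulative mass to reach p."""
    sorted_probs = np.sort(np.asarray(probs, dtype=float))[::-1]  # descending
    cumulative = np.cumsum(sorted_probs)
    # Index of the first position whose cumulative mass is >= p, as a count.
    # The min() guards against p exceeding the total mass due to rounding.
    return min(int(np.searchsorted(cumulative, p)) + 1, sorted_probs.size)

# Hypothetical next-token distributions over a five-token vocabulary.
peaked = [0.85, 0.10, 0.03, 0.01, 0.01]
flat = [0.20, 0.20, 0.20, 0.20, 0.20]

for name, dist in (("peaked", peaked), ("flat", flat)):
    for p in (0.7, 0.9):
        print(name, p, nucleus_size(dist, p))
```

With the same p, the peaked distribution yields a much smaller nucleus than the flat one, and lowering p shrinks the nucleus in both cases.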
Controlled Randomness & Coherence
Nucleus sampling introduces controlled randomness by redistributing the probability mass among the selected nucleus of tokens and sampling from this renormalized distribution. This achieves a key balance:
- Avoids pure greediness (always choosing the top-1 token), which leads to repetitive and bland text.
- Mitigates the flaws of top-k, where a fixed k can admit very low-probability tokens (if the distribution is peaked) or exclude plausible ones (if it is flat).
The result is text that is both coherent and creatively varied, making it a preferred method for creative writing, dialogue, and other open-ended generation tasks.
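The truncation and renormalization described here can be written compactly. The notation below is an assumption introduced for illustration: x_{<i} denotes the tokens generated so far, V the full vocabulary, and V^{(p)} the nucleus, understood as the top-ranked tokens. The block assumes the `amsmath` package.

```latex
% V^{(p)}: smallest set of top-ranked tokens whose mass reaches p
\[
V^{(p)} = \operatorname*{arg\,min}_{V' \subseteq V} |V'|
\quad \text{s.t.} \quad \sum_{x \in V'} P(x \mid x_{<i}) \ge p
\]
% Renormalized distribution from which the next token is sampled
\[
P'(x \mid x_{<i}) =
\begin{cases}
P(x \mid x_{<i}) / p' & \text{if } x \in V^{(p)} \\
0 & \text{otherwise}
\end{cases}
\qquad
p' = \sum_{x \in V^{(p)}} P(x \mid x_{<i})
\]
```

For example, with probabilities (0.5, 0.3, 0.15, 0.05) and p = 0.9, the nucleus is the first three tokens, p' = 0.95, and the renormalized probabilities are roughly (0.526, 0.316, 0.158).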
Contrast with Top-k Sampling
Top-k sampling and nucleus sampling are both stochastic decoding methods, but they differ fundamentally in candidate selection:
- Top-k: Always selects the k highest-probability tokens, regardless of their actual probability values. This can be problematic if the distribution is very sharp (the fixed k admits many poor tokens) or very flat (good candidates beyond the top k are excluded).
- Nucleus (Top-p): Selects tokens based on a probability mass threshold, making the candidate set adaptive to the distribution's shape at each step.
In practice, nucleus sampling often outperforms top-k in human evaluations of fluency and diversity, though they can be combined (e.g., top-p after top-k filtering) for additional control.
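The difference in candidate selection is easiest to see side by side. The following is a rough sketch, with hypothetical helper names (`top_k_mask`, `top_p_mask`) and made-up distributions, that applies both filters to a sharp and a flat distribution.

```python
import numpy as np

def top_k_mask(probs, k):
    """Boolean mask keeping the k highest-probability tokens."""
    probs = np.asarray(probs, dtype=float)
    keep = np.zeros(probs.shape, dtype=bool)
    keep[np.argsort(-probs, kind="stable")[:k]] = True
    return keep

def top_p_mask(probs, p):
    """Boolean mask keeping the smallest top set whose mass reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs, kind="stable")       # token ids, best first
    cumulative = np.cumsum(probs[order])
    cutoff = min(int(np.searchsorted(cumulative, p)) + 1, probs.size)
    keep = np.zeros(probs.shape, dtype=bool)
    keep[order[:cutoff]] = True
    return keep

# Illustrative distributions (assumed values, not model output).
sharp = [0.96, 0.01, 0.01, 0.01, 0.01]
flat = [0.12, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11]

for name, dist in (("sharp", sharp), ("flat", flat)):
    k_kept = int(top_k_mask(dist, k=3).sum())
    p_kept = int(top_p_mask(dist, p=0.9).sum())
    print(f"{name}: top-k keeps {k_kept}, top-p keeps {p_kept}")
```

On the sharp distribution, top-k with k = 3 still admits two 1% tokens while top-p keeps only the dominant token. On the flat one, top-k covers only about a third of the probability mass while top-p widens to the whole set.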
Role in Confidence & Uncertainty
Within the context of confidence scoring, the nucleus sampling process itself provides a weak signal. The size and total probability mass of the selected nucleus at a given step can be indicative of the model's local uncertainty.
- A small, high-mass nucleus suggests the model is very confident about the next token.
- A large, diffuse nucleus indicates higher uncertainty, with many plausible continuations.
While not a formal confidence score, monitoring the nucleus composition can be part of a broader uncertainty quantification strategy for text generation systems.
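As a rough illustration of this kind of monitoring, the sketch below reports the nucleus size, its total mass, and the entropy of the full distribution for a single step. The function name `nucleus_stats` and the returned dictionary are assumptions made for this example. Entropy is included only as a familiar point of comparison.

```python
import numpy as np

def nucleus_stats(probs, p=0.9):
    """Return per-step statistics that loosely track local uncertainty."""
    probs = np.asarray(probs, dtype=float)
    sorted_probs = np.sort(probs)[::-1]
    cumulative = np.cumsum(sorted_probs)
    size = min(int(np.searchsorted(cumulative, p)) + 1, probs.size)
    mass = float(cumulative[size - 1])          # mass actually inside the nucleus
    nonzero = probs[probs > 0]                  # avoid log(0)
    entropy = float(-(nonzero * np.log(nonzero)).sum())
    return {"size": size, "mass": mass, "entropy": entropy}

# Hypothetical steps: one confident, one uncertain.
confident = nucleus_stats([0.97, 0.01, 0.01, 0.01])
uncertain = nucleus_stats([0.25, 0.25, 0.25, 0.25])
print(confident)
print(uncertain)
```

A logging hook could record these values per generated token and flag spans where the nucleus stays large for several consecutive steps.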
Implementation & Typical Use
Nucleus sampling is implemented in most major transformer libraries (e.g., Hugging Face's transformers). A standard implementation involves:
- Sorting the token probabilities in descending order.
- Calculating the cumulative sum.
- Selecting the smallest prefix of this ranking whose cumulative sum reaches p (the token that crosses the threshold is kept, so the set is never empty).
- Renormalizing the probabilities of this subset.
- Sampling the next token from this new distribution.
It is widely used as the default or recommended decoding method for creative and conversational AI applications with models like GPT-3, due to its superior balance of quality and diversity compared to pure greedy decoding or beam search.
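The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration operating on an already-normalized probability vector, not the code used by any particular library. The function name `sample_top_p` and its `rng` parameter are assumptions.

```python
import numpy as np

def sample_top_p(probs, p=0.9, rng=None):
    """Sample one token id from the top-p nucleus of `probs`."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=float)

    order = np.argsort(-probs, kind="stable")    # 1. sort, highest first
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)         # 2. cumulative sum

    # 3. keep tokens up to and including the one that crosses p
    cutoff = min(int(np.searchsorted(cumulative, p)) + 1, probs.size)
    nucleus = sorted_probs[:cutoff]

    nucleus = nucleus / nucleus.sum()            # 4. renormalize
    choice = rng.choice(cutoff, p=nucleus)       # 5. sample within the nucleus
    return int(order[choice])                    # map back to the token id

# Usage with a made-up distribution: only ids 0, 1, and 2 can be drawn at p = 0.9.
example_probs = [0.5, 0.3, 0.15, 0.04, 0.01]
rng = np.random.default_rng(0)
draws = {sample_top_p(example_probs, p=0.9, rng=rng) for _ in range(200)}
```

Library implementations typically work on logits and mask excluded tokens before the softmax, but the selection rule is the same.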
Nucleus Sampling vs. Other Decoding Methods
A feature comparison of nucleus sampling (top-p) against other common text generation strategies, highlighting trade-offs in output diversity, quality, and control.
| Feature / Metric | Nucleus Sampling (Top-p) | Greedy Decoding | Top-k Sampling | Temperature Scaling | Beam Search |
|---|---|---|---|---|---|
| Core Mechanism | Dynamically selects the smallest set of tokens whose cumulative probability exceeds p (e.g., 0.9). | Always selects the single token with the highest probability. | Selects from the k tokens with the highest probabilities. | Scales logits with a temperature parameter T before applying softmax. | Maintains multiple candidate sequences (beams), selecting the highest overall probability sequence. |
| Primary Control Parameter | Probability threshold p (0.0 to 1.0). | None (deterministic). | Integer k (1 to vocabulary size). | Temperature T (>0). | Number of beams b (integer). |
| Output Diversity | High, adapts to distribution shape; avoids tail. | None (fully deterministic). | Medium to High, fixed tail cutoff. | Continuously tunable from deterministic (T→0) to uniformly random (T→∞). | Low, explores high-probability sequences. |
| Coherence & Quality | High, avoids low-probability nonsense tokens. | High, but can be repetitive. | Variable; low k can make output bland or repetitive, high k can admit incoherent tail tokens. | Variable; high T reduces coherence. | Very High, optimizes for sequence likelihood. |
| Runtime & Compute Cost | Low to Medium (requires sorting and a cumulative sum). | Lowest (single argmax). | Low (requires finding top k). | Low (simple scalar operation). | High (scales with b × sequence length). |
| Handles "Flat" Distributions | | | | | |
| Prone to Repetition | | | | | |
| Common Use Case | Creative writing, chat applications. | Debugging, deterministic outputs. | Older language models (GPT-2). | Tuning diversity in combination with other methods. | Machine translation, formal text generation. |
| Typical Parameter Range | p = 0.7 to 0.95 | N/A | k = 10 to 100 | T = 0.5 to 1.5 | b = 4 to 10 |
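Temperature scaling is the one entry in the table that reshapes the distribution rather than truncating it, so a short sketch may help. The function name `softmax_with_temperature` and the example logits are assumptions chosen for illustration.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()        # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.0]                  # made-up logits for three tokens
for t in (0.5, 1.0, 2.0):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
```

Lower temperatures concentrate mass on the top token and higher temperatures flatten the distribution, which is why temperature is often applied before a top-p or top-k filter rather than instead of one.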
Implementation in Platforms & Frameworks
Nucleus sampling (top-p sampling) is implemented as a core decoding parameter in major AI platforms and generative AI frameworks. It is a standard alternative to greedy decoding and top-k sampling for controlling output diversity.
Frequently Asked Questions
Nucleus sampling (top-p sampling) is a core text generation technique that balances creativity and coherence. These FAQs address its mechanics, trade-offs, and practical applications for developers and ML engineers.
Nucleus sampling (top-p sampling) is a probabilistic text generation decoding method that dynamically truncates the model's vocabulary at each generation step. It works by sorting the predicted tokens by descending probability, then selecting the smallest set of top tokens whose cumulative probability mass exceeds a predefined threshold p (e.g., 0.9). The model then randomly samples the next token from this dynamically-sized 'nucleus' or subset, renormalizing the probabilities within it. This contrasts with top-k sampling, which uses a fixed number of tokens, and greedy decoding, which always picks the single most likely token.
Related Terms
Nucleus sampling is one of several techniques used to control the output of autoregressive language models. These related methods govern the trade-off between creativity, coherence, and determinism in text generation.
Top-k Sampling
A decoding method that restricts random sampling at each step to the k tokens with the highest probabilities.
- Unlike nucleus sampling's dynamic vocabulary, top-k uses a fixed-size shortlist.
- Can be problematic if the probability distribution is sharp (the fixed k admits low-probability tail tokens) or flat (plausible candidates fall outside the fixed k).
- Nucleus sampling (top-p) was developed to address these limitations by adapting the candidate set size based on the distribution's shape.
Greedy Decoding
The simplest decoding strategy which, at each step, selects the token with the single highest probability.
- Leads to deterministic outputs but often results in repetitive, dull, and sometimes nonsensical text due to the lack of exploration.
- Contrast with beam search, which is a heuristic search algorithm that explores multiple high-probability sequences in parallel but can still suffer from blandness and repetition.
Beam Search
A heuristic search algorithm that explores the most promising sequences by maintaining multiple hypotheses (beams) at each generation step.
- Expands the beam_width most likely sequences at each step, pruning the rest.
- Aims to find a high-probability sequence overall, not just make locally optimal choices (as greedy decoding does).
- Tends to produce more fluent and grammatically correct text than greedy decoding but can be computationally heavier and still lacks the creativity of stochastic methods like nucleus sampling.
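A toy example shows how keeping more than one hypothesis can beat a purely local choice. Everything here is hypothetical: the `beam_search` helper, the `toy_model` lookup table, and its probabilities are invented for illustration and stand in for a real language model.

```python
import math

def beam_search(next_log_probs, beam_width, steps):
    """Keep the `beam_width` best prefixes by summed log-probability."""
    beams = [((), 0.0)]                               # (prefix, score)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Sort best-first and prune to the beam width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

def toy_model(prefix):
    """Hypothetical two-step model where the greedy first choice is a trap."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    return {tok: math.log(q) for tok, q in table[prefix].items()}

greedy = beam_search(toy_model, beam_width=1, steps=2)[0]
wide = beam_search(toy_model, beam_width=2, steps=2)[0]
print(greedy[0], round(math.exp(greedy[1]), 2))
print(wide[0], round(math.exp(wide[1]), 2))
```

With a beam width of 1 the search behaves like greedy decoding and ends on a sequence with probability 0.30, while a width of 2 recovers the sequence with probability 0.36.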
Typical Sampling
An alternative to nucleus sampling that samples from tokens whose information content (surprisal) is close to the entropy of the model's next-token distribution.
- Instead of ranking tokens by probability (top-p), it ranks them by how far their surprisal (negative log probability) lies from that entropy, then keeps the smallest such set whose cumulative probability mass reaches the threshold.
- Aims to match the information content of human-generated text more closely, potentially reducing the likelihood of generating bland or overly eccentric text.
- Defined by the hyperparameter tau (typicality), analogous to p in nucleus sampling.
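The sketch below shows one common way to implement this filter, assuming the usual formulation: tokens are ranked by the absolute gap between their surprisal and the distribution's entropy, and the closest ones are kept until their mass reaches tau. The helper name `typical_mask` and the example distribution are assumptions.

```python
import numpy as np

def typical_mask(probs, tau=0.9):
    """Boolean mask of the locally typical set with cumulative mass >= tau."""
    probs = np.asarray(probs, dtype=float)
    surprisal = -np.log(np.clip(probs, 1e-12, None))   # -log p, guarded
    entropy = float((probs * surprisal).sum())
    deviation = np.abs(surprisal - entropy)            # distance from "typical"
    order = np.argsort(deviation, kind="stable")       # most typical first
    cumulative = np.cumsum(probs[order])
    cutoff = min(int(np.searchsorted(cumulative, tau)) + 1, probs.size)
    keep = np.zeros(probs.shape, dtype=bool)
    keep[order[:cutoff]] = True
    return keep

probs = [0.6, 0.25, 0.1, 0.05]          # made-up next-token distribution
print(typical_mask(probs, tau=0.2))     # only the second token survives
print(typical_mask(probs, tau=0.5))     # the first two tokens survive
```

In this example the second token is the most typical one, so a small tau can exclude even the highest-probability token, something top-p never does.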
Eta Sampling (η)
A sampling method that dynamically adjusts the vocabulary size using an entropy-dependent probability threshold, aiming for more consistent output quality across different contexts.
- Truncates the distribution by cutting off tokens whose probability falls below η = min(ε, √ε · exp(-H)) at that step, where ε is the hyperparameter and H is the entropy of the next-token distribution.
- Designed to be more robust across diverse prompts and model sizes compared to fixed-parameter methods.
- Represents another approach to the core problem nucleus sampling solves: dynamically determining which tokens are plausible candidates for the next step.
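For comparison with top-p, here is a minimal sketch of an eta-style filter. It assumes the entropy-dependent threshold η = min(ε, √ε · exp(-H)), where H is the entropy of the next-token distribution, rather than a threshold tied to the maximum probability. The function name `eta_mask`, the default ε, and the example distributions are illustrative assumptions.

```python
import numpy as np

def eta_mask(probs, epsilon=0.01):
    """Boolean mask of tokens whose probability is at least eta."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]                          # avoid log(0)
    entropy = float(-(nonzero * np.log(nonzero)).sum())
    # Assumed threshold: shrinks as the distribution gets flatter.
    eta = min(epsilon, np.sqrt(epsilon) * np.exp(-entropy))
    keep = probs >= eta
    keep[np.argmax(probs)] = True                       # never return an empty set
    return keep

peaked = [0.9, 0.05, 0.03, 0.015, 0.005]    # made-up peaked distribution
uniform = np.full(200, 1 / 200)             # made-up flat distribution

print(eta_mask(peaked))                     # the 0.005 token is dropped
print(int(eta_mask(uniform).sum()))         # all 200 tokens are kept
```

A fixed cutoff of 0.01 would remove every token from the flat distribution, while the entropy term lowers the threshold enough to keep them all.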

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.