Speculative decoding is an inference acceleration technique where a small, fast draft model proposes a sequence of tokens that are then verified in parallel by a larger, more accurate target model, reducing the number of slow autoregressive steps. This method leverages the observation that many token sequences are predictable, allowing the draft model to 'guess ahead' while the target model efficiently accepts or rejects these proposals in a single, batched forward pass, significantly improving time per output token (TPOT).
What is Speculative Decoding?
Speculative decoding is an advanced inference acceleration technique designed to reduce the latency of large language models by minimizing the number of slow, sequential autoregressive steps.
The process hinges on a draft-then-verify loop. The draft model rapidly generates several candidate tokens. The target model then processes this entire block in one parallel forward pass with ordinary causal attention, and a modified sampling rule scores each proposal against the target's own distribution. Accepted tokens are kept; at the first rejection the target supplies a replacement token, and drafting resumes from there. This technique, distinct from continuous batching, directly attacks the sequential bottleneck of decoding latency, with reported speedups of roughly 2-3x without altering the output distribution of the target model.
Key Components of Speculative Decoding
Speculative decoding accelerates inference by using a small, fast draft model to propose token sequences, which are then verified in parallel by a larger, more accurate target model. This reduces the number of slow, sequential autoregressive steps.
Draft Model
A small, computationally inexpensive language model (e.g., a distilled version of the target) that runs autoregressively to propose a sequence of candidate tokens (the 'draft'). Its speed is critical, as its purpose is to generate multiple tokens with minimal latency to keep the target model's verification step saturated.
- Role: Fast, approximate token proposal.
- Characteristics: Typically 10-100x smaller than the target model.
- Output: A sequence of γ (gamma) speculative tokens.
Target Model
The primary, large, and accurate model (e.g., GPT-4, Llama 3) that performs the parallel verification of the draft model's proposed tokens. Instead of generating tokens one-by-one, it evaluates the entire draft sequence in a single, non-autoregressive forward pass.
- Role: Authority and verification.
- Key Operation: Computes logits for all draft positions simultaneously.
- Benefit: Replaces multiple slow autoregressive steps with one larger, but parallelizable, computation.
Parallel Verification & Acceptance
The core algorithmic step where the target model processes the draft tokens in parallel. It uses a modified sampling algorithm (often based on token probability distributions) to determine the longest correct prefix of the draft.
- Process: The target model's output distribution at each position is compared to the draft token.
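- Acceptance Rule: A draft token x is accepted with probability min(1, p(x)/q(x)), where p and q are the target and draft probabilities at that position; the first rejected token is replaced by a sample from the renormalized residual distribution max(0, p - q), and the remaining draft tokens are discarded.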
- Result: A variable number of accepted tokens (often 3-5) from a single target model forward pass.
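The acceptance rule can be sketched in a few lines of plain Python. This is a minimal illustration rather than any engine's implementation: the names `verify_draft` and `sample` are hypothetical, and distributions are represented as small dictionaries instead of logit tensors.

```python
import random

def sample(dist, rng):
    """Draw one token from a dict of token -> (possibly unnormalized) weight."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens])[0]

def verify_draft(draft_tokens, q_dists, p_dists, rng=random):
    """Return the tokens emitted by one verification step.

    draft_tokens: gamma token ids proposed by the draft model.
    q_dists: gamma draft distributions (dict token -> prob), one per position.
    p_dists: gamma + 1 target distributions; the last follows the full draft.
    """
    out = []
    for tok, q, p in zip(draft_tokens, q_dists, p_dists):
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q).
        # random.choices normalizes the weights for us.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        out.append(sample(residual, rng))
        return out  # everything after the first rejection is discarded
    # Every draft token was accepted: take one bonus token from the target.
    out.append(sample(p_dists[-1], rng))
    return out
```

When the draft and target distributions agree exactly, every draft token is accepted and the call returns gamma + 1 tokens; when the target assigns zero probability to a draft token, that token is always rejected and replaced.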
Key-Value (KV) Cache Management
Efficient management of the attention Key-Value cache is essential for performance. Both models maintain separate caches, but their states must be aligned after the verification step.
- Draft Model Cache: Built autoregressively during draft generation.
- Target Model Cache: Populated during the parallel verification pass for all accepted tokens.
- Synchronization: After acceptance, the target model's KV cache for the accepted prefix is used as the starting state for the next iteration, ensuring consistency.
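The synchronization step amounts to truncating both caches to the accepted prefix. The sketch below uses a hypothetical `ToyKVCache` that stores one placeholder entry per position; real caches hold per-layer key and value tensors, and the exact bookkeeping differs between engines.

```python
from dataclasses import dataclass, field

@dataclass
class ToyKVCache:
    """Stand-in for a per-sequence KV cache: one entry per cached position."""
    entries: list = field(default_factory=list)

    def append(self, token_id):
        # Placeholder for the real key/value tensors of this position.
        self.entries.append(("kv", token_id))

    def truncate(self, length):
        # Drop every cached position at index >= length.
        del self.entries[length:]

def rollback(draft_cache, target_cache, prefix_len, n_accepted):
    """Keep only the prefix plus the accepted draft tokens in both caches."""
    keep = prefix_len + n_accepted
    draft_cache.truncate(keep)
    target_cache.truncate(keep)
    return keep
```

The replacement (or bonus) token produced by the target has no cache entry yet in this sketch; it would be fed as input at the start of the next iteration.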
Speedup & Efficiency Metrics
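Performance is measured by the wall-clock speedup and the acceptance rate. A single verification pass can emit at most γ + 1 tokens (γ accepted draft tokens plus one token sampled by the target), so γ + 1 is the theoretical ceiling on speedup; real-world gains are lower and depend on the draft model's cost, its quality, and the token distributions.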
- Wall-Time Speedup: Measured as (time_standard_decoding / time_speculative_decoding).
- Acceptance Rate: The fraction of proposed draft tokens that the target model accepts. A low rate diminishes benefits.
- Optimal Operating Point: Balances draft model size/speed against its alignment with the target model to maximize accepted tokens per verification step.
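A rough way to reason about the operating point is the standard back-of-the-envelope model below. It rests on simplifying assumptions that are not stated in the text above: each draft token is accepted independently with probability `alpha`, a verification pass costs the same as one ordinary target step, and `cost_ratio` is the time of one draft step divided by the time of one target step. The function names are hypothetical.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per verification pass under i.i.d. acceptance."""
    if alpha >= 1.0:
        return gamma + 1.0
    # Geometric series 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def estimated_speedup(alpha, gamma, cost_ratio):
    """Tokens per step divided by the step's cost in units of target steps."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)

def best_gamma(alpha, cost_ratio, max_gamma=16):
    """Draft length in 1..max_gamma with the highest estimated speedup."""
    return max(range(1, max_gamma + 1),
               key=lambda g: estimated_speedup(alpha, g, cost_ratio))
```

With `alpha = 1` and a free draft model the estimate reaches the γ + 1 ceiling; with `alpha = 0` it falls below 1, meaning speculation is slower than plain decoding.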
System Architecture & Scheduling
The serving system must coordinate the execution of two models with different computational profiles. This involves specialized scheduling to minimize idle time and manage memory for two sets of model weights and KV caches.
- Pipeline: Draft generation → Parallel verification → Token output/rollback → Cache update.
- Challenges: Avoiding GPU idle time between model runs and managing the memory overhead of two loaded models.
- Implementation: Often built into high-performance inference engines like vLLM or TGI, which handle the low-level orchestration.
Speculative Decoding vs. Other Inference Optimizations
A comparison of speculative decoding against other prominent methods for reducing inference latency and cost in large language model serving.
| Optimization Feature | Speculative Decoding | Model Quantization | Continuous Batching | Operator Fusion / Kernel Optimization |
|---|---|---|---|---|
| Primary Goal | Reduce autoregressive steps for the large target model | Reduce compute & memory per operation | Increase GPU utilization & throughput | Reduce kernel launch overhead & memory traffic |
| Core Mechanism | Small draft model proposes tokens; large target model verifies in parallel | Lower numerical precision (e.g., FP32 -> INT8/FP16) for weights/activations | Dynamically batch requests as others finish | Fuse sequential neural network ops into a single GPU kernel |
| Latency Reduction Target | Time Per Output Token (TPOT) | Prefilling & Decoding Latency | End-to-End Latency under load | Prefilling & Decoding Latency |
| Throughput Impact | Workload-dependent (raises tokens/sec per request; gains shrink at large batch sizes) | High | Very High (primary throughput technique) | Moderate |
| Hardware Requirements | None specific; benefits from fast draft model | Requires GPU support for low-precision math (e.g., Tensor Cores) | None specific | Compiler-dependent (e.g., TensorRT, ONNX Runtime) |
| Model Modification Required | Target model unchanged, but a separate, aligned draft model is required | Yes (requires quantization-aware training or post-training calibration) | No (serving system feature) | No (compiler feature applied to model graph) |
| Quality/Accuracy Trade-off | None (verification preserves the target model's output distribution) | Potential minor degradation (quantization noise) | None | None (numerically equivalent) |
| Best Suited For | Reducing cost/latency of very large models (e.g., 70B+ parameters) | General acceleration for deployed models; edge deployment | High-throughput serving scenarios with variable request lengths | Extracting maximum performance from a fixed model on specific hardware |
Implementation and Deployment Considerations
Speculative decoding introduces unique system design trade-offs. Successful deployment requires careful consideration of model pairing, resource allocation, and performance monitoring to achieve reliable latency reduction.
Draft & Target Model Selection
The core engineering choice is pairing a draft model with a target model. The draft must be significantly faster but share a similar token distribution.
- Common Pairings: Use a smaller version of the same model family (e.g., Llama 3 8B drafting for Llama 3 70B) or a heavily distilled model.
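- Acceptance Rate: The fraction of draft tokens that the target model accepts largely determines the speedup. A low rate negates benefits. Well-matched pairs are often reported to reach acceptance rates of roughly 70-85%, but the figure varies with the workload and should be measured.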
- Verification Forward Pass: The target model performs a single, batched forward pass over the proposed token sequence, which must be faster than generating those tokens autoregressively.
Memory & Compute Trade-offs
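Speculative decoding trades extra compute and memory for lower latency: standard decoding is usually memory-bandwidth-bound, so otherwise idle arithmetic capacity is spent verifying several tokens per pass over the target model's weights.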
- KV Cache Management: Both models maintain separate Key-Value caches. The draft model's cache is computed during proposal, and the target's cache is updated during verification. Engines like vLLM with PagedAttention help keep memory use efficient.
- Peak Memory: Hosting two models increases peak GPU memory requirements. The draft model's size is an added overhead.
- Compute Pattern: The technique shifts workload from many small, sequential decoding steps (high latency) to fewer, larger batched verification steps (higher throughput), better utilizing GPU parallel compute.
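The memory overhead of hosting a second model can be estimated with simple arithmetic. In the sketch below, the parameter counts, layer counts, and head dimensions are illustrative placeholders rather than figures from any specific model card, and the helper names are hypothetical.

```python
def weights_bytes(n_params, bytes_per_param=2):
    """Weight memory, assuming a uniform precision (2 bytes = FP16/BF16)."""
    return n_params * bytes_per_param

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes for one token; the factor of 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Illustrative configurations (assumed values, for arithmetic only).
target = {"params": 70_000_000_000, "layers": 80, "kv_heads": 8, "head_dim": 128}
draft = {"params": 8_000_000_000, "layers": 32, "kv_heads": 8, "head_dim": 128}

def total_bytes(cfg, cached_tokens):
    """Weights plus KV cache for the given number of cached tokens."""
    kv = kv_bytes_per_token(cfg["layers"], cfg["kv_heads"], cfg["head_dim"])
    return weights_bytes(cfg["params"]) + kv * cached_tokens

# Extra memory the draft model adds, relative to the target alone.
overhead = total_bytes(draft, 8192) / total_bytes(target, 8192)
```

Under these assumed numbers the draft adds a little over a tenth of the target's footprint; the ratio changes with precision, context length, and batch size.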
Integration with Serving Systems
Deployment requires integration into high-performance inference servers.
- Serving Engine Support: Native implementations exist in vLLM and TensorRT-LLM. Custom integration requires modifying the server's scheduling and batching logic.
- Continuous Batching Compatibility: Must work with continuous batching to maintain high throughput. The system must dynamically form batches for the verification pass from multiple, variable-length draft sequences.
- Request Lifecycle: The server must manage the two-phase execution (draft then verify) transparently, handling early rejection and correct token streaming back to the client.
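One simple way to form a verification batch from drafts of different lengths is to pad them to a common width and carry a mask. This is only a sketch of the idea with a hypothetical `build_verification_batch` helper; production engines may use other layouts, such as flattened token buffers.

```python
def build_verification_batch(drafts, pad_id=0):
    """Pad variable-length drafts into a rectangular batch.

    drafts: dict mapping request id -> list of draft token ids.
    Returns (request order, padded token rows, mask rows), where a mask
    value of 1 marks a real draft token and 0 marks padding.
    """
    order = sorted(drafts)
    width = max(len(drafts[r]) for r in order)
    tokens, mask = [], []
    for r in order:
        pad = width - len(drafts[r])
        tokens.append(list(drafts[r]) + [pad_id] * pad)
        mask.append([1] * len(drafts[r]) + [0] * pad)
    return order, tokens, mask
```

For example, a four-token draft and a two-token draft yield two rows of width four, with the second row's last two positions masked out.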
Monitoring & Performance Validation
Latency gains are not guaranteed and must be rigorously measured in production.
- Key Metrics: Monitor Time to First Token (TTFT), Time Per Output Token (TPOT), and draft acceptance rate. Compare against a baseline without speculation.
- Tail Latency Impact: Evaluate P99 latency. Poor draft quality can cause volatile verification times, increasing tail latency.
- Canary Analysis: Roll out speculative decoding to a small traffic segment first to validate latency improvements and confirm no degradation in output quality before full rollout.
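The metrics above can be computed from per-token timestamps and per-step acceptance counts. The helpers below are a minimal sketch with hypothetical names; they assume timestamps in seconds and use a nearest-rank percentile, which is one of several common definitions.

```python
import math

def ttft(request_time, token_times):
    """Time to First Token: delay from request arrival to the first token."""
    return token_times[0] - request_time

def tpot(token_times):
    """Time Per Output Token: mean gap between tokens after the first."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)

def acceptance_rate(accepted_per_step, gamma):
    """Fraction of proposed draft tokens accepted across verification steps."""
    proposed = gamma * len(accepted_per_step)
    return sum(accepted_per_step) / proposed if proposed else 0.0

def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=99 for P99 latency."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Running the same helpers on a baseline without speculation gives the comparison the first bullet calls for.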
Failure Modes & Fallback Logic
Systems must handle scenarios where speculative decoding provides no benefit or fails.
- Low Acceptance Sequences: For prompts where the draft model performs poorly (e.g., highly technical content), the system should have a fallback to standard autoregressive decoding to avoid slowdowns.
- Dynamic Disabling: Implement logic to dynamically disable speculation per-request or per-session based on real-time acceptance rate metrics.
- Correctness Guarantee: The verification step ensures output distribution is identical to the target model alone, but bugs in the parallel verification logic could introduce errors. Rigorous testing is required.
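Dynamic disabling can be as simple as a rolling acceptance-rate check per request or session. The class below is a sketch under assumed defaults: the name `SpeculationGate`, the window size, and the `min_rate` threshold are all hypothetical and would need tuning against real traffic.

```python
from collections import deque

class SpeculationGate:
    """Tracks recent acceptance and decides whether to keep speculating."""

    def __init__(self, gamma, window=32, min_rate=0.4):
        self.gamma = gamma                   # draft tokens proposed per step
        self.min_rate = min_rate             # assumed threshold, tune per workload
        self.history = deque(maxlen=window)  # accepted counts of recent steps

    def record(self, n_accepted):
        """Store how many draft tokens the latest verification step accepted."""
        self.history.append(n_accepted)

    def rate(self):
        """Rolling acceptance rate; optimistic (1.0) until data arrives."""
        if not self.history:
            return 1.0
        return sum(self.history) / (self.gamma * len(self.history))

    def enabled(self):
        """Stay enabled until a full window shows a rate below the threshold."""
        window_full = len(self.history) == self.history.maxlen
        return not window_full or self.rate() >= self.min_rate
```

Once the gate closes, no new acceptance data arrives, so a real system would also need some way to re-probe, for example by speculating again after a fixed number of tokens.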
Hardware & Precision Optimization
Maximizing speedup involves optimizing both models for the deployment hardware.
- Quantization: Apply model quantization (e.g., to FP8 or INT8) to both draft and target models to reduce memory bandwidth pressure and increase verification speed.
- Kernel Fusion: Use compilers like TensorRT to create optimized model execution graphs with operator fusion for both models, minimizing GPU kernel launch overhead.
- Hardware Targeting: The technique benefits most when standard decoding is memory-bandwidth-bound, because verification then checks several tokens for roughly the cost of one pass over the weights. Profile to identify whether the bottleneck is compute or memory.
Frequently Asked Questions
Speculative decoding is a cutting-edge inference acceleration technique that dramatically reduces the latency of large language models by leveraging a smaller, faster model to 'draft' tokens for verification by the primary model. This section addresses common technical questions about its implementation, benefits, and trade-offs.
Speculative decoding is an inference acceleration technique where a small, fast 'draft' model proposes a sequence of tokens that are then verified in parallel by a larger, more accurate 'target' model, reducing the number of slow autoregressive steps.
The process works in three phases:
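- Drafting: The small draft model runs autoregressively for k steps (k is the draft length, written γ above) to generate a candidate sequence of tokens.
- Verification: The target model processes the entire draft sequence in a single, parallel forward pass. It outputs probability distributions for each position.
- Acceptance: Starting from the first token, the system checks each draft token against the target model's distribution at that position (an exact match under greedy decoding, a probabilistic accept/reject test under sampling). Consecutive tokens are accepted until the first rejection, where the target supplies a replacement token; if every draft token is accepted, the target contributes one extra token. The next round of drafting starts from the position after that token.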
This method is effective because decoding is typically memory-bandwidth-bound: the verification forward pass takes roughly as long as generating a single token in the standard autoregressive method, yet it can validate multiple tokens at once, leading to significant speedups, often cited as 2-3x, without altering the final output distribution.
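The three phases can be tied together in a toy end-to-end loop. Everything here is an assumption made for illustration: the four-token vocabulary, the hand-written `target_dist` and `draft_dist` functions standing in for real models, and the function names. A real engine would obtain the k + 1 target distributions from one batched forward pass, whereas the toy computes them in a loop.

```python
import random

VOCAB = [0, 1, 2, 3]

def target_dist(context):
    """Toy target model: strongly prefers (last token + 1) mod 4."""
    nxt = (context[-1] + 1) % 4
    return [0.85 if t == nxt else 0.05 for t in VOCAB]

def draft_dist(context):
    """Toy draft model: same preference as the target, but less confident."""
    nxt = (context[-1] + 1) % 4
    return [0.70 if t == nxt else 0.10 for t in VOCAB]

def sample(dist, rng):
    """Draw one token id from a list of (possibly unnormalized) weights."""
    return rng.choices(VOCAB, weights=dist)[0]

def speculative_step(context, k, rng):
    # Phase 1 (drafting): the draft model proposes k tokens autoregressively.
    drafts, q, ctx = [], [], list(context)
    for _ in range(k):
        dist = draft_dist(ctx)
        tok = sample(dist, rng)
        drafts.append(tok)
        q.append(dist)
        ctx.append(tok)
    # Phase 2 (verification): target distributions at all k + 1 positions.
    p = [target_dist(list(context) + drafts[:i]) for i in range(k + 1)]
    # Phase 3 (acceptance): accept a prefix, then add one target token.
    out = []
    for i, tok in enumerate(drafts):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            out.append(tok)
        else:
            residual = [max(0.0, pi - qi) for pi, qi in zip(p[i], q[i])]
            out.append(sample(residual, rng))
            return out
    out.append(sample(p[k], rng))
    return out

def generate(prompt, n_tokens, k=4, seed=0):
    """Generate n_tokens and report how many verification steps were needed."""
    rng = random.Random(seed)
    tokens, steps = list(prompt), 0
    while len(tokens) - len(prompt) < n_tokens:
        tokens.extend(speculative_step(tokens, k, rng))
        steps += 1
    return tokens[len(prompt):len(prompt) + n_tokens], steps
```

Calling `generate([0], 20)` returns 20 tokens together with the number of verification steps used, which is always between 4 and 20 for k = 4 because each step emits between one and five tokens.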
Related Terms
Speculative decoding is part of a broader ecosystem of techniques and metrics for accelerating model inference. These related concepts define the performance landscape and complementary optimization strategies.
Autoregressive Decoding
The standard sequential generation process that speculative decoding aims to accelerate. In autoregressive decoding, a language model generates one token at a time, with each new token conditioned on all previously generated tokens. This creates a sequential dependency where the computation for step N cannot begin until step N-1 is complete, leading to high latency, especially for long sequences. Speculative decoding breaks this bottleneck by using a draft model to propose multiple tokens in advance for parallel verification.
PagedAttention
A memory management algorithm critical for efficiently serving the variable-length sequences involved in speculative decoding. PagedAttention treats the Key-Value (KV) cache—the memory storing previous tokens' states for the attention mechanism—like virtual memory. It divides the cache into fixed-size blocks that can be non-contiguously stored and managed, similar to pages in an operating system. This is implemented in engines like vLLM and minimizes memory fragmentation and waste when verifying speculative token sequences of unpredictable lengths, allowing for higher throughput and better GPU utilization.
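The block-based idea can be illustrated with a toy block table. This is not vLLM's API: `BlockTable`, its methods, and the shared free list are hypothetical, and the sketch tracks only block ids rather than actual key/value storage.

```python
class BlockTable:
    """Maps one sequence's cached tokens onto fixed-size blocks from a pool."""

    def __init__(self, free_blocks, block_size=16):
        self.free = free_blocks       # shared pool of free block ids (a list)
        self.block_size = block_size  # tokens per block
        self.blocks = []              # block ids owned by this sequence
        self.length = 0               # number of cached tokens

    def append_tokens(self, n):
        """Reserve room for n more tokens, taking blocks only when needed."""
        self.length += n
        while len(self.blocks) * self.block_size < self.length:
            # pop() raises IndexError if the pool is exhausted.
            self.blocks.append(self.free.pop())

    def truncate(self, new_length):
        """Shrink to new_length tokens and return unused blocks to the pool."""
        self.length = new_length
        needed = -(-new_length // self.block_size)  # ceiling division
        while len(self.blocks) > needed:
            self.free.append(self.blocks.pop())
```

Appending draft tokens may claim a new block, and truncating after a rejection hands it straight back to the pool, so rejected speculation does not strand memory.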
Continuous Batching
A complementary throughput optimization often used alongside speculative decoding. Continuous batching (or dynamic batching) dynamically groups multiple inference requests into a single computational batch on the GPU. Unlike static batching, it adds new requests to the batch as others finish, maximizing hardware utilization. This improves overall system throughput, which is essential for making efficient use of the parallel verification stage in speculative decoding. It addresses queuing delays, while speculative decoding targets the fundamental latency of the autoregressive loop itself.
Time Per Output Token (TPOT)
The core latency metric that speculative decoding directly improves. Time Per Output Token (TPOT) is the average time required to generate each token in the output sequence after the first. In standard autoregressive decoding, TPOT is largely constant. Speculative decoding reduces the effective TPOT by verifying multiple candidate tokens in a single, parallel model call. For example, if a draft model proposes 5 tokens and the target model accepts 4, the system emits 5 tokens (the 4 accepted plus one token sampled by the target at the rejected position) for the cost of five fast draft steps and one verification pass, significantly lowering the average TPOT.
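The arithmetic behind that example can be made explicit. The timings below are hypothetical round numbers chosen for illustration, not measurements, and `effective_tpot_ms` is an assumed helper name.

```python
def effective_tpot_ms(k, t_draft_ms, t_verify_ms, tokens_emitted):
    """Average time per emitted token for one draft-and-verify step."""
    return (k * t_draft_ms + t_verify_ms) / tokens_emitted

# Assumed timings: 2 ms per draft step, 30 ms per target pass.
baseline_tpot = 30.0  # one target pass per token without speculation
spec_tpot = effective_tpot_ms(k=5, t_draft_ms=2.0, t_verify_ms=30.0,
                              tokens_emitted=5)  # 4 accepted + 1 from target
```

Under these assumptions the step costs 40 ms and yields 5 tokens, so the effective TPOT is 8 ms against a 30 ms baseline.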
Draft Model
The smaller, faster auxiliary model at the heart of speculative decoding. The draft model (or proposal model) is responsible for rapidly generating a sequence of candidate tokens. Key characteristics include:
- Smaller Architecture: Often a distilled or pruned version of the target model.
- Speed Focus: Optimized for low-latency, sequential generation.
- Alignment: Trained to approximate the output distribution of the larger target model to maximize acceptance rate.
The draft model's accuracy-speed trade-off is critical; a too-slow draft negates benefits, while an inaccurate draft leads to low acceptance rates and wasted computation.
Target Model
The primary, accurate model whose outputs are being accelerated. The target model is the large, capable model (e.g., a 70B parameter LLM) that performs the parallel verification step in speculative decoding. It runs a single forward pass on the concatenated input prompt and the draft model's proposed token sequence. This pass produces logits for each proposed position, which are used to accept tokens that match the draft's predictions or reject and resample from the corrected distribution. The target model's parameters are never modified; the technique is a pure inference-time optimization.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.