Decoding latency is the time consumed during the autoregressive token generation phase of inference, where each new output token is produced conditioned on all previously generated tokens. This phase is computationally sequential and often dominates the total inference latency for long outputs. It is primarily driven by the iterative execution of the model's decoder layers to compute logits and sample the next token, with performance heavily influenced by Key-Value (KV) cache management and memory bandwidth.
Decoding Latency

What is Decoding Latency?
Decoding latency is the critical time component during the token-by-token generation phase of an autoregressive language model's inference.
Key factors affecting decoding latency include model size, sequence length, and hardware efficiency. Optimizations like continuous batching, PagedAttention, and speculative decoding directly target reducing this latency. For streaming applications, decoding latency directly determines the Time Per Output Token (TPOT), impacting user-perceived responsiveness. It is a primary metric for evaluating the efficiency of inference-serving engines like vLLM and TensorRT-LLM.
Key Components of Decoding Latency
Decoding latency is not a monolithic metric but the sum of several distinct, measurable sub-processes. Understanding these components is essential for systematic profiling and optimization. Prefilling latency is listed first for context: strictly speaking it precedes decoding, but it builds the KV cache that every decode step reads.
Prefilling Latency
The time required for the model to perform a single, full forward pass on the static input prompt and context. This phase generates the initial Key-Value (KV) cache for the attention mechanism, which is then used during autoregressive generation. It is a one-time, upfront cost before the first output token is produced. Factors influencing prefilling latency:
- Length of the input context/prompt.
- Model size (parameter count).
- Hardware compute and memory bandwidth.
- Batch size of concurrent requests.
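As a rough illustration of how prompt length, model size, and hardware compute combine, the sketch below estimates prefill time under a compute-bound assumption. The function name, the `utilization` parameter, and the rule of thumb of about 2 FLOPs per parameter per token (which ignores attention-score FLOPs) are assumptions for illustration, not measurements from any particular engine.

```python
def estimate_prefill_seconds(num_params: float, prompt_tokens: int,
                             peak_flops: float, utilization: float = 0.5) -> float:
    """Rough compute-bound estimate of prefill time for a single prompt.

    Assumes a dense decoder costs about 2 FLOPs per parameter per token and
    that the hardware sustains `utilization` of its peak FLOP/s (hypothetical).
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    total_flops = 2.0 * num_params * prompt_tokens
    return total_flops / (peak_flops * utilization)


# Example: 7B parameters, 2048-token prompt, 300 TFLOP/s peak at 50% utilization.
prefill_s = estimate_prefill_seconds(7e9, 2048, 300e12, utilization=0.5)
print(f"estimated prefill: {prefill_s * 1000:.0f} ms")
```

Under these assumptions the estimate scales linearly with prompt length, which is why long contexts raise the time to the first token.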
Time Per Output Token (TPOT)
The average latency incurred for generating each subsequent token after the first in an autoregressive sequence. This is the core iterative cost of decoding. TPOT is primarily driven by the small, sequential forward pass needed to produce the next token, which reads from and updates the KV cache. Key determinants of TPOT:
- Model architecture and per-token FLOPs.
- Efficiency of KV cache memory access (bandwidth-bound).
- GPU kernel launch overhead for small operations.
- Continuous batching efficiency, which can amortize overhead across requests.
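To make the bandwidth-bound nature of a decode step concrete, here is a minimal sketch of a lower bound on TPOT. It assumes each step streams all model weights once (shared across the batch) plus each sequence's KV cache from GPU memory, and ignores compute, kernel launch, and scheduling overhead. The function name and the example figures are hypothetical.

```python
def tpot_lower_bound_seconds(weight_bytes: float, kv_cache_bytes_per_seq: float,
                             mem_bandwidth_bytes_per_s: float,
                             batch_size: int = 1) -> float:
    """Bandwidth-only floor on the time of one decode step.

    Weights are read once per step regardless of batch size; the KV cache is
    read once per sequence in the batch.
    """
    bytes_moved = weight_bytes + batch_size * kv_cache_bytes_per_seq
    return bytes_moved / mem_bandwidth_bytes_per_s


# Example: 14 GB of fp16 weights, 1 GB of KV cache per sequence, 2 TB/s bandwidth.
single = tpot_lower_bound_seconds(14e9, 1e9, 2e12, batch_size=1)
batched = tpot_lower_bound_seconds(14e9, 1e9, 2e12, batch_size=8)
print(f"batch 1: {single * 1e3:.1f} ms/step, {1 / single:.0f} tokens/s total")
print(f"batch 8: {batched * 1e3:.1f} ms/step, {8 / batched:.0f} tokens/s total")
```

In this toy model, batching raises the per-step time only modestly while multiplying the tokens produced per step, which is the amortization effect the last bullet describes.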
Key-Value Cache Management
The latency overhead associated with storing and retrieving the attention key and value tensors for all previous tokens. This cache grows linearly with sequence length and is critical for avoiding recomputation. Inefficient management is a major source of decoding slowdown. Critical aspects include:
- Memory allocation and fragmentation for variable-length sequences.
- Memory bandwidth saturation: the KV cache quickly outgrows the GPU's on-chip SRAM and L2 cache, so it must be streamed from high-bandwidth memory (HBM) on every decode step.
- PagedAttention (as in vLLM), which virtualizes the KV cache to nearly eliminate fragmentation and waste, significantly improving throughput and latency under high concurrency.
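The linear growth of the cache can be sketched with the standard size formula for a decoder-only transformer: one key tensor and one value tensor per layer, per token. The function name and the example configuration below are illustrative assumptions, not the settings of a specific model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2,
                   batch_size: int = 1) -> int:
    """Bytes needed to hold keys and values for `seq_len` tokens.

    The leading factor of 2 accounts for storing both K and V in every layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical configuration: 32 layers, 32 KV heads of dimension 128, fp16.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 1024**3:.1f} GiB for one 4096-token sequence")
```

Reducing `num_kv_heads` (as in grouped-query attention) or `bytes_per_elem` (cache quantization) shrinks the cache proportionally in this formula.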
Scheduling & Continuous Batching
The latency introduced or saved by the inference scheduler's strategy for grouping and executing requests. Continuous batching (or in-flight batching) dynamically adds new requests to a running batch as others finish, maximizing GPU utilization. Scheduling impacts include:
- Request queuing delay: Time a request waits before execution begins.
- GPU utilization vs. latency trade-off: Larger batches increase throughput but can raise TPOT for individual requests.
- Memory contention from multiple concurrent sequences sharing GPU resources.
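The difference between static and continuous batching can be illustrated with a toy step counter. This is a simplified sketch, not how any particular engine schedules work: it assumes every decode step costs the same, ignores prefill, and treats each request as nothing more than a number of output tokens. Both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens, max_batch):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), max_batch):
        steps += max(output_lens[i:i + max_batch])
    return steps


def continuous_batching_steps(output_lens, max_batch):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(output_lens)
    running = []  # remaining tokens for each active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step emits one token for every active sequence,
        # then finished sequences leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lens = [2, 8, 2, 8]  # output lengths of four queued requests
print(static_batching_steps(lens, max_batch=2))      # 16
print(continuous_batching_steps(lens, max_batch=2))  # 12
```

In this example the short requests no longer hold a slot idle while a long neighbor finishes, so the same work completes in fewer steps.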
Model Execution & Kernel Overhead
The latency from the low-level execution of neural network operations on the GPU. This involves the launch and execution of many small computational kernels. Optimizations here directly reduce TPOT:
- Operator Fusion: Combining multiple sequential ops (e.g., Linear, Bias, Activation) into a single kernel to reduce memory accesses and launch overhead.
- Kernel Auto-Tuning: Selecting the most efficient GPU kernel implementation for specific input sizes and hardware.
- Use of optimized model execution graphs from compilers and runtimes like TensorRT or ONNX Runtime, which apply these optimizations ahead of time.
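As a back-of-the-envelope view of why fusion helps, the sketch below compares a chain of elementwise operations (for example a bias add followed by an activation) run as separate kernels versus one fused kernel. It assumes every kernel reads its input from and writes its output to GPU memory once; the function name and the 5 microsecond launch latency are placeholder assumptions, not measured values.

```python
def elementwise_chain_cost(num_ops: int, tensor_bytes: int,
                           launch_latency_s: float, fused: bool):
    """Return (launch overhead in seconds, bytes moved) for an elementwise chain.

    Unfused: one kernel per op, each reading and writing the full tensor.
    Fused: a single kernel that reads the input once and writes the result once.
    """
    kernels = 1 if fused else num_ops
    launch_overhead = kernels * launch_latency_s
    bytes_moved = kernels * 2 * tensor_bytes
    return launch_overhead, bytes_moved


unfused = elementwise_chain_cost(2, 1_000_000, 5e-6, fused=False)
fused = elementwise_chain_cost(2, 1_000_000, 5e-6, fused=True)
print("unfused:", unfused)
print("fused:  ", fused)
```

Both the launch overhead and the memory traffic fall in proportion to the number of kernels removed, and the saving repeats in every layer on every decode step.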
System & Framework Overhead
The ancillary latency not from the model's computation itself, but from the serving framework and system stack. This can become a significant portion of total latency, especially for short sequences or high QPS. Components include:
- GPU-CPU Synchronization: Overhead from device-host memory transfers and synchronization points.
- Python GIL Contention: In Python-based servers, the Global Interpreter Lock can block concurrent request handling.
- gRPC/HTTP Latency: Network stack overhead for remote procedure calls, including serialization/deserialization of protocol buffers.
- Token Sampling Logic: The computational cost of applying top-p, top-k, or temperature scaling to logits.
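The sampling step named in the last bullet can be sketched in plain Python. This is a minimal illustration of temperature scaling, top-k, and top-p (nucleus) filtering, assuming logits arrive as a list of floats; production engines run this on the GPU, and the exact order and renormalization of the filters varies between implementations. The function name and defaults are hypothetical.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Sample a token id from logits with temperature, top-k, and top-p."""
    if rng is None:
        rng = random.Random()
    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding (an assumption).
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate ids in descending probability; top-k keeps the first k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Top-p keeps the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices renormalizes the surviving weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


token = sample_next_token([1.0, 3.0, 2.0], temperature=0.8, top_k=2, top_p=0.95,
                          rng=random.Random(0))
print(token)
```

The sort over the vocabulary is the costly part here, which is why this step is measurable for large vocabularies when it runs on the CPU.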
Decoding Latency
Decoding latency, also called token generation latency, is often the dominant component of total inference time for large language models (LLMs), particularly for long outputs. It measures the sequential delay as a model generates its output one token at a time, with each step dependent on the full history stored in the Key-Value (KV) cache. Each step performs little computation relative to the weights and cache it must read, so this phase is typically memory-bandwidth bound rather than compute bound, making its optimization critical for real-time applications like chatbots and streaming APIs.
Key factors influencing decoding latency include model size, sequence length, and hardware efficiency. Performance is profiled using metrics like Time Per Output Token (TPOT). Optimization techniques such as continuous batching, PagedAttention for efficient KV cache management, and speculative decoding are employed to reduce this latency, directly impacting user-perceived responsiveness and system throughput.
Primary Optimization Techniques
The following techniques are engineered to directly accelerate the sequential decoding process, reducing time per output token (TPOT) and improving throughput.
Frequently Asked Questions
This FAQ addresses common technical questions about the measurement and optimization of decoding latency and its impact on system performance.
Decoding latency is the cumulative time a language model spends generating output tokens one by one in an autoregressive loop. It is measured from the completion of the prefill phase (once the prompt's KV cache is ready and the first token has been produced) until the final token is produced. Its key metric is Time Per Output Token (TPOT); Time to First Token (TTFT) covers the queuing and prefill time that comes before it, and together the two define the streaming speed of a completion. Profiling tools like the PyTorch Profiler or NVIDIA Nsight Systems are used to isolate decoding latency from other system components like network transfer or request queuing.
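The measurement described above can be sketched from client-side timestamps alone. The helper below is a minimal illustration that assumes you record the request start time and the arrival time of each streamed token (for example with `time.perf_counter()`); the function name and the returned keys are hypothetical. Timestamps taken at the client include network time, so they bound rather than isolate the engine's decoding latency.

```python
def streaming_metrics(request_start: float, token_times: list) -> dict:
    """Derive TTFT, decoding latency, and TPOT from token arrival times.

    `token_times[i]` is the time (in seconds) at which output token i arrived.
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    ttft = token_times[0] - request_start
    # Decoding latency spans from the first token to the last token.
    decode_latency = token_times[-1] - token_times[0]
    later_tokens = len(token_times) - 1
    tpot = decode_latency / later_tokens if later_tokens else 0.0
    return {"ttft": ttft, "decode_latency": decode_latency, "tpot": tpot}


metrics = streaming_metrics(0.0, [0.50, 0.55, 0.60, 0.65])
print({name: round(value, 3) for name, value in metrics.items()})
# {'ttft': 0.5, 'decode_latency': 0.15, 'tpot': 0.05}
```

Total response time for a request is then approximately TTFT plus TPOT multiplied by the number of tokens after the first.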
Related Terms
Decoding latency is a critical component of the overall inference timeline. These related terms define the other key phases, metrics, and optimization techniques that determine the speed and efficiency of AI model serving.
Time to First Token (TTFT)
Time to First Token (TTFT), also known as First Token Latency, is the duration from the start of an inference request to when the first token of the output is generated or delivered to the client. This metric is crucial for perceived responsiveness in streaming applications.
- Primary Driver: The prefilling latency—the forward pass through the model with the input prompt—dominates TTFT.
- Key Consideration: Users perceive a system as "fast" or "slow" based on TTFT, making it a primary target for optimization.
Time Per Output Token (TPOT)
Time Per Output Token (TPOT) is the average latency incurred for generating each subsequent token after the first in an autoregressive model. It directly dictates the speed of streaming completions.
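- Direct Relationship: For a single request, TPOT is the inverse of its generation rate in tokens per second. Aggregate system throughput also depends on batch size, since every decode step emits one token per active sequence.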
- Optimization Target: Techniques like continuous batching, PagedAttention, and speculative decoding aim to minimize TPOT by improving GPU utilization and reducing memory bottlenecks during the decoding phase.
Prefilling Latency
Prefilling latency is the time required for a language model to process the static input prompt and context through its initial forward pass. This phase generates the first Key-Value (KV) cache before autoregressive token generation begins.
- Impact on TTFT: Prefilling is the major component of Time to First Token (TTFT).
- Computational Profile: This phase is typically compute-bound, because all prompt tokens are processed in parallel in one pass, making it sensitive to prompt length and GPU FLOPs.
Continuous Batching
Continuous batching, also known as dynamic or in-flight batching, is an inference optimization technique where new requests are dynamically added to a running batch as previous requests finish generation.
- Core Benefit: Maximizes GPU utilization and throughput by eliminating idle time where only part of a batch is active.
- Latency Trade-off: While improving aggregate throughput, it can introduce slight request queuing delay as new requests wait for the next scheduling cycle. Engines like vLLM implement this efficiently.
PagedAttention
PagedAttention is an attention algorithm, introduced with the vLLM serving engine, that manages the Key-Value (KV) cache using virtual memory paging concepts.
- Solves Fragmentation: It stores attention keys and values in fixed-size blocks that need not be contiguous, which removes nearly all of the memory waste caused by variable-length sequences (only the last, partially filled block of a sequence is left with unused slots).
- Direct Impact: This optimization dramatically increases the number of concurrent requests a GPU can handle, thereby improving throughput and reducing decoding latency under load.
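The paging idea can be sketched as a block table that maps each sequence's logical token positions to physical cache blocks. This is a simplified illustration of the concept, not vLLM's actual block manager; the class name, method names, and block size are hypothetical.

```python
class PagedKVAllocator:
    """Toy block-table allocator for a paged KV cache."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (block id, slot index)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new physical block is needed only when the last one is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")
print(alloc.block_tables["a"])  # two blocks, which need not be adjacent
alloc.free("a")
print(len(alloc.free_blocks))   # 4
```

Because blocks are allocated on demand, a sequence never reserves memory for tokens it has not generated, and at most one block per sequence is partially empty.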
Speculative Decoding
Speculative decoding is an inference acceleration technique where a small, fast 'draft' model proposes a sequence of tokens that are then verified in parallel by a larger, more accurate 'target' model.
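- Mechanism: The target model scores all draft tokens in a single forward pass. The longest prefix it agrees with is accepted, so one costly verification step can yield several tokens; at the first rejected position the target model's own token is used instead.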
- Latency Reduction: It effectively reduces the number of slow autoregressive steps the large model must perform, lowering the average Time Per Output Token (TPOT).
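The propose-and-verify loop can be sketched with toy stand-ins for the two models. This is a simplified greedy variant, assuming both models pick their next token deterministically; published speculative decoding schemes use a probabilistic accept/reject rule so that sampled outputs match the target distribution. All names below are hypothetical, and the per-position calls to `target_next` stand in for what a real system does in one batched forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Run one draft-then-verify step and return the newly committed tokens."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2. The target model checks each proposed position.
    accepted = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy models: the target counts upward modulo 10; the draft errs after a 3.
def target_next(seq):
    return (seq[-1] + 1) % 10


def draft_next(seq):
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


print(speculative_step([0], draft_next, target_next, k=4))  # [1, 2, 3, 4]
print(speculative_step([5], draft_next, target_next, k=3))  # [6, 7, 8, 9]
```

In this greedy variant the committed tokens are always what the target model would have produced on its own; the gain comes from committing several of them per target pass when the draft is usually right.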

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.