Glossary

Decoding Latency

Decoding latency is the time consumed during the autoregressive token generation phase of inference, where each new token is produced conditioned on all previously generated tokens.

LATENCY BENCHMARKING

What is Decoding Latency?

Decoding latency is the critical time component during the token-by-token generation phase of an autoregressive language model's inference.

Decoding latency is the time consumed during the autoregressive token generation phase of inference, where each new output token is produced conditioned on all previously generated tokens. This phase is computationally sequential and often dominates the total inference latency for long outputs. It is primarily driven by the iterative execution of the model's decoder layers to compute logits and sample the next token, with performance heavily influenced by Key-Value (KV) cache management and memory bandwidth.

Key factors affecting decoding latency include model size, sequence length, and hardware efficiency. Optimizations like continuous batching, PagedAttention, and speculative decoding directly target reducing this latency. For streaming applications, decoding latency directly determines the Time Per Output Token (TPOT), impacting user-perceived responsiveness. It is a primary metric for evaluating the efficiency of inference-serving engines like vLLM and TensorRT-LLM.
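
The loop described above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `toy_step` is not a real model, and real prefill runs as one parallel pass rather than a Python loop. The point is only to show that each decode step consumes the previous token and appends one entry to the KV cache.

```python
from typing import List, Tuple

# Hypothetical stand-in for one model step: it appends this token's
# (key, value) entry to the cache and returns logits over a tiny vocabulary.
def toy_step(token: int, kv_cache: List[Tuple[int, int]], vocab_size: int = 8) -> List[float]:
    kv_cache.append((token, token * 2))          # fake K and V entries
    context_sum = sum(k for k, _ in kv_cache)    # "attend" over every cached token
    return [1.0 if i == context_sum % vocab_size else 0.0 for i in range(vocab_size)]

def greedy_decode(prompt: List[int], max_new_tokens: int, eos_id: int = 0):
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    kv_cache: List[Tuple[int, int]] = []
    logits: List[float] = []
    # Prefill: process the prompt and build the initial cache.
    for tok in prompt:
        logits = toy_step(tok, kv_cache)
    generated: List[int] = []
    # Decode: strictly sequential, one step per output token.
    while len(generated) < max_new_tokens:
        next_tok = max(range(len(logits)), key=logits.__getitem__)
        generated.append(next_tok)
        if next_tok == eos_id or len(generated) == max_new_tokens:
            break
        logits = toy_step(next_tok, kv_cache)
    return generated, len(kv_cache)
```

The cache length returned at the end grows by one per decode step, which is why per-step memory traffic rises with sequence length.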

Key Components of Decoding Latency

Decoding latency is not a monolithic metric but the sum of several distinct, measurable sub-processes. Prefilling is listed first because it precedes decoding and builds the KV cache that every decode step reads, although it is usually accounted for under Time to First Token rather than under decoding latency itself. Understanding these components is essential for systematic profiling and optimization.

Prefilling Latency

The time required for the model to perform a single, full forward pass on the static input prompt and context. This phase generates the initial Key-Value (KV) cache for the attention mechanism, which is then used during autoregressive generation. It is a one-time, upfront cost before the first output token is produced. Factors influencing prefilling latency:

Length of the input context/prompt.
Model size (parameter count).
Hardware compute and memory bandwidth.
Batch size of concurrent requests.

Time Per Output Token (TPOT)

The average latency incurred for generating each subsequent token after the first in an autoregressive sequence. This is the core iterative cost of decoding. TPOT is primarily driven by the small, sequential forward pass needed to produce the next token, which reads from and updates the KV cache. Key determinants of TPOT:

Model architecture and per-token FLOPs.
Efficiency of KV cache memory access (bandwidth-bound).
GPU kernel launch overhead for small operations.
Continuous batching efficiency, which can amortize overhead across requests.
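
As a minimal sketch of how TPOT is typically derived from a streamed response, the function below takes a request start time and the arrival time of each output token. The function name and the timestamp format are assumptions for illustration, not part of any particular serving engine.

```python
from typing import List, Tuple

def latency_metrics(request_start: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (ttft, tpot) in the same time unit as the inputs."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # Time to first token covers queuing plus prefill.
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    # TPOT averages the gaps between tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot
```

For example, `latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])` gives a TTFT of 0.5 and a TPOT of 0.25.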

Key-Value Cache Management

The latency overhead associated with storing and retrieving the attention key and value tensors for all previous tokens. This cache grows linearly with sequence length and is critical for avoiding recomputation. Inefficient management is a major source of decoding slowdown. Critical aspects include:

Memory allocation and fragmentation for variable-length sequences.
Memory bandwidth saturation once the KV cache no longer fits in on-chip memory (shared memory and L2 cache) and must be streamed from HBM on every decode step.
PagedAttention (as in vLLM), which virtualizes the KV cache to eliminate fragmentation and waste, significantly improving throughput and latency under high concurrency.
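
The linear growth of the cache can be made concrete with a back-of-the-envelope calculation. This is a sketch under stated assumptions: a standard attention layout with one key and one value vector per layer, per KV head, per token. The model dimensions in the usage note are hypothetical.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size; bytes_per_elem=2 corresponds to FP16/BF16."""
    # The leading factor of 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
```

With a hypothetical model of 32 layers, 32 KV heads, and a head dimension of 128 in FP16, this works out to 512 KiB per token, or 2 GiB for a single 4096-token sequence.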

Scheduling & Continuous Batching

The latency introduced or saved by the inference scheduler's strategy for grouping and executing requests. Continuous batching (or in-flight batching) dynamically adds new requests to a running batch as others finish, maximizing GPU utilization. Scheduling impacts include:

Request queuing delay: Time a request waits before execution begins.
GPU utilization vs. latency trade-off: Larger batches increase throughput but can raise TPOT for individual requests.
Memory contention from multiple concurrent sequences sharing GPU resources.

Model Execution & Kernel Overhead

The latency from the low-level execution of neural network operations on the GPU. This involves the launch and execution of many small computational kernels. Optimizations here directly reduce TPOT:

Operator Fusion: Combining multiple sequential ops (e.g., Linear, Bias, Activation) into a single kernel to reduce memory accesses and launch overhead.
Kernel Auto-Tuning: Selecting the most efficient GPU kernel implementation for specific input sizes and hardware.
Use of optimized model execution graphs from compilers like TensorRT or ONNX Runtime, which apply these optimizations statically.

System & Framework Overhead

The ancillary latency not from the model's computation itself, but from the serving framework and system stack. This can become a significant portion of total latency, especially for short sequences or high QPS. Components include:

GPU-CPU Synchronization: Overhead from device-host memory transfers and synchronization points.
Python GIL Contention: In Python-based servers, the Global Interpreter Lock can block concurrent request handling.
gRPC/HTTP Latency: Network stack overhead for remote procedure calls, including serialization/deserialization of protocol buffers.
Token Sampling Logic: The computational cost of applying top-p, top-k, or temperature scaling to logits.
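
To give a sense of the work hidden in the token sampling step, here is a minimal pure-Python sketch of temperature scaling followed by top-k and top-p (nucleus) filtering. The function name and defaults are assumptions; production engines run equivalent logic as GPU kernels over the whole batch.

```python
import math
import random
from typing import List, Optional

def sample_token(logits: List[float], temperature: float = 1.0, top_k: int = 0,
                 top_p: float = 1.0, rng: Optional[random.Random] = None) -> int:
    rng = rng or random.Random()
    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding.
        return max(range(len(logits)), key=logits.__getitem__)
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort token ids by probability, highest first.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for idx in order:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the surviving weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

The sort over the vocabulary is the expensive part, which is why this step shows up in profiles when the vocabulary is large and sequences are short.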

MEASUREMENT AND KEY METRICS

Decoding Latency

Decoding latency, also called token generation latency, is often the dominant component of total inference time for large language models (LLMs), especially when outputs are long. It measures the sequential delay as a model generates its output one token at a time, with each step dependent on the full history stored in the Key-Value (KV) cache. At small batch sizes this phase is typically memory-bandwidth bound rather than compute bound, because every step reads the model weights and the KV cache to produce a single token per sequence. This makes its optimization critical for real-time applications like chatbots and streaming APIs.

Key factors influencing decoding latency include model size, sequence length, and hardware efficiency. Performance is profiled using metrics like Time Per Output Token (TPOT). Optimization techniques such as continuous batching, PagedAttention for efficient KV cache management, and speculative decoding are employed to reduce this latency, directly impacting user-perceived responsiveness and system throughput.
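
A rough lower bound on TPOT follows from the memory-bandwidth argument: each decode step must read the weights once and each sequence's KV cache once. The sketch below encodes that estimate. The function name is an assumption, and the bound ignores compute time, kernel launch overhead, and framework overhead, so measured values will be higher.

```python
def min_tpot_seconds(weight_bytes: float, kv_bytes_per_seq: float,
                     mem_bandwidth_bytes_per_s: float, batch_size: int = 1) -> float:
    """Bandwidth-only lower bound on the time for one decode step."""
    # Weights are read once per step and shared by the whole batch;
    # every sequence in the batch reads its own KV cache.
    bytes_per_step = weight_bytes + batch_size * kv_bytes_per_seq
    return bytes_per_step / mem_bandwidth_bytes_per_s
```

With illustrative numbers (14 GB of FP16 weights and 1 TB/s of memory bandwidth, both hypothetical), the bound is about 14 ms per token for a single sequence with a negligible cache. Because the weight read is shared, adding sequences to the batch raises the step time only by their KV cache traffic.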

DECODING LATENCY

Primary Optimization Techniques

The following techniques target the sequential token generation loop directly, reducing time per output token (TPOT) and improving throughput.

Continuous Batching

Also known as in-flight batching or iteration-level scheduling, this technique adds new inference requests to a running batch as previous requests finish generation. Unlike static batching, where the whole batch waits for its longest request, it keeps freed slots busy, improving GPU utilization and aggregate throughput.

Key Mechanism: The scheduler makes batching decisions at every decoding iteration rather than once per request, so sequences of different lengths can join and leave the batch independently.
Impact: Can increase throughput by 5-10x compared to static batching under realistic, variable-load conditions.
Implementation: Found in serving engines like vLLM and TGI (Text Generation Inference).
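
The difference from static batching can be illustrated with a small simulation that counts decode iterations. This is a simplified sketch: it assumes every request needs at least one output token, ignores prefill cost, and treats each iteration as equally expensive. The function names are hypothetical.

```python
from collections import deque
from typing import List

def static_batch_steps(output_lens: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def continuous_batch_steps(output_lens: List[int], batch_size: int) -> int:
    """Freed slots are refilled from the queue before every iteration."""
    queue = deque(output_lens)
    active: List[int] = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every running request emits one token; finished requests leave.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps
```

For output lengths `[2, 8, 2, 8]` and two batch slots, the static schedule needs 16 iterations while the continuous schedule needs 12, because the short requests release their slots early.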

PagedAttention & KV Cache Optimization

This algorithm manages the Key-Value (KV) cache—the memory storing previous tokens' states for attention—using virtual memory concepts. It solves the problem of memory fragmentation caused by variable-length sequences.

How it works: The KV cache is partitioned into fixed-size blocks. Sequences allocate blocks non-contiguously, akin to pages in an OS.
Benefit: Enables near-optimal GPU memory utilization (over 90%), allowing larger batch sizes and longer context windows without out-of-memory errors.
Origin: Introduced by the vLLM serving engine.
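
The block-table idea can be sketched in a few lines. This is an illustrative toy, not vLLM's implementation: the class name, method names, and free-list policy are all assumptions. It shows only the bookkeeping, where each sequence maps logical token positions to non-contiguous physical blocks.

```python
from typing import Dict, List, Tuple

class PagedKVAllocator:
    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # seq_id -> physical block ids
        self.seq_lens: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> Tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:
            # The last block is full (or none exists yet): grab a new one.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because a sequence only ever wastes the unused tail of its last block, the worst-case waste per sequence is bounded by one block rather than by the maximum sequence length.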

Speculative Decoding

An acceleration technique that uses a small, fast draft model to propose a sequence of candidate tokens. These are then verified in a single, parallel forward pass by the larger, accurate target model.

Process: If the target model accepts the draft tokens, multiple tokens are generated per slow autoregressive step.
Requirement: The draft model must be significantly faster (e.g., a distilled model) to offset the verification cost.
Result: Can reduce decoding latency by 2-3x for compatible model pairs. When the standard acceptance rule is used, the output distribution is the same as decoding with the target model alone.
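
The propose-then-verify process can be sketched for the greedy case, where a draft token is accepted only if it matches the target model's own greedy choice. This is a simplified illustration: the two "models" are hypothetical next-token functions, and sampling-based variants use a probabilistic acceptance rule instead of exact matching.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function over a prefix

def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Draft k tokens, keep the agreed prefix, then add one target token."""
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    accepted: List[int] = []
    ctx = list(prefix)
    # In a real system these target evaluations run as one parallel forward pass.
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all drafts accepted: one bonus token
    return accepted

def speculative_decode(prompt: List[int], draft_next: NextToken,
                       target_next: NextToken, k: int, max_new_tokens: int) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[len(prompt):len(prompt) + max_new_tokens]
```

Every call to `speculative_step` yields at least one token, so the worst case matches ordinary decoding plus the drafting cost, while a well-matched draft model yields up to `k + 1` tokens per target pass.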

Model Quantization

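The process of reducing the numerical precision of a model's weights and activations (e.g., from FP32 to FP16 or INT8). This decreases the model's memory footprint and increases computational speed on supported hardware. Because decoding is largely memory-bandwidth bound, smaller weights also mean less data to read for every generated token.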

Types: Post-Training Quantization (PTQ) applies compression after training; Quantization-Aware Training (QAT) simulates quantization during training for higher accuracy.
Hardware Acceleration: INT8 precision is executed on specialized tensor cores in modern GPUs (e.g., NVIDIA Hopper).
Trade-off: Potential minor degradation in accuracy, which must be evaluated per model and task.
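
A minimal sketch of symmetric per-tensor INT8 quantization shows where the accuracy trade-off comes from. This is one simple scheme chosen for illustration; real toolchains typically use per-channel or per-group scales and calibration data.

```python
from typing import List, Tuple

def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]
```

The round trip loses at most half a quantization step per value, so the error grows with the largest magnitude in the tensor. This is why outlier values are a common source of accuracy loss.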

Operator Fusion & Graph Optimization

A compiler-level optimization where consecutive neural network operations are merged into a single fused kernel. This reduces GPU kernel launch overhead and intermediate memory reads/writes.

Example: Fusing a GeLU activation function into the preceding linear layer kernel.
Tools: Inference compilers like TensorRT, ONNX Runtime, and XLA perform automatic graph optimization and fusion.
Impact: Eliminates overhead from launching dozens of small kernels, streamlining the execution graph for lower latency.

FlashAttention

An IO-aware, exact attention algorithm that computes attention in tiles and never writes the full intermediate attention matrix to high-bandwidth memory (HBM). During training, the backward pass recomputes attention scores on the fly instead of reading them back.

Core Innovation: Uses tiling to keep data in fast SRAM, minimizing slow HBM accesses, which are the primary bottleneck.
Benefit: Dramatically speeds up the attention computation (2-4x) and reduces memory usage, which is critical for long-context decoding.
Extension: FlashAttention-2 further optimizes for modern GPU architectures, improving work partitioning and occupancy.
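
The tiling idea rests on an online softmax that processes keys and values in chunks while keeping a running maximum and running sum. The sketch below shows this for a single query vector in pure Python and compares it with a direct computation. It is an illustration of the arithmetic only, not the GPU kernel, and it omits the usual 1/sqrt(d) score scaling for brevity.

```python
import math
from typing import List

def _dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, keys, values) -> List[float]:
    """Direct softmax attention that materializes every score."""
    scores = [_dot(q, k) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def attention_tiled(q, keys, values, tile: int = 2) -> List[float]:
    """Same result, but only one tile of scores exists at a time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        scores = [_dot(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same vector up to floating-point rounding, which is what "exact" means here: tiling changes the memory access pattern, not the result.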

Frequently Asked Questions

Decoding latency is the time consumed during the autoregressive token generation phase of inference, where each new token is produced conditioned on all previously generated tokens. This FAQ addresses common technical questions about its measurement, optimization, and impact on system performance.

Decoding latency is the cumulative time a language model spends generating output tokens one-by-one in an autoregressive loop. It is measured from the completion of the prefill phase (once the prompt's KV cache is ready) until the final token is produced. Its key metric is Time Per Output Token (TPOT). Time to First Token (TTFT) covers the preceding prefill phase, and the two together define the streaming speed of a completion. Profiling tools like the PyTorch Profiler or NVIDIA Nsight Systems are used to isolate decoding latency from other system components like network transfer or request queuing.

Related Terms

Decoding latency is a critical component of the overall inference timeline. These related terms define the other key phases, metrics, and optimization techniques that determine the speed and efficiency of AI model serving.

Time to First Token (TTFT)

Time to First Token (TTFT), also known as First Token Latency, is the duration from the start of an inference request to when the first token of the output is generated or delivered to the client. This metric is crucial for perceived responsiveness in streaming applications.

Primary Driver: The prefilling latency—the forward pass through the model with the input prompt—dominates TTFT.
Key Consideration: Users perceive a system as "fast" or "slow" based on TTFT, making it a primary target for optimization.

Time Per Output Token (TPOT)

Time Per Output Token (TPOT) is the average latency incurred for generating each subsequent token after the first in an autoregressive model. It directly dictates the speed of streaming completions.

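Direct Relationship: For a single request, TPOT is the inverse of the token generation rate in tokens per second. Across a batch, aggregate throughput is approximately the batch size divided by TPOT.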
Optimization Target: Techniques like continuous batching, PagedAttention, and speculative decoding aim to minimize TPOT by improving GPU utilization and reducing memory bottlenecks during the decoding phase.