Inferensys


Decoding Latency

Decoding latency is the time consumed during the autoregressive token generation phase of inference, where each new token is produced conditioned on all previously generated tokens.
LATENCY BENCHMARKING

What is Decoding Latency?

Decoding latency is the critical time component during the token-by-token generation phase of an autoregressive language model's inference.

This phase is inherently sequential, because a decode step cannot begin until the previous token has been sampled, and it often dominates total inference latency for long outputs. Each step runs the model's decoder layers once to compute logits and sample the next token, so performance depends heavily on Key-Value (KV) cache management and memory bandwidth.

Key factors affecting decoding latency include model size, sequence length, and hardware efficiency. Optimizations like continuous batching, PagedAttention, and speculative decoding target this phase directly. For streaming applications, decoding latency determines the Time Per Output Token (TPOT), which shapes user-perceived responsiveness. It is a primary metric for evaluating the efficiency of inference-serving engines like vLLM and TensorRT-LLM.
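The relationship between prefill cost, TPOT, and total latency can be written down as a small sketch. It assumes one prefill step followed by decode steps of roughly equal duration; the function names and the example numbers are illustrative, not taken from any serving engine.

```python
def estimate_total_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate end-to-end generation time for one request.

    Assumes the first token arrives after ttft_s and each of the
    remaining output_tokens - 1 tokens takes about tpot_s.
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + tpot_s * (output_tokens - 1)


def decode_share(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Fraction of the estimated total latency spent in the decoding phase."""
    total = estimate_total_latency(ttft_s, tpot_s, output_tokens)
    if total == 0:
        return 0.0
    return (total - ttft_s) / total
```

With a hypothetical TTFT of 0.2 s and TPOT of 20 ms, a 501-token reply takes about 10.2 s, of which roughly 98% is decoding. An 11-token reply under the same assumptions splits evenly between prefill and decoding.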


Key Components of Decoding Latency

Decoding latency is not a monolithic metric but the combined effect of several distinct, measurable sub-processes. The breakdown below also covers prefilling, which runs before the decode loop and sets its starting state, along with the scheduling and framework overheads that add to every step. Understanding these components is essential for systematic profiling and optimization.

01

Prefilling Latency

The time required for the model to perform a single, full forward pass on the static input prompt and context. This phase generates the initial Key-Value (KV) cache for the attention mechanism, which is then used during autoregressive generation. It is a one-time, upfront cost before the first output token is produced, so it shapes Time to First Token (TTFT) rather than TPOT. Factors influencing prefilling latency:

  • Length of the input context/prompt.
  • Model size (parameter count).
  • Hardware compute and memory bandwidth.
  • Batch size of concurrent requests.
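A rough feel for how prompt length and model size drive prefill cost comes from the common rule of thumb of about 2 FLOPs per parameter per token for a forward pass. The sketch below uses that approximation and ignores the attention term that grows quadratically with prompt length; the `utilization` parameter and the hardware figures in the usage note are assumptions for illustration.

```python
def prefill_flops(num_params: float, prompt_tokens: int) -> float:
    """Approximate forward-pass FLOPs for the prompt (about 2 per parameter per token)."""
    return 2.0 * num_params * prompt_tokens


def prefill_time_estimate_s(
    num_params: float,
    prompt_tokens: int,
    peak_flops_per_s: float,
    utilization: float = 0.5,  # assumed fraction of peak compute actually achieved
) -> float:
    """Compute-bound estimate of prefill time under the stated assumptions."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return prefill_flops(num_params, prompt_tokens) / (peak_flops_per_s * utilization)
```

For a hypothetical 7-billion-parameter model, a 2,048-token prompt, and an accelerator with 300 TFLOP/s peak at 50% utilization, this gives roughly 0.19 s. Doubling the prompt length doubles the estimate.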
02

Time Per Output Token (TPOT)

The average latency incurred for generating each subsequent token after the first in an autoregressive sequence. This is the core iterative cost of decoding. TPOT is primarily driven by the small, sequential forward pass needed to produce the next token, which reads from and updates the KV cache. Key determinants of TPOT:

  • Model architecture and per-token FLOPs.
  • Efficiency of KV cache memory access (bandwidth-bound).
  • GPU kernel launch overhead for small operations.
  • Continuous batching efficiency, which can amortize overhead across requests.
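Because each decode step must read the model weights and the KV cache from GPU memory, a simple bandwidth floor for step time can be sketched as bytes moved divided by memory bandwidth. This is a lower bound under stated assumptions: it ignores compute time, kernel launch overhead, and any caching effects, and all figures in the usage note are hypothetical.

```python
def decode_step_floor_s(
    weight_bytes: float,
    kv_bytes_per_seq: float,
    batch_size: int,
    bandwidth_bytes_per_s: float,
) -> float:
    """Bandwidth-bound floor on the duration of one decode step.

    Weights are read once per step and shared by the whole batch;
    each sequence additionally reads its own KV cache.
    """
    bytes_moved = weight_bytes + kv_bytes_per_seq * batch_size
    return bytes_moved / bandwidth_bytes_per_s
```

With 14 GB of weights, 1 GB of KV cache per sequence, and 2 TB/s of bandwidth, the floor is 7.5 ms per step for one sequence and 11 ms for a batch of eight. The batch of eight produces eight tokens in that step, which is how batching amortizes the weight reads while modestly raising TPOT for each request.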
03

Key-Value Cache Management

The latency overhead associated with storing and retrieving the attention key and value tensors for all previous tokens. This cache grows linearly with sequence length and is critical for avoiding recomputation. Inefficient management is a major source of decoding slowdown. Critical aspects include:

  • Memory allocation and fragmentation for variable-length sequences.
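  • Memory bandwidth saturation: the KV cache resides in GPU high-bandwidth memory and is far larger than on-chip SRAM and L2 cache, so each decode step must stream it from main GPU memory.
  • PagedAttention (as in vLLM), which stores the KV cache in fixed-size blocks that need not be contiguous, nearly eliminating fragmentation and limiting waste to the last partially filled block of each sequence. This significantly improves throughput and latency under high concurrency.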
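The linear growth of the cache and the benefit of block-based allocation can be illustrated with a few lines of arithmetic. This is a sketch of the general idea, not vLLM's implementation; the block size of 16 tokens and the model dimensions in the usage note are assumptions chosen for illustration.

```python
import math


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence. The factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def wasted_slots_contiguous(seq_len: int, max_len: int) -> int:
    """Token slots left unused when max_len slots are reserved up front."""
    return max_len - seq_len


def wasted_slots_paged(seq_len: int, block_size: int = 16) -> int:
    """Token slots left unused when memory is allocated in fixed-size blocks."""
    blocks = math.ceil(seq_len / block_size)
    return blocks * block_size - seq_len
```

For a hypothetical model with 32 layers, 32 KV heads, a head dimension of 128, and 16-bit values, each token adds 512 KiB of cache, so a 4,096-token sequence needs 2 GiB. A 100-token sequence wastes 1,948 slots under a 2,048-slot contiguous reservation but only 12 slots with 16-token blocks.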
04

Scheduling & Continuous Batching

The latency introduced or saved by the inference scheduler's strategy for grouping and executing requests. Continuous batching (or in-flight batching) dynamically adds new requests to a running batch as others finish, maximizing GPU utilization. Scheduling impacts include:

  • Request queuing delay: Time a request waits before execution begins.
  • GPU utilization vs. latency trade-off: Larger batches increase throughput but can raise TPOT for individual requests.
  • Memory contention from multiple concurrent sequences sharing GPU resources.
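The effect of refilling batch slots as requests finish can be seen in a toy simulation. The sketch below counts decode steps only, assumes all requests are already queued, and treats every step as equally expensive; `simulate` and its parameters are hypothetical names, not the API of any scheduler.

```python
from collections import deque


def simulate(output_lengths, max_batch, continuous):
    """Return the number of decode steps needed to finish all requests.

    output_lengths: tokens each request must generate, in arrival order.
    continuous: if True, free slots are refilled before every step;
                if False, a new batch starts only after the old one drains.
    """
    waiting = deque(output_lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        if continuous or not running:
            while waiting and len(running) < max_batch:
                running.append(waiting.popleft())
        # One decode step produces one token for every running request.
        running = [r - 1 for r in running]
        running = [r for r in running if r > 0]
        steps += 1
    return steps
```

For requests needing 2, 8, 2, and 2 tokens with a batch limit of two, static batching takes 10 steps because the short requests wait for the long one to finish, while continuous batching takes 8.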
05

Model Execution & Kernel Overhead

The latency from the low-level execution of neural network operations on the GPU. This involves the launch and execution of many small computational kernels. Optimizations here directly reduce TPOT:

  • Operator Fusion: Combining multiple sequential ops (e.g., Linear, Bias, Activation) into a single kernel to reduce memory accesses and launch overhead.
  • Kernel Auto-Tuning: Selecting the most efficient GPU kernel implementation for specific input sizes and hardware.
  • Use of optimized model execution graphs from compilers like TensorRT or ONNX Runtime, which apply these optimizations statically.
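The idea behind operator fusion can be shown with a CPU-side analogy. The sketch below is plain Python, not GPU code: each list comprehension in `unfused` stands in for a separate kernel that writes an intermediate result to memory, while `fused` makes a single pass. The scalar weight and bias are a simplification of a real linear layer.

```python
def unfused(xs, w, b):
    """Three passes over the data, each materializing an intermediate list."""
    scaled = [w * x for x in xs]            # stand-in for kernel 1: scale
    biased = [s + b for s in scaled]        # stand-in for kernel 2: add bias
    return [max(0.0, v) for v in biased]    # stand-in for kernel 3: ReLU


def fused(xs, w, b):
    """One pass, no intermediates: the same arithmetic per element."""
    return [max(0.0, w * x + b) for x in xs]
```

Both functions return the same values; the fused version simply avoids writing and re-reading the two intermediate lists, which is the memory traffic and launch count that fusion removes on a GPU.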
06

System & Framework Overhead

The ancillary latency that comes not from the model's computation itself but from the serving framework and system stack. This can become a significant portion of total latency, especially for short sequences or at high queries per second (QPS). Components include:

  • GPU-CPU Synchronization: Overhead from device-host memory transfers and synchronization points.
  • Python GIL Contention: In Python-based servers, the Global Interpreter Lock can block concurrent request handling.
  • gRPC/HTTP Latency: Network stack overhead for remote procedure calls, including serialization/deserialization of protocol buffers.
  • Token Sampling Logic: The computational cost of applying top-p, top-k, or temperature scaling to logits.
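To make the sampling cost concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k and top-p filtering. It is one possible ordering of these steps, written for clarity rather than speed; production engines run this on the GPU and may apply or renormalize the filters differently. The function name and signature are illustrative.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample a token index from logits. top_k=0 disables the top-k filter."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature scaling, then a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sorting the whole vocabulary is the main cost of these filters.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the surviving weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Calling `sample_token(logits, top_k=1)` always returns the highest-probability index, which is equivalent to greedy decoding. Passing a seeded `random.Random` instance as `rng` makes runs reproducible.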
MEASUREMENT AND KEY METRICS

Decoding Latency


Decoding latency, also called token generation latency, is often the dominant component of total inference time for large language models (LLMs) when outputs are long. It measures the sequential delay as a model generates its output one token at a time, with each step dependent on the full history stored in the Key-Value (KV) cache. At small batch sizes each step performs little arithmetic per byte of weights and cache it reads, so the phase is typically memory-bandwidth bound rather than compute bound, which makes its optimization critical for real-time applications like chatbots and streaming APIs.

Key factors influencing decoding latency include model size, sequence length, and hardware efficiency. Performance is profiled using metrics like Time Per Output Token (TPOT). Optimization techniques such as continuous batching, PagedAttention for efficient KV cache management, and speculative decoding are employed to reduce this latency, directly impacting user-perceived responsiveness and system throughput.

DECODING LATENCY

Primary Optimization Techniques

The techniques named in this entry (continuous batching, PagedAttention, speculative decoding, operator fusion, and kernel auto-tuning) are engineered to accelerate the sequential token generation process, reducing time per output token (TPOT) and improving throughput.
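Speculative decoding is the least intuitive of these techniques, so a toy sketch of its greedy variant may help. A cheap draft model proposes `k` tokens, the target model verifies them, and the longest agreeing prefix is accepted along with one token from the target. The model callables here are hypothetical stand-ins; a real engine verifies all proposed positions in one batched forward pass, which this sketch emulates position by position.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding sketch.

    target_next, draft_next: functions mapping a token list to the next token.
    Returns (generated tokens, number of target verification passes).
    The output matches plain greedy decoding with target_next.
    """
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < num_tokens:
        # 1. The draft model proposes k tokens sequentially.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2. The target model checks the proposal (one pass in a real engine).
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's correction ends the round
                break
        else:
            # Every proposal matched, so the pass also yields one extra token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + num_tokens], target_passes


def toy_target(seq):
    return (seq[-1] + 1) % 10


def toy_draft(seq):
    # Agrees with the target except right after token 3.
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10
```

With these toy models, `speculative_decode(toy_target, toy_draft, [0], 8)` produces the same eight tokens as greedy decoding with the target alone, but in two target passes instead of eight. The saving depends entirely on how often the draft agrees with the target.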


Frequently Asked Questions

Decoding latency is the time consumed during the autoregressive token generation phase of inference, where each new token is produced conditioned on all previously generated tokens. This FAQ addresses common technical questions about its measurement, optimization, and impact on system performance.

Decoding latency is the cumulative time a language model spends generating output tokens one-by-one in an autoregressive loop. It is measured from the completion of the prefill phase (once the prompt's KV cache is ready and the first token has been produced) until the final token is produced. Its key metric is Time Per Output Token (TPOT); Time to First Token (TTFT) covers queuing and prefill instead, and the two together define the streaming speed of a completion. Profiling tools like the PyTorch Profiler or NVIDIA Nsight Systems are used to isolate decoding latency from other system components like network transfer or request queuing.
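At the client side, these metrics can be approximated by timestamping tokens as they arrive from a stream. The sketch below is a minimal harness under stated assumptions: `token_stream` is any iterator that yields tokens as they are produced, and timing starts when the function is called, so any queuing or network delay hidden inside the iterator is counted in TTFT. The injectable `clock` parameter is an illustrative convenience for testing.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Consume a token iterator and report token count, TTFT, and mean TPOT."""
    start = clock()
    stamps = [clock() for _ in token_stream]  # one timestamp per arriving token
    if not stamps:
        return {"tokens": 0, "ttft_s": None, "tpot_s": None}
    n = len(stamps)
    ttft = stamps[0] - start
    # Mean gap between consecutive tokens after the first one.
    tpot = (stamps[-1] - stamps[0]) / (n - 1) if n > 1 else None
    return {"tokens": n, "ttft_s": ttft, "tpot_s": tpot}
```

Because this measures wall-clock time at the consumer, it cannot separate GPU time from framework or network overhead; that separation requires a profiler on the serving side.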

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.