Inferensys

Time Per Output Token (TPOT)

Time Per Output Token (TPOT) is the average latency incurred for generating each subsequent token after the first in an autoregressive model, directly impacting the speed of streaming completions.
LATENCY BENCHMARKING

What is Time Per Output Token (TPOT)?

A core metric for evaluating the streaming performance of autoregressive language models.

Time Per Output Token (TPOT) is the average latency of generating each token after the first in an autoregressive model, so it directly measures the speed of the decoding phase. It is calculated by dividing the decoding time (total generation time minus the Time to First Token) by the number of tokens produced after the first. This metric is critical for understanding the user-perceived speed of streaming completions, as it dictates the pace at which text appears after the initial response.

TPOT is primarily driven by the cost of the model's autoregressive forward pass and is heavily influenced by model size, the memory bandwidth needed to read weights and the KV cache, and GPU kernel launch overhead. It sits on the throughput-latency curve: techniques like continuous batching raise queries per second (QPS), often at the cost of higher per-request TPOT. Engineers reduce TPOT through model quantization, operator fusion, and decoding algorithms such as speculative decoding to improve real-time interactivity.

Key Components of TPOT

Time Per Output Token (TPOT) measures the speed of streaming text generation. It is a critical metric for user-perceived performance in conversational and interactive AI applications.

01

Core Definition & Formula

Time Per Output Token (TPOT) is the average latency incurred for generating each subsequent token after the first in an autoregressive language model. It is distinct from Time to First Token (TTFT), which measures initial responsiveness.

  • Formula: TPOT = (Total Generation Time - TTFT) / (Number of Output Tokens - 1).
  • Primary Driver: The autoregressive decoding loop, where each new token is generated conditioned on all previous tokens.
  • Key Impact: Directly determines the speed of streaming completions, affecting user experience in chat interfaces, code generation, and real-time assistants.
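
The formula above can be sketched as a small helper. This is a minimal illustration, not a reference implementation: the function name and the request figures are hypothetical, and all times are assumed to be in seconds.

```python
def time_per_output_token(total_generation_time_s: float,
                          ttft_s: float,
                          num_output_tokens: int) -> float:
    """Average seconds per token after the first.

    Implements TPOT = (total generation time - TTFT) / (output tokens - 1).
    """
    if num_output_tokens < 2:
        raise ValueError("TPOT is undefined for fewer than two output tokens")
    if ttft_s > total_generation_time_s:
        raise ValueError("TTFT cannot exceed the total generation time")
    return (total_generation_time_s - ttft_s) / (num_output_tokens - 1)


# Hypothetical request: 0.40 s to first token, 5.40 s in total, 101 output tokens.
tpot_s = time_per_output_token(5.40, 0.40, 101)
print(f"TPOT = {tpot_s * 1000:.1f} ms/token")  # prints: TPOT = 50.0 ms/token
```

Dividing by the token count minus one matters most for short completions; for a 10-token response, dividing by 10 instead of 9 understates TPOT by about 10%.
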
02

Relationship to Other Latency Metrics

TPOT must be analyzed within a hierarchy of latency metrics to diagnose system performance fully.

  • Time to First Token (TTFT): Measures initial 'thinking' time, dominated by prompt processing (prefill). TPOT governs the speed after this initial delay.
  • End-to-End Latency: The total user-perceived time. For long outputs, TPOT is the dominant component: E2E Latency ≈ TTFT + (Output Length - 1) * TPOT, since TTFT already accounts for the first token.
  • Throughput-Latency Trade-off: Systems optimized for high Queries Per Second (QPS) often batch requests, which can increase individual request TPOT due to interleaved execution.
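
To see why TPOT dominates long responses, the relationship between the metrics can be sketched numerically. The sketch assumes the first token is counted inside TTFT, so decoding covers the remaining tokens, and it ignores network and serialization overhead. The function name and the TTFT and TPOT values are illustrative only.

```python
def estimate_e2e_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Rough end-to-end latency: TTFT plus one TPOT per token after the first."""
    if output_tokens < 1:
        raise ValueError("at least one output token is required")
    return ttft_s + (output_tokens - 1) * tpot_s


# Illustrative values: 500 ms TTFT, 40 ms TPOT.
TTFT_S, TPOT_S = 0.5, 0.04
for n in (16, 256, 1024):
    total = estimate_e2e_latency_s(TTFT_S, TPOT_S, n)
    decode_share = (total - TTFT_S) / total
    print(f"{n:>5} tokens: {total:6.2f} s total, {decode_share:.0%} spent decoding")
```

With these numbers the decoding share grows from roughly half of the total at 16 tokens to nearly all of it at 1,024 tokens.
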
03

Technical Determinants & Bottlenecks

TPOT is governed by a pipeline of compute and memory operations. Common bottlenecks include:

  • GPU Kernel Execution: The time for the forward pass through all decoder layers to produce logits for the next token.
  • Memory Bandwidth: The 'memory wall'—loading model weights and the growing Key-Value (KV) Cache for attention from GPU memory is often the limiting factor.
  • Sampling Overhead: Operations like top-p/top-k filtering and random number generation for token selection.
  • GPU Kernel Launch Overhead: Latency from scheduling many small operations, especially problematic for small batch sizes.
  • Host-Device Synchronization: Unnecessary CPU waits for GPU results can inflate measured TPOT.
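
The memory-bandwidth bottleneck lends itself to a back-of-envelope floor on TPOT. The sketch below assumes a dense model at batch size 1, where every weight and the whole KV cache are read once per decode step and compute is fully hidden behind memory traffic. The function name, parameter count, cache size, and bandwidth figure are hypothetical round numbers, not measurements of any specific GPU.

```python
def bandwidth_bound_tpot_ms(param_count: float,
                            bytes_per_param: float,
                            kv_cache_bytes: float,
                            mem_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on TPOT if each decode step only had to stream bytes from memory."""
    if mem_bandwidth_bytes_per_s <= 0:
        raise ValueError("memory bandwidth must be positive")
    bytes_per_step = param_count * bytes_per_param + kv_cache_bytes
    return bytes_per_step / mem_bandwidth_bytes_per_s * 1000.0


# Hypothetical setup: 7e9 parameters, 1 GB of KV cache, 1 TB/s of memory bandwidth.
fp16_floor = bandwidth_bound_tpot_ms(7e9, 2, 1e9, 1e12)
int8_floor = bandwidth_bound_tpot_ms(7e9, 1, 1e9, 1e12)
print(f"FP16 weights: >= {fp16_floor:.1f} ms/token")  # prints: FP16 weights: >= 15.0 ms/token
print(f"INT8 weights: >= {int8_floor:.1f} ms/token")  # prints: INT8 weights: >= 8.0 ms/token
```

Measured TPOT well above this floor suggests that something other than raw bandwidth, such as kernel launch overhead or host-device synchronization, is contributing.
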
04

Optimization Techniques

Reducing TPOT is a primary goal of inference optimization. Key techniques include:

  • Continuous Batching: Dynamically admits new requests into the running batch as others finish, maximizing GPU utilization and amortizing each weight read across more tokens. The main gains are throughput and shorter queuing; per-request TPOT can rise slightly as the batch grows.
  • PagedAttention (vLLM): Manages the KV cache in fixed-size blocks using virtual memory concepts, removing most fragmentation and waste and allowing larger effective batch sizes.
  • Model Quantization: Using lower precision (e.g., INT8 or INT4 in place of FP16) for weights, and in some schemes activations, reduces memory bandwidth pressure and can accelerate compute.
  • Operator Fusion & Optimized Kernels: Compilers like TensorRT fuse sequential operations (e.g., Linear + GeLU) into single GPU kernels to reduce launch overhead.
  • Speculative Decoding: Uses a small draft model to propose multiple tokens verified in parallel by the main model, reducing the number of slow autoregressive steps.
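
The potential gain from speculative decoding can be estimated with a simplified model. The sketch below assumes every drafted token is accepted independently with the same probability, which real workloads only approximate. The function names, the acceptance rate, the draft length, and the draft-to-target cost ratio are all illustrative assumptions.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass under i.i.d. acceptance."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        # Every drafted token is accepted, plus one token from the verification pass.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


def expected_speedup(acceptance_rate: float, draft_len: int,
                     draft_cost_ratio: float) -> float:
    """Estimated reduction factor in TPOT versus plain autoregressive decoding.

    draft_cost_ratio is the time of one draft-model step divided by the time
    of one target-model step.
    """
    step_cost = draft_len * draft_cost_ratio + 1.0
    return expected_tokens_per_step(acceptance_rate, draft_len) / step_cost


# Illustrative values: 80% acceptance, 4 drafted tokens, draft step costs 5% of a target step.
print(f"{expected_tokens_per_step(0.8, 4):.2f} tokens per target pass")  # prints: 3.36 tokens per target pass
print(f"{expected_speedup(0.8, 4, 0.05):.2f}x estimated speedup")        # prints: 2.80x estimated speedup
```

The estimate falls quickly as the acceptance rate drops, so the draft model's agreement with the main model matters more than its raw speed.
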
05

Measurement & Profiling

Accurate TPOT measurement requires isolating the decoding phase from system noise.

  • Profiling Tools: Use PyTorch Profiler, NVIDIA Nsight Systems, or vLLM's built-in metrics to trace the autoregressive loop.
  • Isolating Variables: Measure TPOT across different output lengths and batch sizes to generate a performance profile.
  • Distinguishing Components: Advanced profiling can break down TPOT into compute, memory, and sampling latencies.
  • Load Testing: TPOT typically increases under higher concurrent request load due to resource contention; measure under realistic load.
06

SLOs & Performance Baselines

TPOT targets are defined as Service Level Objectives (SLOs) to guarantee user experience.

  • Setting Targets: SLOs are often defined for a percentile (e.g., P95 TPOT < 75 ms/token) under expected load.
  • Establishing Baselines: A performance baseline documents TPOT for a specific model, hardware, and batch size configuration.
  • Canary Analysis: New model versions or optimizations are deployed to a small traffic subset, and their TPOT is compared against the baseline before full rollout.
  • Bottleneck Identification: When TPOT violates SLOs, profiling identifies the limiting resource (compute, memory bandwidth) to guide remediation.
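
A percentile-based SLO check of the kind described above might look like the following sketch. It uses the nearest-rank percentile definition, which is one of several common conventions. The function names and sample values are hypothetical, and the 75 ms threshold simply mirrors the example target.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0.0 < pct <= 100.0:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1]


def meets_tpot_slo(tpot_ms_samples: Sequence[float],
                   pct: float = 95.0,
                   threshold_ms: float = 75.0) -> bool:
    """True if the chosen TPOT percentile is strictly below the threshold."""
    return percentile_nearest_rank(tpot_ms_samples, pct) < threshold_ms


# Hypothetical per-request TPOT samples in ms/token, with one slow outlier.
samples = [50.0] * 18 + [70.0, 120.0]
print("P95 =", percentile_nearest_rank(samples, 95.0), "ms ->", meets_tpot_slo(samples))
print("P99 =", percentile_nearest_rank(samples, 99.0), "ms ->", meets_tpot_slo(samples, pct=99.0))
```

In this sample the P95 target is met while a P99 target at the same threshold would fail, which is why the percentile is part of the SLO definition rather than an afterthought.
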

How TPOT is Measured and Calculated

Time Per Output Token (TPOT) is a critical performance metric for autoregressive models, quantifying the speed of text generation after the initial response.

Time Per Output Token (TPOT) is the average latency incurred for generating each subsequent token after the first in an autoregressive model, directly impacting the speed of streaming completions. It is calculated by measuring the total decoding latency—the time from the generation of the first token to the generation of the final token—and dividing it by the number of tokens generated in that interval (total output tokens minus one). This isolates the incremental cost of the autoregressive loop from the initial prefilling latency.

Accurate TPOT measurement requires a stable, isolated environment to avoid interference from request queuing delay or system noise. Profiling tools capture a timestamp for each generated token. The metric is sensitive to sequence length (the KV cache grows with every prompt and output token), batch size, model architecture, and hardware efficiency. Engineers use TPOT to tune continuous batching strategies and to evaluate the impact of techniques like speculative decoding or model quantization on generation speed for real-time applications.
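
Per-token timestamp capture can be sketched with a generic iterator. This is only an illustration under stated assumptions: `fake_token_stream` stands in for a real streaming client, each yielded item is assumed to be exactly one token (real streaming APIs may deliver several tokens per chunk), and client-side timestamps also include any network jitter.

```python
import time
from typing import Callable, Iterable, Iterator, List


def record_token_timestamps(stream: Iterable[str],
                            clock: Callable[[], float] = time.perf_counter) -> List[float]:
    """Record the arrival time of every item yielded by a token stream."""
    timestamps = []
    for _ in stream:
        timestamps.append(clock())
    return timestamps


def tpot_from_timestamps(timestamps: List[float]) -> float:
    """Time from first to last token, divided by the number of tokens after the first."""
    if len(timestamps) < 2:
        raise ValueError("need at least two token timestamps")
    return (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1)


def fake_token_stream(num_tokens: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model stream: sleeps, then yields a token."""
    for i in range(num_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


stamps = record_token_timestamps(fake_token_stream(5, 0.01))
print(f"measured TPOT: {tpot_from_timestamps(stamps) * 1000:.1f} ms/token")
```

Passing the clock as a parameter keeps the measurement logic testable with synthetic timestamps, and `time.perf_counter` is used by default because it is monotonic and has high resolution.
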

LATENCY METRIC COMPARISON

TPOT vs. Other Key Latency Metrics

A comparison of Time Per Output Token (TPOT) with other critical latency metrics used to profile and optimize inference performance.

| Metric | Definition | Primary Driver | Key Impact | Typical Optimization Target |
| --- | --- | --- | --- | --- |
| Time Per Output Token (TPOT) | Average latency to generate each subsequent token after the first in an autoregressive model. | Decoding compute, memory bandwidth for KV cache. | Speed of streaming completions, user-perceived fluidity. | GPU kernel efficiency, KV cache memory management (e.g., PagedAttention). |
| Time to First Token (TTFT) | Duration from request start to the generation/delivery of the first output token. | Prefilling compute, model loading (cold start), initial prompt processing. | Perceived responsiveness, user engagement onset. | Prefilling latency reduction, continuous batching, cold start mitigation. |
| End-to-End Latency | Total elapsed time from client request initiation to receipt of complete response. | Network transmission, queuing, TTFT, TPOT, serialization. | Overall user experience, task completion time. | System-wide optimization, reducing queuing delays, network overhead. |
| Tail Latency (P95/P99) | High-percentile (e.g., 95th/99th) response times representing the slowest requests. | Resource contention, garbage collection, straggler requests, variable input lengths. | Worst-case user experience, system reliability and stability. | Load balancing, request scheduling, mitigating outliers, robust autoscaling. |
| Throughput (QPS) at Latency SLO | Number of queries per second a system can process while meeting a latency Service Level Objective. | Batch size, hardware utilization (GPU/CPU), efficient scheduling. | System capacity and cost-efficiency at a defined performance guarantee. | Finding the optimal throughput-latency curve operating point via continuous batching. |
| Cold Start Latency | Additional delay for the first request(s) to an unloaded model. | Model loading from disk, initialization, and cache warming. | Service readiness after deployment or scaling, user experience for infrequent endpoints. | Model quantization, faster storage, pre-warming strategies, keeping models resident. |
| Prefilling Latency | Time to process the static input prompt through the model's forward pass. | Prompt length, model architecture (attention complexity), compute hardware. | Delay before token generation can begin, part of TTFT. | Optimized attention implementations for long contexts, operator fusion. |
| Decoding Latency | Time consumed during the autoregressive token generation phase. | Autoregressive step computation, KV cache memory access, generated sequence length. | Directly correlates with TPOT; the core latency of text generation. | Speculative decoding, optimized decoding kernels, attention optimization. |

TIME PER OUTPUT TOKEN (TPOT)

Frequently Asked Questions

Time Per Output Token (TPOT) is a critical latency metric for autoregressive models like LLMs, measuring the speed of streaming text generation after the initial prompt processing.

Time Per Output Token (TPOT) is the average latency incurred for generating each subsequent token after the first in an autoregressive language model. It directly measures the speed of the decoding phase, where the model produces a sequence one token at a time, with each new token conditioned on all previous ones. This metric is distinct from Time to First Token (TTFT), which measures the initial processing delay. TPOT is the primary determinant of perceived speed in streaming applications like chatbots, as it governs the rate at which text appears to the user. It is typically measured in milliseconds per token (ms/tok).

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.