Time Per Output Token (TPOT) is the average latency incurred for generating each subsequent token after the first in an autoregressive model, directly measuring the speed of the decoding phase. It is calculated by dividing the total generation time, minus the Time to First Token, by the number of tokens produced after the first. This metric is critical for understanding the user-perceived speed of streaming completions, as it dictates the pace at which text appears after the initial response.

What is Time Per Output Token (TPOT)?
A core metric for evaluating the streaming performance of autoregressive language models.
TPOT is primarily driven by the computational cost of the model's autoregressive forward pass and is heavily influenced by model size, the memory bandwidth consumed by weight and KV cache reads, and GPU kernel launch overhead. It sits on the throughput-latency curve: larger batches under continuous batching raise queries per second (QPS) but can lengthen each request's TPOT. Engineers optimize TPOT through methods such as model quantization, operator fusion, and advanced decoding algorithms like speculative decoding to improve real-time interactivity.
Key Components of TPOT
Time Per Output Token (TPOT) measures the speed of streaming text generation. It is a critical metric for user-perceived performance in conversational and interactive AI applications.
Core Definition & Formula
Time Per Output Token (TPOT) is the average latency incurred for generating each subsequent token after the first in an autoregressive language model. It is distinct from Time to First Token (TTFT), which measures initial responsiveness.
- Formula: TPOT = (Total Generation Time - TTFT) / (Number of Output Tokens - 1).
- Primary Driver: The autoregressive decoding loop, where each new token is generated conditioned on all previous tokens.
- Key Impact: Directly determines the speed of streaming completions, affecting user experience in chat interfaces, code generation, and real-time assistants.
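As a minimal sketch of the formula above, the function below computes TPOT from three per-request measurements. The function name and the sample numbers are illustrative assumptions, not values from a real system.

```python
def compute_tpot(total_generation_s: float, ttft_s: float, num_output_tokens: int) -> float:
    """Average seconds per output token after the first one."""
    if num_output_tokens < 2:
        # With a single token there is no decoding interval to average over.
        raise ValueError("TPOT is undefined for fewer than two output tokens")
    return (total_generation_s - ttft_s) / (num_output_tokens - 1)


# Hypothetical request: 0.4 s to first token, 5.4 s in total, 101 output tokens.
tpot_s = compute_tpot(total_generation_s=5.4, ttft_s=0.4, num_output_tokens=101)
print(f"TPOT: {tpot_s * 1000:.1f} ms/token")  # prints: TPOT: 50.0 ms/token
```

The denominator is the number of tokens minus one because the first token's cost is already captured by TTFT.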
Relationship to Other Latency Metrics
TPOT must be analyzed within a hierarchy of latency metrics to diagnose system performance fully.
- Time to First Token (TTFT): Measures initial 'thinking' time, dominated by prompt processing (prefill). TPOT governs the speed after this initial delay.
- End-to-End Latency: The total user-perceived time. For long outputs, TPOT is the dominant component: E2E Latency ≈ TTFT + (Output Length - 1) * TPOT.
- Throughput-Latency Trade-off: Systems optimized for high Queries Per Second (QPS) often batch requests, which can increase individual request TPOT due to interleaved execution.
Technical Determinants & Bottlenecks
TPOT is governed by a pipeline of compute and memory operations. Common bottlenecks include:
- GPU Kernel Execution: The time for the forward pass through the decoder layers to produce logits for the next token.
- Memory Bandwidth: The 'memory wall'—loading model weights and the growing Key-Value (KV) Cache for attention from GPU memory is often the limiting factor.
- Sampling Overhead: Operations like top-p/top-k filtering and random number generation for token selection.
- GPU Kernel Launch Overhead: Latency from scheduling many small operations, especially problematic for small batch sizes.
- Host-Device Synchronization: Unnecessary CPU waits for GPU results can inflate measured TPOT.
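To make the memory-bandwidth bottleneck concrete, a rough lower bound on per-step decode time can be estimated by assuming every weight byte and every KV cache byte is read from GPU memory once per step. The sketch below applies that back-of-the-envelope model for a single request; the function name and all numbers are illustrative assumptions, and real systems will be slower because of compute, launch overhead, and imperfect bandwidth utilization.

```python
def decode_bandwidth_floor_ms(
    num_params: float,
    bytes_per_param: float,
    kv_cache_bytes: float,
    mem_bandwidth_bytes_per_s: float,
) -> float:
    """Lower bound on one decode step, in ms, if all bytes are read exactly once."""
    bytes_read = num_params * bytes_per_param + kv_cache_bytes
    return bytes_read / mem_bandwidth_bytes_per_s * 1000.0


# Illustrative inputs: 7e9 parameters at 2 bytes each (FP16),
# a 1 GB KV cache, and 1e12 bytes/s of memory bandwidth.
floor_ms = decode_bandwidth_floor_ms(7e9, 2, 1e9, 1e12)
print(f"Bandwidth-bound floor: {floor_ms:.1f} ms/token")  # prints: 15.0 ms/token

# Halving bytes per parameter (e.g. INT8 weights) lowers the floor accordingly.
print(f"With 1-byte weights: {decode_bandwidth_floor_ms(7e9, 1, 1e9, 1e12):.1f} ms/token")
```

If a measured TPOT is close to this floor, the decode step is likely memory-bound; if it is far above, the other bottlenecks in the list deserve a look first.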
Optimization Techniques
Reducing TPOT is a primary goal of inference optimization. Key techniques include:
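- Continuous Batching: Dynamically admits new requests into the running batch as others finish, maximizing GPU utilization and amortizing weight reads across the requests in each decoding step. The main gain is throughput; an individual request's TPOT can still rise as the batch grows.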
- PagedAttention (vLLM): Manages the KV cache using virtual memory concepts, eliminating fragmentation and waste, allowing larger effective batch sizes.
- Model Quantization: Using lower precision (e.g., FP16, INT8) for weights and activations reduces memory bandwidth pressure and accelerates compute.
- Operator Fusion & Optimized Kernels: Compilers like TensorRT fuse sequential operations (e.g., Linear + GeLU) into single GPU kernels to reduce launch overhead.
- Speculative Decoding: Uses a small draft model to propose multiple tokens verified in parallel by the main model, reducing the number of slow autoregressive steps.
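The effect of speculative decoding on TPOT can be approximated with a simplified model commonly used in the speculative decoding literature: if each draft token is accepted independently with probability alpha and the draft model proposes gamma tokens per round, the expected number of tokens produced per verification pass is (1 - alpha^(gamma+1)) / (1 - alpha). The sketch below applies that model. It further assumes that verifying the proposed tokens costs about one main-model step; the function names and timings are hypothetical.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification pass under i.i.d. acceptance."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft token accepted, plus one from the main model
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def effective_tpot_ms(target_step_ms: float, draft_step_ms: float,
                      alpha: float, gamma: int) -> float:
    """Estimated ms per token: one round's cost divided by the tokens it yields."""
    round_cost_ms = target_step_ms + gamma * draft_step_ms
    return round_cost_ms / expected_tokens_per_round(alpha, gamma)


# Hypothetical setup: 40 ms main-model step, 4 ms draft step, alpha=0.8, gamma=4.
print(f"{expected_tokens_per_round(0.8, 4):.2f} tokens per round")        # prints: 3.36
print(f"{effective_tpot_ms(40.0, 4.0, 0.8, 4):.2f} ms/token (estimate)")  # prints: 16.66
```

With gamma set to 0 the estimate reduces to the plain main-model step time, which is a quick sanity check on the model.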
Measurement & Profiling
Accurate TPOT measurement requires isolating the decoding phase from system noise.
- Profiling Tools: Use PyTorch Profiler, NVIDIA Nsight Systems, or vLLM's built-in metrics to trace the autoregressive loop.
- Isolating Variables: Measure TPOT across different output lengths and batch sizes to generate a performance profile.
- Distinguishing Components: Advanced profiling can break down TPOT into compute, memory, and sampling latencies.
- Load Testing: TPOT typically increases under higher concurrent request load due to resource contention; measure under realistic load.
SLOs & Performance Baselines
TPOT targets are defined as Service Level Objectives (SLOs) to guarantee user experience.
- Setting Targets: SLOs are often defined for a percentile (e.g., P95 TPOT < 75ms/token) under expected load.
- Establishing Baselines: A performance baseline documents TPOT for a specific model, hardware, and batch size configuration.
- Canary Analysis: New model versions or optimizations are deployed to a small traffic subset, and their TPOT is compared against the baseline before full rollout.
- Bottleneck Identification: When TPOT violates SLOs, profiling identifies the limiting resource (compute, memory bandwidth) to guide remediation.
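A percentile-based SLO check like the P95 example above can be sketched as follows. The nearest-rank percentile definition, the helper names, and the sample values are assumptions chosen for illustration; monitoring systems often use interpolated or histogram-based percentiles instead.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_tpot_slo(tpot_ms_samples: Sequence[float],
                   slo_ms: float = 75.0, pct: float = 95.0) -> bool:
    """True if the chosen percentile of per-request TPOT is below the SLO threshold."""
    return percentile(tpot_ms_samples, pct) < slo_ms


# Hypothetical per-request TPOT samples (ms/token) for a baseline and a canary.
baseline = [50.0] * 18 + [70.0, 120.0]
canary = [50.0] * 17 + [80.0, 90.0, 120.0]

print(percentile(baseline, 95), meets_tpot_slo(baseline))  # prints: 70.0 True
print(percentile(canary, 95), meets_tpot_slo(canary))      # prints: 90.0 False
```

In this toy data the canary would be held back: its mean TPOT is close to the baseline's, but its P95 crosses the 75 ms/token target.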
How TPOT is Measured and Calculated
Time Per Output Token (TPOT) is a critical performance metric for autoregressive models, quantifying the speed of text generation after the initial response.
Time Per Output Token (TPOT) is the average latency incurred for generating each subsequent token after the first in an autoregressive model, directly impacting the speed of streaming completions. It is calculated by measuring the total decoding latency—the time from the generation of the first token to the generation of the final token—and dividing it by the number of tokens generated in that interval (total output tokens minus one). This isolates the incremental cost of the autoregressive loop from the initial prefilling latency.
Accurate TPOT measurement requires a stable, isolated environment to avoid interference from request queuing delay or system noise. Profiling tools capture timestamps for each generated token. The metric is highly sensitive to payload size (output length), model architecture, and hardware efficiency. Engineers use TPOT to optimize continuous batching strategies and evaluate the impact of techniques like speculative decoding or model quantization on generation throughput for real-time applications.
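The timestamp-based measurement described above can be sketched by wrapping any token iterator and recording the arrival time of each token. The `fake_stream` generator below is a hypothetical stand-in for a real streaming client, and the injectable `clock` parameter is an assumption added to make the function easy to test.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(token_stream: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> Tuple[float, float]:
    """Consume a token stream and return (ttft_seconds, tpot_seconds)."""
    start = clock()
    stamps = [clock() for _ in token_stream]  # one timestamp per received token
    if len(stamps) < 2:
        raise ValueError("need at least two tokens to compute TPOT")
    ttft = stamps[0] - start
    # Decoding latency runs from the first token to the last token.
    tpot = (stamps[-1] - stamps[0]) / (len(stamps) - 1)
    return ttft, tpot


def fake_stream(num_tokens: int, first_delay_s: float, step_delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming completion; sleeps to mimic latency."""
    for i in range(num_tokens):
        time.sleep(first_delay_s if i == 0 else step_delay_s)
        yield f"tok{i}"


ttft_s, tpot_s = measure_stream(fake_stream(5, first_delay_s=0.05, step_delay_s=0.01))
print(f"TTFT ~ {ttft_s * 1000:.0f} ms, TPOT ~ {tpot_s * 1000:.0f} ms/token")
```

Timestamps taken on the client include network and serialization delays; taking them on the server isolates the decoding loop itself.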
TPOT vs. Other Key Latency Metrics
A comparison of Time Per Output Token (TPOT) with other critical latency metrics used to profile and optimize inference performance.
| Metric | Definition | Primary Driver | Key Impact | Typical Optimization Target |
|---|---|---|---|---|
| Time Per Output Token (TPOT) | Average latency to generate each subsequent token after the first in an autoregressive model. | Decoding compute, memory bandwidth for KV cache. | Speed of streaming completions, user-perceived fluidity. | GPU kernel efficiency, KV cache memory management (e.g., PagedAttention). |
| Time to First Token (TTFT) | Duration from request start to the generation/delivery of the first output token. | Prefilling compute, model loading (cold start), initial prompt processing. | Perceived responsiveness, user engagement onset. | Prefilling latency reduction, continuous batching, cold start mitigation. |
| End-to-End Latency | Total elapsed time from client request initiation to receipt of complete response. | Network transmission, queuing, TTFT, TPOT, serialization. | Overall user experience, task completion time. | System-wide optimization, reducing queuing delays, network overhead. |
| Tail Latency (P95/P99) | High-percentile (e.g., 95th/99th) response times representing the slowest requests. | Resource contention, garbage collection, straggler requests, variable input lengths. | Worst-case user experience, system reliability and stability. | Load balancing, request scheduling, mitigating outliers, robust autoscaling. |
| Throughput (QPS) at Latency SLO | Number of queries per second a system can process while meeting a latency Service Level Objective. | Batch size, hardware utilization (GPU/CPU), efficient scheduling. | System capacity and cost-efficiency at a defined performance guarantee. | Finding the optimal throughput-latency curve operating point via continuous batching. |
| Cold Start Latency | Additional delay for the first request(s) to an unloaded model. | Model loading from disk, initialization, and cache warming. | Service readiness after deployment or scaling, user experience for infrequent endpoints. | Model quantization, faster storage, pre-warming strategies, keeping models resident. |
| Prefilling Latency | Time to process the static input prompt through the model's forward pass. | Prompt length, model architecture (attention complexity), compute hardware. | Delay before token generation can begin, part of TTFT. | Optimized attention implementations for long contexts, operator fusion. |
| Decoding Latency | Time consumed during the autoregressive token generation phase. | Autoregressive step computation, KV cache memory access, generated sequence length. | Directly correlates with TPOT; the core latency of text generation. | Speculative decoding, optimized decoding kernels, attention optimization. |
Frequently Asked Questions
Time Per Output Token (TPOT) is a critical latency metric for autoregressive models like LLMs, measuring the speed of streaming text generation after the initial prompt processing.
Time Per Output Token (TPOT) is the average latency incurred for generating each subsequent token after the first in an autoregressive language model. It directly measures the speed of the decoding phase, where the model produces a sequence one token at a time, with each new token conditioned on all previous ones. This metric is distinct from Time to First Token (TTFT), which measures the initial processing delay. TPOT is the primary determinant of perceived speed in streaming applications like chatbots, as it governs the rate at which text appears to the user. It is typically measured in milliseconds per token (ms/tok).
Related Terms
Time Per Output Token (TPOT) is a core metric for evaluating the speed of text generation. It exists within a broader ecosystem of performance indicators and optimization techniques. The following terms are essential for a complete understanding of inference latency profiling.
Time to First Token (TTFT)
Time to First Token (TTFT), also called First Token Latency, is the duration from the start of an inference request to when the first token of the output is generated or delivered. It is the critical metric for perceived responsiveness in streaming applications.
- Measures initial processing: Includes prefilling latency (processing the static prompt) and the first autoregressive step.
- Key user experience signal: A low TTFT makes an application feel instant, even if subsequent tokens arrive more slowly.
- Contrast with TPOT: TTFT measures the start of generation, while TPOT measures the sustained speed of generation after the first token.
End-to-End Latency
End-to-End Latency is the total elapsed time from the moment a client initiates a request until the complete, final response is received and usable. It is the ultimate measure of system responsiveness from the user's perspective.
- Holistic measurement: Encompasses network transmission, server-side queuing, full model execution (TTFT + TPOT * (output length - 1)), and any post-processing.
- Critical for SLOs: Service Level Objectives (SLOs) for user-facing applications are typically defined on end-to-end latency percentiles (e.g., P99 < 2 seconds).
- Broader than TPOT: TPOT is a major component of the server-side computation portion of end-to-end latency.
Prefilling Latency
Prefilling Latency is the time required for a language model to process the static input prompt and context through its initial forward pass, generating the first Key-Value (KV) cache before autoregressive token generation begins.
- A component of TTFT: Prefilling latency is the dominant part of TTFT for long input contexts.
- Compute-bound operation: Involves a full, parallel forward pass through the model's layers for the entire input sequence.
- Impact on TPOT: Efficient prefilling and KV cache creation sets up the efficient, memory-bound decoding phase that determines TPOT.
Decoding Latency
Decoding Latency is the total time consumed during the autoregressive token generation phase of inference, where each new token is produced conditioned on all previously generated tokens. TPOT is the average decoding latency per token.
- Memory-bound phase: Dominated by reading the large KV cache from high-bandwidth memory (HBM) for each layer, as compute per token is relatively small.
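- Impact of sequence length: Per-token latency can increase as the KV cache grows, because each step reads more cached keys and values. Techniques like PagedAttention reduce the memory wasted on the cache, which leaves room for larger batches, but do not remove this growth.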
- Primary target for optimization: Techniques like speculative decoding, continuous batching, and optimized kernels aim to reduce decoding latency/improve TPOT.
Throughput-Latency Curve
A Throughput-Latency Curve is a graph that plots the relationship between a system's request throughput (e.g., Queries Per Second) and its corresponding average or tail latency. It defines the fundamental trade-off in inference serving systems.
- Identifies optimal operating point: Shows the maximum throughput achievable before latency rises sharply due to queuing as the system approaches saturation.
- TPOT's role: A lower TPOT shifts the entire curve, allowing higher throughput at the same latency target, or lower latency at the same throughput.
- Critical for capacity planning: Used to determine the required hardware to meet a specific Service Level Objective (SLO) for a predicted load.
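Given a measured curve, picking the operating point reduces to finding the highest throughput whose latency still meets the SLO. The sketch below assumes the curve is available as a list of (QPS, P95 latency in ms) pairs from load tests; the function name and data points are made up for illustration.

```python
from typing import List, Optional, Tuple


def max_qps_within_slo(curve: List[Tuple[float, float]], slo_ms: float) -> Optional[float]:
    """Highest measured QPS whose latency is at or below the SLO, or None if none qualifies."""
    eligible = [qps for qps, latency_ms in curve if latency_ms <= slo_ms]
    return max(eligible) if eligible else None


# Hypothetical load-test results: (QPS, P95 latency in ms).
measured = [(2, 40.0), (4, 45.0), (8, 60.0), (12, 95.0), (16, 240.0)]

print(max_qps_within_slo(measured, slo_ms=75.0))  # prints: 8
print(max_qps_within_slo(measured, slo_ms=30.0))  # prints: None
```

Dividing the predicted peak load by the QPS found this way gives a first estimate of the number of replicas needed.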
Continuous Batching
Continuous Batching, also known as dynamic or in-flight batching, is an inference optimization technique where new requests are dynamically added to a running batch on the GPU as previous requests finish generation. It is essential for achieving high throughput while keeping TPOT within target in production.
- Maximizes GPU utilization: Eliminates idle time by keeping the computational units constantly occupied, unlike static batching.
- Effect on TPOT: By maintaining a larger, more consistent batch size for the decoding phase, it amortizes fixed overheads and weight reads, lowering the cost per generated token across the system. An individual request's TPOT can still rise as the batch grows.
- Core to modern servers: Implemented in engines like vLLM and NVIDIA's TensorRT-LLM to serve variable-length requests efficiently.
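The difference from static batching can be illustrated with a toy simulation that counts decoding steps. This sketch assumes every decoding step takes the same time regardless of batch size and that each request needs a known number of output tokens; it is not how vLLM or TensorRT-LLM actually schedule work, and the function names are hypothetical.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Each fixed batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """A freed slot is refilled from the queue before the next decoding step."""
    waiting = deque(lengths)
    running: List[int] = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        steps += 1  # one decoding step emits one token per active request
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


# Hypothetical workload: one long request and three short ones, two batch slots.
output_lengths = [10, 2, 2, 2]
print(static_batching_steps(output_lengths, slots=2))      # prints: 12
print(continuous_batching_steps(output_lengths, slots=2))  # prints: 10
```

In the static case the slot beside the long request sits idle for eight steps; continuous batching fills it with the waiting short requests.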

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.