Tokens per Second (TPS) is a throughput metric that quantifies the number of output tokens an LLM inference system can generate in one second, typically reported as a peak or sustained rate under specific hardware, batching, and sequence length configurations. It is a critical Key Performance Indicator (KPI) for production deployments, directly correlating with infrastructure cost and user experience for streaming applications. High TPS indicates efficient utilization of GPU or NPU resources, often achieved through techniques like continuous batching and optimized KV cache management.
Tokens per Second (TPS)

What is Tokens per Second (TPS)?
A core throughput metric for evaluating the operational efficiency of large language model inference systems.
Monitoring TPS alongside latency metrics like Time to First Token (TTFT) and inter-token latency provides a complete view of system performance. Engineers track TPS to benchmark hardware, optimize inference parameters, and validate that deployments meet Service Level Objectives (SLOs). It is influenced by model architecture size, quantization level, and the efficiency of the underlying serving framework, making it a fundamental measure for LLM performance monitoring and capacity planning.
Key Factors Influencing TPS
Tokens per Second (TPS) is a critical throughput metric for LLM inference. Its value is not intrinsic to a model but is determined by a complex interplay of hardware capabilities, software optimizations, and request characteristics.
Hardware & Compute
The foundational layer determining peak TPS. Key components include:
- GPU/Accelerator: The type (e.g., H100, A100), memory bandwidth, and number of devices. Higher compute throughput (FLOPS) speeds up the compute-bound prefill phase, while higher memory bandwidth speeds up the decode phase, which is typically memory-bound.
- Memory Capacity: The size of VRAM dictates the maximum model size and batch size that can be loaded. Insufficient memory forces costly swapping to CPU RAM, crippling TPS.
- Interconnect: The speed of links between GPUs (e.g., NVLink, InfiniBand) is crucial for multi-GPU inference, affecting how efficiently workloads are distributed.
Model Architecture & Size
The model's own design imposes fundamental constraints.
- Parameter Count: Larger models (e.g., 70B vs 7B parameters) require more computations per token, reducing TPS for a given hardware setup.
- Attention Mechanism: The complexity of the attention operation (e.g., standard vs. grouped-query or multi-query attention) significantly impacts the computational graph. Optimized variants can dramatically improve TPS.
- Precision: Using lower numerical precision (e.g., FP16, BF16, INT8, or INT4 quantization) reduces memory footprint and increases compute efficiency, often doubling or tripling TPS compared to FP32.
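To make the link between parameter count, precision, and decode speed concrete, here is a rough back-of-the-envelope sketch. It assumes single-stream decoding is memory-bandwidth bound, so every generated token requires reading all weights once; it ignores KV cache reads, kernel overhead, and batching. The function name and the 2000 GB/s bandwidth figure are illustrative assumptions, not measurements of any specific device.

```python
def decode_tps_upper_bound(params_billion: float, bits_per_weight: int,
                           mem_bandwidth_gb_s: float) -> float:
    """Rough ceiling on single-stream decode tokens/s when weight reads dominate."""
    # 1e9 parameters * (bits / 8) bytes per parameter = size in GB.
    weight_gb = params_billion * bits_per_weight / 8
    return mem_bandwidth_gb_s / weight_gb


# Hypothetical accelerator with 2000 GB/s of memory bandwidth (assumed value).
for name, params, bits in [("7B FP16", 7, 16), ("7B INT4", 7, 4), ("70B FP16", 70, 16)]:
    bound = decode_tps_upper_bound(params, bits, 2000.0)
    print(f"{name}: at most ~{bound:.0f} tokens/s per stream")
```

Under these assumptions, a 10x larger model lowers the ceiling by 10x, and a 4x smaller weight format raises it by 4x. Real systems land below the ceiling.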
Inference Serving & Batching
Software orchestration of requests is paramount for achieving high aggregate TPS.
- Static vs. Continuous Batching: Static batching processes a fixed set of requests together. Continuous batching (or iteration-level batching) dynamically adds new requests to the batch as others finish, dramatically improving GPU utilization and overall system TPS.
- KV Cache Management: Efficient caching of previous tokens' Key and Value vectors in the attention mechanism avoids redundant computation. The size and management strategy of this cache directly affect memory usage and decode speed.
- Serving Framework: Specialized frameworks like vLLM, TensorRT-LLM, or TGI implement these optimizations, offering vastly higher TPS than naive model serving.
Request Characteristics
The nature of the user's input and desired output determines the workload for each request.
- Input (Prompt) Length: Longer prompts increase the time and memory required for the initial prefill phase, impacting Time to First Token (TTFT) and the efficiency of the initial batch computation.
- Output (Generation) Length: The number of tokens to be generated defines the duration of the decode phase. Longer generations increase total request latency but, with efficient batching, can improve overall system TPS by keeping GPUs saturated.
- Decoding Strategy: Greedy decoding (selecting the highest probability token) is fastest. Sampling methods (top-k, top-p, temperature) introduce minor overhead. Complex strategies like beam search multiply the computational cost, severely reducing TPS.
System & Network Overhead
Infrastructure factors outside the core model execution can become bottlenecks.
- Host CPU & I/O: The CPU must be powerful enough to handle tokenization, detokenization, and framework overhead without stalling the GPU. Slow disk I/O can delay model loading.
- Network Latency: In distributed or client-server setups, the time to transmit prompts and stream tokens adds to total response time, though it doesn't reduce the model's intrinsic TPS.
- Orchestration & Queuing: Load balancers, API gateways, and request queues add minimal but non-zero latency. Under high load, queuing delays can dominate the user-perceived latency, masking the actual TPS of the inference engine.
Measurement Context
TPS is a variable metric, and its reported value must be interpreted with context.
- Peak vs. Sustained TPS: Peak TPS is measured under ideal, saturated conditions (large batch, optimal input/output lengths). Sustained TPS reflects performance under a realistic, variable production workload and is typically lower.
- Benchmark Conditions: Always ask: Batch size? Input/Output length? Hardware spec? Model precision? Without this context, a TPS number is meaningless for comparison.
- Trade-off with Latency: Maximizing TPS often involves large batch sizes, which can increase latency for individual requests (higher TTFT, queue time). Engineering must balance throughput with latency Service Level Objectives (SLOs).
TPS vs. Other Key LLM Performance Metrics
A comparison of Tokens per Second (TPS) with other critical metrics used to monitor and evaluate the performance, efficiency, and user experience of LLM inference systems.
| Metric | Primary Focus | Key Driver | User Experience Impact | Optimization Target |
|---|---|---|---|---|
| Tokens per Second (TPS) | System Throughput | Hardware FLOPs, Batch Size, Decode Efficiency | High throughput enables faster completion for batch jobs. | GPU Utilization, Continuous Batching, Kernel Optimization |
| Time to First Token (TTFT) | Response Responsiveness | Prefill Computation, Model Size, Request Queueing | Perceived startup delay for a new request. | Prefill Optimization, Speculative Decoding, Model Quantization |
| Inter-Token Latency | Streaming Fluency | Autoregressive Decode Cost, Memory Bandwidth | Perceived speed of text generation in a stream. | KV Cache Optimization, Attention Kernels, Memory I/O |
| Latency Percentiles (P90, P99) | Tail Latency & Consistency | System Noise, Garbage Collection, Multi-tenancy Interference | Worst-case experience for a fraction of users. | Resource Isolation, Load Shedding, Request Prioritization |
| Model Accuracy (e.g., Perplexity) | Output Quality | Model Architecture, Training Data, Fine-Tuning | Correctness and usefulness of the generated content. | Model Selection, Fine-Tuning, RAG Augmentation |
| Error Rate | System Reliability | Infrastructure Failures, Model Loading, Input Validation | Frequency of failed requests or erroneous outputs. | Health Checks, Graceful Degradation, Retry Logic |
| Cost per Token | Economic Efficiency | Hardware Cost, Cloud Pricing, Model Efficiency | Directly impacts operational budget and scalability. | Instance Selection, Spot Usage, Model Compression |
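The relationship between TPS and Cost per Token in the table can be sketched with simple arithmetic. The snippet below assumes a fixed hourly instance price and a steady sustained TPS; both input values are hypothetical and the function name is illustrative.

```python
def cost_per_million_tokens(hourly_cost_usd: float, sustained_tps: float) -> float:
    """Serving cost in USD per one million output tokens."""
    tokens_per_hour = sustained_tps * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical: a $4.00/hour instance sustaining 2000 tokens/s.
print(f"${cost_per_million_tokens(4.00, 2000.0):.3f} per million tokens")
```

Because cost scales inversely with sustained TPS at a fixed price, doubling throughput on the same instance halves the cost per token.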
Common Techniques for Optimizing TPS
Achieving high Tokens per Second requires a multi-faceted approach targeting computational efficiency, memory bandwidth, and system architecture. These core techniques are fundamental to reducing inference cost and latency.
Continuous Batching
Continuous batching dynamically groups incoming inference requests into a single computational batch, adding new requests as others finish generation. This maximizes GPU utilization by eliminating idle time between static batches.
- Key Benefit: Dramatically improves hardware throughput compared to static batching.
- Implementation: Used in serving frameworks like vLLM and TGI (Text Generation Inference).
- Impact: Can increase effective TPS by 5-10x for workloads with variable request lengths.
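The effect of refilling freed batch slots can be illustrated with a toy simulation. This is a minimal sketch, not how any particular serving framework schedules work: it assumes every decode step costs the same regardless of how many slots are occupied, ignores prefill, and uses made-up output lengths.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Freed slots are refilled from the queue before every decode step."""
    queue = deque(output_lens)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode step: every active request emits one token.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
        steps += 1
    return steps


lens = [10, 200, 15, 180, 20, 12, 190, 8]  # hypothetical output lengths
total_tokens = sum(lens)
for name, fn in [("static", static_batching_steps), ("continuous", continuous_batching_steps)]:
    steps = fn(lens, batch_size=4)
    print(f"{name}: {steps} steps, {total_tokens / steps:.2f} tokens per step")
```

In this toy workload the short requests finish early, and continuous batching reuses their slots instead of leaving them idle until the longest request in the batch completes.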
KV Cache Optimization
The Key-Value (KV) Cache stores computed attention key and value vectors for previously generated tokens, preventing redundant computation during autoregressive decoding. Optimizing its management is critical for TPS.
- Memory Bottleneck: KV cache size grows linearly with batch size and sequence length, often becoming the limiting factor.
- Techniques: Include paged attention (vLLM) to eliminate memory fragmentation and quantized KV caches (e.g., FP8) to reduce memory bandwidth pressure.
- Result: Efficient KV cache management directly reduces decode latency, boosting TPS.
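The linear growth of the KV cache can be estimated directly from model dimensions. The sketch below uses the standard sizing arithmetic (keys plus values, per layer, per KV head); the model configuration shown is a hypothetical example, not taken from a specific model card.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size; the factor of 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical config: 32 layers, head_dim 128, FP16 cache (2 bytes per element).
full_mha = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
grouped = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=1)  # fewer KV heads
fp8_cache = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1, bytes_per_elem=1)

print(f"32 KV heads, FP16: {full_mha / GIB:.2f} GiB per 4096-token request")
print(f" 8 KV heads, FP16: {grouped / GIB:.2f} GiB per 4096-token request")
print(f"32 KV heads, FP8:  {fp8_cache / GIB:.2f} GiB per 4096-token request")
```

Multiplying the per-request figure by the batch size shows how quickly the cache, rather than the weights, can become the memory limit.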
Model Quantization
Quantization reduces the numerical precision of a model's weights and activations (e.g., from 16-bit to 8-bit or 4-bit). This decreases memory footprint and increases compute efficiency.
- Types: Post-Training Quantization (PTQ) applies compression after training; Quantization-Aware Training (QAT) fine-tunes the model to compensate for precision loss.
- Frameworks: GPTQ, AWQ, and bitsandbytes are common libraries.
- TPS Gain: Can enable 2-4x higher batch sizes on the same hardware, which raises aggregate TPS as long as the extra batch capacity is actually used and the accuracy loss is acceptable.
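Why smaller weights allow larger batches can be sketched as a memory budget: whatever VRAM the weights do not occupy is available for the KV cache. The numbers below are assumptions for illustration (a 40 GB device, a 13B-parameter model, and 819,200 bytes of KV cache per token); the sketch ignores activations and framework overhead.

```python
def max_cached_tokens(vram_gb: float, params_billion: float,
                      bits_per_weight: int, kv_bytes_per_token: int) -> int:
    """Tokens of KV cache that fit after the weights are loaded."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    free_bytes = vram_gb * 1e9 - weight_bytes
    return max(0, int(free_bytes // kv_bytes_per_token))


KV_BYTES_PER_TOKEN = 819_200  # assumed: 2 * 40 layers * 40 heads * 128 dim * 2 bytes

baseline = max_cached_tokens(40, 13, 16, KV_BYTES_PER_TOKEN)
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    tokens = max_cached_tokens(40, 13, bits, KV_BYTES_PER_TOKEN)
    print(f"{label}: {tokens} cacheable tokens ({tokens / baseline:.2f}x FP16)")
```

The extra cache capacity translates into more concurrent requests or longer contexts, which is where the throughput gain comes from.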
Flash Attention & Kernel Fusion
Flash Attention is an optimized algorithm that computes exact attention with significantly fewer memory reads/writes (I/O operations) by fusing computation kernels. This reduces the time spent on the attention mechanism, which is often the computational bottleneck.
- Principle: Leverages hardware memory hierarchy (SRAM vs. HBM) for efficient data movement.
- Impact: Reduces prefill latency (improving TTFT) and increases overall compute throughput, contributing to higher TPS, especially for long contexts.
- Extension: FlashAttention-2 and related kernel fusions for other operations (e.g., MLP layers) provide further gains.
Speculative Decoding
Speculative decoding uses a small, fast draft model to predict a sequence of several future tokens. These are then verified in a single forward pass by the larger target model, rejecting mismatches.
- Mechanism: Amortizes the cost of the large model's forward pass over multiple tokens.
- Requirement: The draft model must have high predictive accuracy (e.g., a distilled version of the target).
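- Outcome: Can achieve 2-3x latency reduction for the same model, effectively doubling or tripling per-request TPS in supported scenarios. Gains shrink when the draft model's acceptance rate is low or when large batches already keep the accelerator compute-bound.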
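The amortization can be quantified with a simple expectation. The sketch below assumes, as a simplification, that each draft token is accepted independently with the same probability; real acceptance rates vary by position and prompt. Each verification pass always yields at least one token from the target model, plus however many leading draft tokens were accepted.

```python
def expected_tokens_per_verify_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    P(at least k draft tokens accepted) = accept_rate ** k, so the expectation
    is the sum of accept_rate ** k for k = 0..draft_len.
    """
    return sum(accept_rate ** k for k in range(draft_len + 1))


# Hypothetical acceptance rates with 4 draft tokens per pass.
for rate in (0.5, 0.8, 0.95):
    tokens = expected_tokens_per_verify_pass(rate, draft_len=4)
    print(f"accept_rate={rate}: {tokens:.2f} tokens per target pass")
```

This counts tokens per target-model pass only; the net speedup is lower once the draft model's own runtime is included.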
Hardware-Specific Optimizations
Maximizing TPS requires tailoring the inference stack to the underlying accelerator architecture (GPU, NPU, etc.).
- Kernel Optimization: Writing custom CUDA/HIP kernels or using compiler frameworks like OpenAI Triton for optimal hardware utilization.
- Model Compilation: Using NVIDIA TensorRT or OpenXLA to compile the model graph into an optimized executable with fused kernels.
- Memory Allocation: Employing efficient, pre-allocated memory pools to reduce overhead from dynamic allocation during inference.
- Result: These low-level optimizations provide the final performance layer, often yielding 20-50% TPS improvements over generic implementations.
Frequently Asked Questions
Essential questions and answers about Tokens per Second (TPS), a critical throughput metric for evaluating and optimizing the performance of large language model inference systems.
Tokens per Second (TPS) is a throughput metric that quantifies the number of output tokens a large language model (LLM) inference system can generate in one second. It is measured by benchmarking the system under a specific load, typically reporting either a peak rate (maximum achievable under ideal conditions) or a sustained rate (average over a prolonged period). Measurement requires a controlled environment with defined parameters: hardware (e.g., GPU type and count), software stack (inference server, kernels), batch size, sequence lengths, and the specific model being served. The metric is calculated as (Total Output Tokens Generated) / (Total Wall-Clock Time) for a given benchmark run.
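The calculation above can be sketched as a small timing harness. Here `generate` is a hypothetical callable standing in for a real inference call, and `fake_generate` is a placeholder so the example runs without a model; neither name comes from an actual serving library.

```python
import time


def tokens_per_second(total_output_tokens: int, wall_clock_seconds: float) -> float:
    """TPS = total output tokens generated / total wall-clock time."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return total_output_tokens / wall_clock_seconds


def benchmark(generate, prompts) -> float:
    """Time a full run; `generate` returns the list of output tokens for a prompt."""
    start = time.perf_counter()
    total_tokens = sum(len(generate(prompt)) for prompt in prompts)
    elapsed = time.perf_counter() - start
    return tokens_per_second(total_tokens, elapsed)


def fake_generate(prompt: str) -> list:
    """Placeholder for a real model call: sleeps briefly, returns 8 dummy tokens."""
    time.sleep(0.002)
    return ["tok"] * 8


print(f"{benchmark(fake_generate, ['a', 'b', 'c']):.0f} tokens/s")
```

A real benchmark would also record the hardware, batch size, sequence lengths, and model precision alongside the number, since the result is only comparable under the same conditions.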
Related Terms
Tokens per Second (TPS) is a core throughput metric, but it must be analyzed alongside other key performance indicators to fully understand LLM inference system behavior.
Time to First Token (TTFT)
Time to First Token measures the latency from request submission to the arrival of the first output token. This metric is dominated by the prefill stage, where the model processes the entire input prompt to compute initial logits and populate the KV Cache. High TTFT can indicate insufficient compute resources for prompt processing or inefficient batching.
- Primary Driver: Computational cost of the attention mechanism over the full input sequence.
- Key Trade-off: Often in tension with TPS; larger batches raise throughput but lengthen queueing and prefill time for each request.
- User Impact: Directly affects perceived responsiveness in streaming applications.
Inter-Token Latency
Inter-Token Latency is the average time interval between the generation of consecutive output tokens during the autoregressive decode stage. For a single request stream it is the reciprocal of that stream's decode rate (per-stream TPS = 1 / Inter-Token Latency); aggregate system TPS is higher when many requests are decoded in the same batch. This metric determines the fluency of a streaming response.
- Primary Driver: Memory bandwidth for reading the KV Cache and compute for generating the next token.
- Optimization Target: Techniques like continuous batching and optimized attention kernels aim to minimize this latency.
- Monitoring Context: Should be tracked via latency percentiles (P50, P90, P99) to understand tail performance and variability.
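Inter-token latency can be derived from token arrival timestamps, as in the sketch below. The timestamps are made up, and the aggregate figure assumes 32 concurrent streams that all decode at the same rate, which is an idealization.

```python
def inter_token_latencies(arrival_times_s):
    """Gaps in seconds between consecutive token arrival timestamps."""
    return [later - earlier
            for earlier, later in zip(arrival_times_s, arrival_times_s[1:])]


def mean(values):
    return sum(values) / len(values)


# Hypothetical arrival times (seconds since request start) for one stream.
arrivals = [0.500, 0.530, 0.550, 0.580, 0.600]

itl = mean(inter_token_latencies(arrivals))
per_stream_tps = 1.0 / itl           # decode rate seen by one user
aggregate_tps = 32 * per_stream_tps  # assumed: 32 identical concurrent streams

print(f"mean ITL: {itl * 1000:.1f} ms")
print(f"per-stream TPS: {per_stream_tps:.1f}")
print(f"aggregate TPS: {aggregate_tps:.1f}")
```

The first timestamp corresponds to TTFT and is excluded from the gaps, so prefill time does not distort the decode-phase average.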
Continuous Batching
Continuous Batching is an inference optimization technique that dynamically adds new requests to a running batch as previous requests finish generation, instead of waiting for the entire batch to complete. This dramatically improves GPU utilization and overall system throughput (TPS).
- Mechanism: The scheduler manages a pool of active requests, adding and removing them from the computational batch in real-time.
- Impact on TPS: Can increase effective TPS by 5-10x compared to static batching by eliminating idle GPU cycles.
- Consideration: Increases implementation complexity for the inference server to manage variable-length sequences and partial completions.
KV Cache
The Key-Value (KV) Cache is a critical memory structure in transformer-based LLM inference. It stores the computed key and value vectors for all previous tokens in a sequence, allowing the attention mechanism to reference past context without recalculating it for each new token.
- Purpose: Eliminates redundant computation during autoregressive decoding, drastically speeding up token generation.
- Trade-off: Consumes significant GPU memory (potentially multiple GB per request at long context lengths), which can limit batch size and throughput.
- Optimization: Techniques like paged attention and quantization of the KV cache are used to manage memory footprint and improve TPS.
Service Level Indicators (SLIs) & Objectives (SLOs)
Service Level Indicators (SLIs) are quantitative measures of an LLM service's performance, such as average TPS, P99 latency, or error rate. A Service Level Objective (SLO) is a target value for an SLI that defines acceptable service performance.
- Example SLO: "99% of requests shall have a Time to First Token below 500ms and a sustained TPS above 45 tokens/second."
- Error Budget: The allowable amount of SLO violation over a period, used to govern release velocity and prioritization of reliability work.
- Foundation: Essential for defining and monitoring production-grade LLM performance beyond isolated TPS measurements.
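An SLO check and the corresponding error budget can be sketched as follows. The sample data is synthetic, and the 500 ms threshold and 99% target are taken from the example SLO above; the function names are illustrative.

```python
def slo_compliance(ttft_ms, threshold_ms: float = 500.0) -> float:
    """Fraction of requests whose TTFT is below the threshold."""
    good = sum(1 for t in ttft_ms if t < threshold_ms)
    return good / len(ttft_ms)


def error_budget_remaining(compliance: float, target: float = 0.99) -> float:
    """Share of the error budget still unspent (negative means the SLO is violated)."""
    allowed_bad = 1.0 - target
    observed_bad = 1.0 - compliance
    return 1.0 - observed_bad / allowed_bad


# Synthetic window: 994 fast requests and 6 slow ones.
ttft_samples = [120.0] * 994 + [900.0] * 6

compliance = slo_compliance(ttft_samples)
print(f"compliance: {compliance:.1%}")
print(f"error budget remaining: {error_budget_remaining(compliance):.0%}")
```

In this window 0.6% of requests miss the threshold against an allowance of 1%, so part of the budget is still available.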
Latency Percentiles (P50, P90, P99)
Latency Percentiles describe the distribution of response times. The P50 (median), P90, and P99 latencies are the values at or below which 50%, 90%, and 99% of requests complete, respectively. They are crucial for understanding tail latency.
- Importance for TPS: Average TPS can mask poor user experience. A high P99 latency for Inter-Token Latency means 1% of users experience very slow streaming, even if average TPS is good.
- Monitoring: Requires high-cardinality metric collection to segment by model, hardware, and request parameters.
- Use Case: Informing capacity planning and autoscaling decisions to meet SLOs under variable load.
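How an average can hide a slow tail is easy to see with a small calculation. The sketch below uses the nearest-rank percentile definition (one of several common conventions) on synthetic inter-token latencies.

```python
import math


def percentile(values, p: int):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic inter-token latencies in ms: 97 fast samples and 3 slow outliers.
latencies_ms = [20] * 97 + [400] * 3

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean_ms:.1f} ms")
for p in (50, 90, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

The mean and the P50 and P90 values all look healthy here, while P99 exposes the slow outliers that a fraction of users would actually experience.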

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.