Tokens Per Second (TPS) is a throughput metric that quantifies the raw inference speed of a language model or AI agent by measuring the number of output tokens it can generate per second. It is a direct indicator of a model's computational efficiency and a key factor in determining the end-to-end latency and perceived responsiveness of an AI application. For agentic observability, TPS is a fundamental telemetry signal used to monitor system health and detect performance bottlenecks in production.
Tokens Per Second (TPS)

What is Tokens Per Second (TPS)?
A core throughput metric for generative AI systems.
In practical terms, TPS is influenced by hardware (e.g., GPU type), model architecture, and inference optimization techniques like continuous batching. While high TPS indicates fast raw generation, it must be analyzed alongside Time to First Token (TTFT) and tail latency to fully understand user experience. For agent performance benchmarking, TPS is critical for capacity planning, cost per thousand tokens estimation, and ensuring systems meet defined Service Level Objectives (SLOs) for throughput.
Key Factors Influencing TPS
Tokens Per Second (TPS) is a critical throughput metric for AI inference. Its value is not intrinsic to a model but is determined by a complex interplay of hardware, software, and system design choices.
Model Architecture & Size
The fundamental design of the language model is a primary determinant of TPS.
- Parameter Count: Larger models (e.g., 70B+ parameters) require more computations per token, reducing TPS compared to smaller models (e.g., 7B parameters) on identical hardware.
- Architecture Family: Architectures like Mixture of Experts (MoE) can achieve higher TPS than dense models of equivalent parameter count by activating only a subset of weights per token.
- Context Window Length: Processing very long sequences increases memory bandwidth pressure for the Key-Value (KV) Cache, which can bottleneck TPS.
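To make the context-length point concrete, the sketch below estimates KV cache size using the standard accounting of one key tensor and one value tensor per layer. The model shape (32 layers, 32 KV heads, head dimension 128, FP16 cache) is a hypothetical example, not a figure from any particular model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor of 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, chosen only for illustration.
short_ctx = kv_cache_bytes(32, 32, 128, seq_len=2_048)
long_ctx = kv_cache_bytes(32, 32, 128, seq_len=32_768)

print(f"2k-token context:  {short_ctx / 2**30:.1f} GiB per sequence")
print(f"32k-token context: {long_ctx / 2**30:.1f} GiB per sequence")
```

The cache grows linearly with sequence length and batch size, and it has to be read at every decode step, which is why long contexts can cut into TPS.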
Hardware & Compute
The physical infrastructure executing the model defines the upper bound for TPS.
- GPU/Accelerator Type: Modern AI accelerators (e.g., NVIDIA H100, Google TPU v5e) with specialized matrix-multiplication hardware provide vastly higher TPS than general-purpose CPUs.
- Memory Bandwidth: The speed at which model weights can be read from GPU VRAM (High Bandwidth Memory) is often the limiting factor for TPS during token-by-token decoding, a condition known as being memory-bound.
- Batching Strategy: Continuous batching dynamically groups requests, maximizing GPU utilization and significantly improving aggregate TPS over static batching.
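The difference between static and continuous batching can be sketched with a toy simulation. It assumes every decode step takes the same time no matter how many batch slots are active, and it ignores prefill; the function names and request lengths are hypothetical.

```python
def static_batching_steps(output_lens, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Continuous batching: a freed slot is refilled at the next decode step."""
    queue = list(output_lens)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        # Each active request emits one token; finished requests leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: short and long generations interleaved.
lens = [10, 100, 10, 100, 10, 100, 10, 100]
static_steps = static_batching_steps(lens, batch_size=2)
continuous_steps = continuous_batching_steps(lens, batch_size=2)

print("tokens per step, static:    ", sum(lens) / static_steps)
print("tokens per step, continuous:", sum(lens) / continuous_steps)
```

In this toy workload the static scheduler leaves a slot idle while each long request finishes, so the same tokens take more decode steps.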
Inference Optimization
Software-level optimizations are essential for achieving peak hardware performance.
- Quantization: Reducing model weight precision from 16-bit (FP16) to 8-bit (INT8) or 4-bit (NF4) roughly halves or quarters the weight bytes read per token, which directly boosts TPS when decoding is memory-bound. Accuracy loss is often small but depends on the model and the quantization method.
- Kernel Optimization: Using highly optimized, low-level compute kernels (e.g., FlashAttention) reduces the memory traffic and overhead of attention mechanisms.
- Compiler and Runtime Techniques: Frameworks like TensorRT-LLM or vLLM apply graph optimizations, kernel fusion, and efficient KV cache memory management (e.g., PagedAttention in vLLM) to maximize TPS.
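As a rough illustration of why weight precision matters for memory-bound decoding, the sketch below computes a single-stream ceiling by assuming every weight is read once per generated token. It ignores KV cache reads, compute time, and dequantization overhead, and the 7B parameter count and 2 TB/s bandwidth are assumed figures.

```python
def decode_tps_upper_bound(num_params: float, bits_per_weight: int,
                           mem_bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-bound ceiling on single-stream decode speed (tokens/second)."""
    weight_bytes = num_params * bits_per_weight / 8
    return mem_bandwidth_bytes_per_s / weight_bytes


PARAMS = 7e9        # assumed dense model size
BANDWIDTH = 2e12    # assumed memory bandwidth in bytes/second

bounds = {bits: decode_tps_upper_bound(PARAMS, bits, BANDWIDTH)
          for bits in (16, 8, 4)}
for bits, tps in bounds.items():
    print(f"{bits:>2}-bit weights: at most ~{tps:.0f} tokens/s per stream")
```

Real systems land below these ceilings, but the ratio between precisions is a useful first estimate of what quantization can buy.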
System & Network Overhead
The surrounding serving infrastructure introduces latency that impacts effective TPS.
- Pre/Post-Processing: Tokenization, detokenization, and output formatting add fixed latency per request, reducing overall system TPS.
- Network Latency: In distributed systems, communication between orchestrators, model servers, and tokenizers adds delay.
- Queuing & Scheduling: Under high load, request queuing and scheduler overhead can become the bottleneck, capping realized TPS below the hardware's theoretical maximum.
Decoding Strategy
The algorithm used to generate tokens directly controls the number of serial inference steps required.
- Greedy Decoding: Produces one token per forward pass. It is the fastest (highest TPS) but can lead to repetitive or low-quality output.
- Sampling (Top-k, Top-p): Introduces randomness by sampling from the probability distribution, maintaining quality with a negligible TPS impact versus greedy decoding.
- Beam Search: Explores multiple sequence possibilities in parallel, requiring roughly k times more computation (where k is the beam width), drastically reducing TPS.
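The token-selection step for greedy, top-k, and top-p decoding can be sketched over a toy probability distribution. This is a minimal pure-Python illustration with hypothetical helper names; production engines perform the same selection on logits on the accelerator.

```python
import random


def greedy(probs):
    """Index of the single most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)


def top_k_candidates(probs, k):
    """Indices of the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return ranked[:k]


def top_p_candidates(probs, p):
    """Smallest set of top-ranked tokens whose cumulative mass reaches p."""
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    chosen, mass = [], 0.0
    for idx in ranked:
        chosen.append(idx)
        mass += probs[idx]
        if mass >= p:
            break
    return chosen


def sample(probs, candidates, rng):
    """Draw one candidate, weighted by its probability."""
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]   # toy next-token distribution
rng = random.Random(0)
print("greedy:", greedy(probs))
print("top-k (k=2) draw:", sample(probs, top_k_candidates(probs, 2), rng))
print("top-p (p=0.9) draw:", sample(probs, top_p_candidates(probs, 0.9), rng))
```

All three strategies need one forward pass per generated token and differ only in this cheap selection step, which is why sampling costs little TPS relative to greedy decoding. Beam search instead keeps k sequences alive at every step.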
Related Metric: Time to First Token (TTFT)
While TPS measures steady-state throughput, Time to First Token (TTFT) measures the latency to start streaming. They are often in tension.
- High TPS, High TTFT: Systems optimized for large batch processing may have high TPS but also high TTFT as they wait to fill a batch.
- Low TPS, Low TTFT: Systems prioritizing responsiveness may process requests immediately (low TTFT) but sacrifice aggregate TPS due to poor GPU utilization.
- Optimization Goal: The ideal system balances both, using techniques like continuous batching and prefill-decoding separation to minimize TTFT while maximizing TPS.
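Both metrics can be derived from the arrival times of streamed tokens. The sketch below assumes the client records a send timestamp and one timestamp per received token; the function name and the example timings are hypothetical.

```python
from typing import Dict, Sequence


def stream_metrics(request_sent: float,
                   token_times: Sequence[float]) -> Dict[str, float]:
    """Compute TTFT and steady-state decode TPS for one streamed response."""
    ttft = token_times[0] - request_sent
    decode_time = token_times[-1] - token_times[0]
    # Tokens after the first, divided by the time spent producing them.
    if decode_time > 0:
        decode_tps = (len(token_times) - 1) / decode_time
    else:
        decode_tps = float("nan")
    return {"ttft_s": ttft, "decode_tps": decode_tps}


# Hypothetical stream: first token after 400 ms, then one token every 50 ms.
metrics = stream_metrics(0.0, [0.40, 0.45, 0.50, 0.55, 0.60])
print(metrics)
```

Separating the two numbers this way keeps a slow prefill from being misread as slow generation, and vice versa.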
TPS vs. Latency Metrics: A Comparison
A comparison of throughput and latency metrics used to evaluate the performance of AI agents and language models, highlighting their distinct roles in benchmarking.
| Metric | Tokens Per Second (TPS) | End-to-End Latency | Time to First Token (TTFT) |
|---|---|---|---|
| Primary Measurement | Throughput (output rate) | Total request duration | Initial response delay |
| Key Performance Indicator For | Inference server efficiency, hardware utilization | Overall user experience, task completion time | Perceived responsiveness, streaming applications |
| Typical Unit | Tokens/second | Milliseconds (ms) or seconds (s) | Milliseconds (ms) |
| Impacted By | Batch size, model architecture, GPU memory bandwidth | Network latency, model compute time, external API calls, queuing | Model prefill computation, context length, cold starts |
| Relationship to Concurrency | Often increases with higher batch sizes up to a saturation point | Generally increases with higher concurrency due to queuing | Less directly affected by concurrency than total latency |
| Optimization Target | Maximize throughput for cost-efficient batch processing | Minimize latency for interactive, real-time applications | Minimize delay to start of stream for conversational agents |
| Use in SLOs/SLIs | For cost and capacity planning (e.g., min TPS under load) | For user experience guarantees (e.g., P99 latency < 2s) | For streaming quality (e.g., TTFT < 500ms) |
| Directly Measures | Raw computational speed of token generation | Holistic system performance from request to final output | Time to begin delivering the output stream |
Frequently Asked Questions
Essential questions and answers about Tokens Per Second (TPS), a core throughput metric for measuring the raw inference speed of language models and AI agents in production.
Tokens Per Second (TPS) is a throughput metric that quantifies the number of output tokens a language model or AI agent can generate per second, indicating its raw inference speed. It is measured by dividing the total number of tokens in a generated output sequence by the wall-clock time taken to produce that sequence, excluding initial prompt processing and network overhead. High-throughput inference engines use techniques like continuous batching to aggregate multiple requests, maximizing GPU utilization and TPS. This metric is distinct from end-user perceived latency (like Time to First Token) and is a critical benchmark for infrastructure cost and scalability.
Key Measurement Contexts:
- Peak TPS: Maximum throughput under optimal, saturated load.
- Sustained TPS: Average throughput over a prolonged period, accounting for system variability.
- Measurement Point: TPS is typically measured server-side on the inference hardware.
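One way to separate peak from sustained TPS is to work from a series of server-side token counts per fixed interval. The sketch below assumes such a series exists and treats "peak" as the best average over a short sliding window, which is one possible convention rather than a standard definition.

```python
def sustained_tps(tokens_per_interval, interval_s=1.0):
    """Average throughput over the whole observation period."""
    return sum(tokens_per_interval) / (len(tokens_per_interval) * interval_s)


def peak_tps(tokens_per_interval, interval_s=1.0, window=1):
    """Highest average throughput over any `window` consecutive intervals."""
    best = 0.0
    for start in range(len(tokens_per_interval) - window + 1):
        total = sum(tokens_per_interval[start:start + window])
        best = max(best, total / (window * interval_s))
    return best


# Hypothetical per-second output token counts from an inference server.
samples = [900, 1200, 1500, 1100, 300]
print("sustained:", sustained_tps(samples))
print("peak (1 s window):", peak_tps(samples))
print("peak (2 s window):", peak_tps(samples, window=2))
```

A wider window smooths out bursts, so the window length should be reported alongside any peak figure.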
Related Terms
Tokens Per Second (TPS) is a core throughput metric, but understanding agent performance requires a holistic view of complementary measurements for latency, cost, accuracy, and system health.
Time to First Token (TTFT)
Time to First Token measures the latency from sending a request to a generative model until the client receives the first output token. This is a critical user-perceived latency metric, distinct from overall throughput.
- Primary Driver: Initial model processing and prompt context loading.
- Impact: Directly affects user experience for interactive applications. High TTFT makes agents feel unresponsive.
- Relationship to TPS: A system can have high TPS (good steady-state throughput) but suffer from high TTFT if there are long initialization or queuing delays.
End-to-End Latency
End-to-End Latency is the total duration of a complete user interaction with an AI agent, from initial input to final, actionable output. It encompasses all processing stages.
- Components: Includes TTFT, generation time (inter-token latency, approximately 1/TPS for a single stream, accumulated over every output token), agent reasoning time, external tool call execution, and network overhead.
- Holistic View: The ultimate measure of agent responsiveness from a user's perspective.
- Optimization Focus: Requires profiling the entire pipeline, as improving just TPS may not significantly reduce end-to-end time if other stages (e.g., planning, API calls) are the bottleneck.
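The bottleneck argument can be checked with simple arithmetic. The sketch below assumes a single LLM call followed by tool execution, with tokens after the first arriving at a steady per-request TPS; all timings are hypothetical.

```python
def end_to_end_latency_s(ttft_s, output_tokens, tps,
                         tool_time_s=0.0, network_s=0.0):
    """Additive latency model for one request."""
    # Tokens after the first arrive one every 1/tps seconds.
    generation_s = (output_tokens - 1) / tps
    return ttft_s + generation_s + tool_time_s + network_s


# Hypothetical agent step: 201 output tokens plus a slow external API call.
baseline = end_to_end_latency_s(0.5, 201, tps=50,
                                tool_time_s=6.0, network_s=0.1)
doubled = end_to_end_latency_s(0.5, 201, tps=100,
                               tool_time_s=6.0, network_s=0.1)

print(f"baseline:    {baseline:.1f} s")
print(f"doubled TPS: {doubled:.1f} s")
print(f"reduction:   {100 * (1 - doubled / baseline):.0f}%")
```

Doubling TPS halves only the generation term, so the overall gain is much smaller when tool calls dominate the request.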
Throughput (Requests Per Second)
Throughput, often measured in Requests Per Second (RPS), is the rate at which a system completes entire requests. For AI agents, this is a higher-level metric than TPS.
- System-Level Metric: Measures completed agent tasks per second, where a task may involve multiple LLM calls and tool executions.
- Concurrency Dependency: Maximum RPS is achieved by processing many requests simultaneously via techniques like continuous batching.
- Trade-off with Latency: Increasing RPS (handling more concurrent requests) often increases per-request latency (P50, P99) due to resource contention.
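Aggregate TPS and task-level throughput are linked by the number of tokens each task consumes. The sketch below gives a generation-bound ceiling on completed tasks per second; it counts output tokens only and ignores prefill and tool time, and the workload figures are hypothetical.

```python
def max_tasks_per_second(aggregate_tps, llm_calls_per_task,
                         output_tokens_per_call):
    """Upper bound on completed agent tasks per second."""
    tokens_per_task = llm_calls_per_task * output_tokens_per_call
    return aggregate_tps / tokens_per_task


# Hypothetical workload: 4 LLM calls per task, 250 output tokens per call.
ceiling = max_tasks_per_second(2000, llm_calls_per_task=4,
                               output_tokens_per_call=250)
print(f"at most {ceiling:.1f} tasks/second")
```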
Cost Per Thousand Tokens
Cost Per Thousand Tokens (CPT) is a common pricing unit for cloud LLM inference, although many providers now quote prices per million tokens. Performance optimization must balance speed (TPS) with cost efficiency.
- Direct Link to TPS: Higher aggregate TPS on the same hardware lowers cost per token, because a fixed hourly cost is spread over more tokens. Optimizing for per-request TPS instead (e.g., smaller batches or over-provisioned hardware) can raise cost per token by lowering hardware utilization.
- Telemetry Integration: Agent cost telemetry pipelines track token consumption across sessions to attribute expenses and calculate return on investment.
- Engineering Decision: Choosing a model or optimization technique involves evaluating the TPS/CPT trade-off for a given latency SLO.
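For self-hosted inference, the relationship between throughput and cost can be sketched directly. The example assumes a fixed hourly hardware price and a sustained aggregate TPS; both numbers are hypothetical.

```python
def cost_per_thousand_tokens(hourly_hardware_cost, aggregate_tps):
    """Hardware cost attributed to each 1,000 generated tokens."""
    tokens_per_hour = aggregate_tps * 3600
    return hourly_hardware_cost / tokens_per_hour * 1000


# Hypothetical: a $4.00/hour accelerator at two sustained throughput levels.
cost_low = cost_per_thousand_tokens(4.00, aggregate_tps=1000)
cost_high = cost_per_thousand_tokens(4.00, aggregate_tps=2000)
print(f"1000 TPS: ${cost_low:.5f} per 1k tokens")
print(f"2000 TPS: ${cost_high:.5f} per 1k tokens")
```

Because hardware cost is fixed per hour, cost per token scales inversely with sustained aggregate TPS.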
Concurrency Level
Concurrency Level is the number of simultaneous requests or user sessions a system is processing. It is a key load parameter that directly impacts observed TPS and latency.
- Dynamic Scaling: Systems auto-scale to maintain TPS/RPS as concurrency increases, up to a saturation point.
- Performance Testing: Load tests ramp up concurrency to find the point where latency degrades or errors rise, defining system limits.
- Batching Efficiency: Higher concurrency enables more efficient continuous batching in the inference engine, improving aggregate TPS but potentially increasing individual request latency.
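Little's law (in-flight requests = arrival rate × average latency) gives a quick estimate of the concurrency a system must sustain. The sketch below also derives the per-request token rate under the simplifying assumption that all in-flight requests share aggregate TPS equally; the load figures are hypothetical.

```python
def required_concurrency(rps, avg_latency_s):
    """Little's law: average number of requests in flight."""
    return rps * avg_latency_s


def per_request_tps(aggregate_tps, concurrency):
    """Token rate each request sees if capacity is shared equally."""
    return aggregate_tps / concurrency


# Hypothetical load: 20 requests/second, 4 s average latency.
in_flight = required_concurrency(20, 4.0)
stream_rate = per_request_tps(2400, in_flight)
print(f"{in_flight:.0f} requests in flight")
print(f"{stream_rate:.0f} tokens/s per request")
```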
Service Level Objective (SLO)
A Service Level Objective is a target for a reliability or performance metric, such as "P99 latency < 2 seconds" or "TPS > 1000." SLOs define the expected experience for agent systems.
- SLO Composition: Often includes a latency target (e.g., end-to-end or TTFT) and a throughput target (TPS/RPS).
- Error Budget: The allowable violation of an SLO, used to govern release risk. A drop in TPS below target consumes the error budget.
- Observability Requirement: Meeting SLOs requires comprehensive agent telemetry pipelines to measure TPS, latency, and errors in real-time.
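An SLO check over collected telemetry can be sketched as follows. The example uses a nearest-rank percentile and a simple event-based error budget; the sample data and the 99% target are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the allowed bad events that has not been used yet."""
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


# Hypothetical latency samples in seconds, checked against "P99 < 2 s".
latencies = [0.8] * 98 + [1.9, 2.5]
p99 = percentile(latencies, 99)
print("P99 latency:", p99, "meets SLO:", p99 < 2.0)

# Hypothetical 99% target with 9,950 good requests out of 10,000.
budget = error_budget_remaining(0.99, good_events=9_950, total_events=10_000)
print(f"error budget remaining: {budget:.0%}")
```

The same budget calculation applies to a throughput SLO by counting the measurement intervals in which TPS fell below target as bad events.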

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.