Glossary

Tokens Per Second (TPS)

Tokens Per Second (TPS) is a throughput metric that quantifies the number of output tokens a language model or AI agent can generate per second, indicating raw inference speed and system capacity.

AGENT PERFORMANCE METRIC

What is Tokens Per Second (TPS)?

A core throughput metric for generative AI systems.

Tokens Per Second (TPS) is a throughput metric that quantifies the raw inference speed of a language model or AI agent by measuring the number of output tokens it can generate per second. It is a direct indicator of the computational efficiency of a model and its serving stack, and a key factor in determining the end-to-end latency and perceived responsiveness of an AI application. For agentic observability, TPS is a fundamental telemetry signal used to monitor system health and detect performance bottlenecks in production.

In practical terms, TPS is influenced by hardware (e.g., GPU type), model architecture, and inference optimization techniques like continuous batching. While high TPS indicates fast raw generation, it must be analyzed alongside Time to First Token (TTFT) and tail latency to fully understand user experience. For agent performance benchmarking, TPS is critical for capacity planning, cost per thousand tokens estimation, and ensuring systems meet defined Service Level Objectives (SLOs) for throughput.

SYSTEM ARCHITECTURE

Key Factors Influencing TPS

Tokens Per Second (TPS) is a critical throughput metric for AI inference. Its value is not intrinsic to a model but is determined by a complex interplay of hardware, software, and system design choices.

Model Architecture & Size

The fundamental design of the language model is the primary determinant of TPS.

Parameter Count: Larger models (e.g., 70B+ parameters) require more computations per token, reducing TPS compared to smaller models (e.g., 7B parameters) on identical hardware.
Architecture Family: Architectures like Mixture of Experts (MoE) can achieve higher TPS than dense models of equivalent total parameter count by activating only a subset of weights per token.
Context Window Length: Long sequences grow the Key-Value (KV) cache, which must be re-read at every decode step. The added memory traffic and capacity pressure can bottleneck TPS.
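
To make the context-length effect concrete, the sketch below estimates KV cache size from model shape. It uses the usual approximation of one key and one value vector per layer, per KV head, per token. The model dimensions are hypothetical and chosen only for illustration, and the function name is an assumption of this example.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes a 16-bit cache.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, for illustration only.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 2**30:.2f} GiB per sequence")  # 1.00 GiB per sequence
```

Because the size grows linearly with both sequence length and batch size, long contexts reduce how many requests fit in a batch, which in turn limits aggregate TPS.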

Hardware & Compute

The physical infrastructure executing the model defines the upper bound for TPS.

GPU/Accelerator Type: Modern AI accelerators (e.g., NVIDIA H100, Google TPU v5e) with specialized matrix-multiplication hardware (Tensor Cores on NVIDIA GPUs, matrix units on TPUs) provide vastly higher TPS than general-purpose CPUs.
Memory Bandwidth: The speed at which model weights can be read from GPU VRAM (High Bandwidth Memory) is often the limiting factor for TPS, a phenomenon known as being memory-bound.
Batching Strategy: Continuous batching dynamically groups requests, maximizing GPU utilization and significantly improving aggregate TPS over static batching.
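
The memory-bound limit can be approximated with back-of-the-envelope arithmetic. The sketch below assumes that generating one token for a single sequence requires reading every model weight once, and it ignores KV cache reads, compute time, and batching. The bandwidth figure is a hypothetical round number, not a specification of any particular accelerator.

```python
def memory_bound_tps(params_billion: float, bytes_per_param: float,
                     bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-sequence decode TPS for a dense model.

    Assumes each generated token requires one full read of the weights.
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return bandwidth_gb_s / weight_gb


# Hypothetical accelerator with 2000 GB/s of memory bandwidth, FP16 weights.
print(f"{memory_bound_tps(7, 2, 2000):.1f}")   # 142.9
print(f"{memory_bound_tps(70, 2, 2000):.1f}")  # 14.3
```

Batching raises aggregate TPS above this per-sequence bound because one read of the weights serves every sequence in the batch.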

Inference Optimization

Software-level optimizations are essential for achieving peak hardware performance.

Quantization: Reducing model weight precision from 16-bit (FP16) to 8-bit (INT8) or 4-bit (e.g., NF4) cuts the weight bytes read per token to roughly one half or one quarter. Because decoding is often memory-bound, this directly boosts TPS, usually with a small accuracy loss that should be verified per model and task.
Kernel Optimization: Highly optimized low-level compute kernels (e.g., FlashAttention) reduce the memory traffic and overhead of attention mechanisms.
Inference Engines and Compilers: Frameworks like TensorRT-LLM or vLLM apply optimizations such as graph optimization, kernel fusion, and efficient KV cache memory management (e.g., PagedAttention in vLLM) to maximize TPS.
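
As a rough illustration of the quantization point, the sketch below compares weight footprints at different precisions and the best-case bandwidth speedup relative to FP16. It is a simplification: it ignores the extra storage for quantization scales and the cost of dequantization, so real gains are typically smaller. The names are assumptions of this example.

```python
# Nominal bytes per parameter; real formats add small metadata overheads.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB for a given precision."""
    return params_billion * BYTES_PER_PARAM[precision]


def bandwidth_speedup(precision: str, baseline: str = "fp16") -> float:
    """Best-case decode speedup if generation is purely memory-bound."""
    return BYTES_PER_PARAM[baseline] / BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    print(precision, weight_gb(70, precision), bandwidth_speedup(precision))
```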

System & Network Overhead

The surrounding serving infrastructure introduces latency that impacts effective TPS.

Pre/Post-Processing: Tokenization, detokenization, and output formatting add fixed latency per request, reducing overall system TPS.
Network Latency: In distributed systems, communication between orchestrators, model servers, and tokenizers adds delay.
Queuing & Scheduling: Under high load, request queuing and scheduler overhead can become the bottleneck, capping realized TPS below the hardware's theoretical maximum.
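
One way to see how overhead lowers effective TPS is to fold the per-request overhead into the denominator. The sketch below treats tokenization, network, and queuing delay as a single lumped value, which is a simplifying assumption, and the function name is hypothetical.

```python
def effective_tps(output_tokens: int, decode_tps: float,
                  overhead_s: float) -> float:
    """Tokens per second as observed for a whole request.

    overhead_s lumps together pre/post-processing, network, and queuing.
    """
    generation_s = output_tokens / decode_tps
    return output_tokens / (overhead_s + generation_s)


# 200 tokens at a raw 100 TPS with 0.5 s of overhead.
print(effective_tps(200, 100.0, 0.5))  # 80.0
```

Short responses suffer most, because the overhead is amortized over fewer tokens.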

Decoding Strategy

The algorithm used to select each token determines how much computation every generated token requires.

Greedy Decoding: Produces one token per forward pass. It is the fastest (highest TPS) but can lead to repetitive or low-quality output.
Sampling (Top-k, Top-p): Introduces randomness by sampling from the probability distribution, maintaining quality with a negligible TPS impact versus greedy decoding.
Beam Search: Tracks k candidate sequences in parallel (where k is the beam width), requiring roughly k times more computation and KV cache memory per request, which substantially reduces TPS.
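
The following sketch shows why sampling costs little more than greedy decoding: both act on the logits from a single forward pass and differ only in the selection step. It is a minimal pure-Python illustration with hypothetical function names, not how an inference engine implements selection on the accelerator.

```python
import math
import random


def greedy(logits: list[float]) -> int:
    """Pick the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_sample(logits: list[float], k: int, rng: random.Random) -> int:
    """Sample among the k highest logits, weighted by softmax probability."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in top)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5]
print(greedy(logits))  # 1
print(top_k_sample(logits, k=2, rng=random.Random(0)))  # either 1 or 3
```

Beam search differs in kind: it keeps k partial sequences alive, so each step processes k candidates rather than one.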

Related Metric: Time to First Token (TTFT)

While TPS measures steady-state throughput, Time to First Token (TTFT) measures the latency to start streaming. They are often in tension.

High TPS, High TTFT: Systems optimized for large batch processing may have high TPS but also high TTFT as they wait to fill a batch.
Low TPS, Low TTFT: Systems prioritizing responsiveness may process requests immediately (low TTFT) but sacrifice aggregate TPS due to poor GPU utilization.
Optimization Goal: The ideal system balances both, using techniques like continuous batching and separating the prefill and decode phases to minimize TTFT while maximizing TPS.
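
Both metrics can be derived from the same streaming trace. The sketch below assumes the client records the time the request was sent and the arrival time of each output token, which is a hypothetical data layout. It computes TTFT and the steady-state decode rate separately.

```python
def ttft_and_decode_tps(request_sent_s: float,
                        token_times_s: list[float]) -> tuple[float, float]:
    """Return (TTFT in seconds, decode TPS) from token arrival timestamps."""
    if not token_times_s:
        raise ValueError("no tokens received")
    ttft = token_times_s[0] - request_sent_s
    # Decode rate counts only tokens after the first, over the time they took.
    decode_window = token_times_s[-1] - token_times_s[0]
    if decode_window > 0:
        decode_tps = (len(token_times_s) - 1) / decode_window
    else:
        decode_tps = float("nan")  # a single token has no measurable rate
    return ttft, decode_tps


ttft, tps = ttft_and_decode_tps(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft, tps)  # 0.5 4.0
```

Keeping the two numbers separate avoids hiding a slow start behind a fast steady-state rate, or the reverse.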

AGENT PERFORMANCE METRICS

TPS vs. Latency Metrics: A Comparison

A comparison of throughput and latency metrics used to evaluate the performance of AI agents and language models, highlighting their distinct roles in benchmarking.

Metric	Tokens Per Second (TPS)	End-to-End Latency	Time to First Token (TTFT)
Primary Measurement	Throughput (output rate)	Total request duration	Initial response delay
Key Performance Indicator For	Inference server efficiency, hardware utilization	Overall user experience, task completion time	Perceived responsiveness, streaming applications
Typical Unit	Tokens/second	Milliseconds (ms) or seconds (s)	Milliseconds (ms)
Impacted By	Batch size, model architecture, GPU memory bandwidth	Network latency, model compute time, external API calls, queuing	Model prefill computation, context length, cold starts
Relationship to Concurrency	Often increases with higher batch sizes up to a saturation point	Generally increases with higher concurrency due to queuing	Less directly affected by concurrency than total latency
Optimization Target	Maximize throughput for cost-efficient batch processing	Minimize latency for interactive, real-time applications	Minimize delay to start of stream for conversational agents
Use in SLOs/SLIs	For cost and capacity planning (e.g., min TPS under load)	For user experience guarantees (e.g., P99 latency < 2s)	For streaming quality (e.g., TTFT < 500ms)
Directly Measures	Raw computational speed of token generation	Holistic system performance from request to final output	Time to begin delivering the output stream

AGENT PERFORMANCE BENCHMARKING

Frequently Asked Questions

Essential questions and answers about Tokens Per Second (TPS), a core throughput metric for measuring the raw inference speed of language models and AI agents in production.

What is Tokens Per Second (TPS) and how is it measured?

Tokens Per Second (TPS) is a throughput metric that quantifies the number of output tokens a language model or AI agent can generate per second, indicating its raw inference speed. It is measured by dividing the total number of tokens in a generated output sequence by the wall-clock time taken to produce that sequence, excluding initial prompt processing and network overhead. High-throughput inference engines use techniques like continuous batching to aggregate multiple requests, maximizing GPU utilization and TPS. This metric is distinct from user-perceived latency metrics (such as Time to First Token) and is a critical benchmark for infrastructure cost and scalability.

Key Measurement Contexts:

Peak TPS: Maximum throughput under optimal, saturated load.
Sustained TPS: Average throughput over a prolonged period, accounting for system variability.
Measurement Point: TPS is typically measured server-side on the inference hardware.
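
The difference between peak and sustained TPS can be sketched from server-side counters. The example below assumes the server reports how many output tokens it generated in each fixed-length window. The sample values and the function name are hypothetical.

```python
def peak_and_sustained_tps(tokens_per_window: list[int],
                           window_s: float) -> tuple[float, float]:
    """Return (peak TPS, sustained TPS) from per-window token counts."""
    if not tokens_per_window:
        raise ValueError("no samples")
    peak = max(tokens_per_window) / window_s
    sustained = sum(tokens_per_window) / (len(tokens_per_window) * window_s)
    return peak, sustained


# Hypothetical one-second windows under varying load.
print(peak_and_sustained_tps([900, 1200, 600, 1100, 1000], window_s=1.0))
# (1200.0, 960.0)
```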

AGENT PERFORMANCE METRICS

Related Terms

Tokens Per Second (TPS) is a core throughput metric, but understanding agent performance requires a holistic view of complementary measurements for latency, cost, accuracy, and system health.

Time to First Token (TTFT)

Time to First Token measures the latency from sending a request to a generative model until the client receives the first output token. This is a critical user-perceived latency metric, distinct from overall throughput.

Primary Driver: Prompt processing (prefill), plus any queuing or cold-start delay before generation begins.
Impact: Directly affects user experience for interactive applications. High TTFT makes agents feel unresponsive.
Relationship to TPS: A system can have high TPS (good steady-state throughput) but suffer from high TTFT if there are long initialization or queuing delays.

End-to-End Latency

End-to-End Latency is the total duration of a complete user interaction with an AI agent, from initial input to final, actionable output. It encompasses all processing stages.

Components: Includes TTFT, token generation time (output tokens multiplied by the inter-token latency, which is roughly 1/TPS for a single request), agent reasoning time, external tool call execution, and network overhead.
Holistic View: The ultimate measure of agent responsiveness from a user's perspective.
Optimization Focus: Requires profiling the entire pipeline, as improving just TPS may not significantly reduce end-to-end time if other stages (e.g., planning, API calls) are the bottleneck.
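
A simple additive model makes the bottleneck argument concrete. The sketch below assumes a single LLM call followed by tool execution, with stages that do not overlap. Real agents often chain several calls, so this is an illustration rather than a complete model, and the parameter names are assumptions.

```python
def end_to_end_latency_s(ttft_s: float, output_tokens: int, decode_tps: float,
                         tool_time_s: float = 0.0,
                         network_s: float = 0.0) -> float:
    """Additive latency estimate for one LLM call plus tool execution."""
    # The first token is covered by TTFT; the rest arrive at the decode rate.
    generation_s = max(output_tokens - 1, 0) / decode_tps
    return ttft_s + generation_s + tool_time_s + network_s


base = end_to_end_latency_s(0.5, 101, 50.0, tool_time_s=3.0, network_s=0.25)
faster = end_to_end_latency_s(0.5, 101, 100.0, tool_time_s=3.0, network_s=0.25)
print(base, faster)  # 5.75 4.75
```

In this example, doubling TPS removes only one second of 5.75, because the tool call dominates the total.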

Throughput (Requests Per Second)

Throughput, often measured in Requests Per Second (RPS), is the rate at which a system completes entire requests. For AI agents, this is a higher-level metric than TPS.

System-Level Metric: Measures completed agent tasks per second, where a task may involve multiple LLM calls and tool executions.
Concurrency Dependency: Maximum RPS is achieved by processing many requests simultaneously via techniques like continuous batching.
Trade-off with Latency: Increasing RPS (handling more concurrent requests) often increases per-request latency (P50, P99) due to resource contention.
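
When token generation is the bottleneck, RPS and TPS are linked by the number of output tokens each task consumes. The sketch below assumes that condition holds and ignores prompt processing and tool time, so it gives an upper bound rather than a prediction. The function name is hypothetical.

```python
def max_rps(aggregate_tps: float, avg_output_tokens: float,
            llm_calls_per_task: float = 1.0) -> float:
    """Upper bound on completed tasks per second when decoding is the limit."""
    tokens_per_task = avg_output_tokens * llm_calls_per_task
    return aggregate_tps / tokens_per_task


# 2000 aggregate TPS, 250 output tokens per call, 4 LLM calls per agent task.
print(max_rps(2000.0, 250.0, llm_calls_per_task=4))  # 2.0
```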

Cost Per Thousand Tokens

Cost Per Thousand Tokens (CPT) is a common pricing unit for cloud LLM inference (some providers quote prices per million tokens instead). Performance optimization must balance speed (TPS) with cost efficiency.

Direct Link to TPS: On hardware billed by the hour, cost per token is roughly the hourly cost divided by the tokens generated per hour, so higher aggregate TPS lowers cost per token. Pushing batch sizes past the saturation point adds latency without further throughput gains and therefore no longer improves cost.
Telemetry Integration: Agent cost telemetry pipelines track token consumption across sessions to attribute expenses and calculate return on investment.
Engineering Decision: Choosing a model or optimization technique involves evaluating the TPS/CPT trade-off for a given latency SLO.
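
For self-hosted inference, the link between TPS and cost can be estimated directly. The sketch below assumes hardware billed at a flat hourly rate and running at a steady aggregate TPS. The price used is a hypothetical figure, not a quote from any provider.

```python
def cost_per_thousand_tokens(hourly_cost_usd: float,
                             aggregate_tps: float) -> float:
    """Estimated cost in USD per 1,000 output tokens at steady throughput."""
    tokens_per_hour = aggregate_tps * 3600
    return hourly_cost_usd / tokens_per_hour * 1000


# Hypothetical: a $3.60/hour instance sustaining 1000 aggregate TPS.
print(f"${cost_per_thousand_tokens(3.60, 1000.0):.4f} per 1K tokens")
# $0.0010 per 1K tokens
```

Evaluating this for each candidate configuration, alongside its latency, gives the TPS/CPT trade-off described above.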

Concurrency Level

Concurrency Level is the number of simultaneous requests or user sessions a system is processing. It is a key load parameter that directly impacts observed TPS and latency.

Dynamic Scaling: Systems auto-scale to maintain TPS/RPS as concurrency increases, up to a saturation point.
Performance Testing: Load tests ramp up concurrency to find the point where latency degrades or errors rise, defining system limits.
Batching Efficiency: Higher concurrency enables more efficient continuous batching in the inference engine, improving aggregate TPS but potentially increasing individual request latency.
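
Concurrency, throughput, and latency are tied together by Little's Law, which states that in steady state the average number of requests in flight equals the arrival rate multiplied by the average latency. The sketch below applies it and also shows how aggregate TPS is shared across concurrent requests. The even split across requests is a simplifying assumption, and the function names are hypothetical.

```python
def required_concurrency(target_rps: float, avg_latency_s: float) -> float:
    """Little's Law: average in-flight requests = arrival rate * latency."""
    return target_rps * avg_latency_s


def per_request_tps(aggregate_tps: float, concurrency: float) -> float:
    """Per-request token rate if throughput is shared evenly across requests."""
    return aggregate_tps / concurrency


in_flight = required_concurrency(target_rps=20.0, avg_latency_s=2.5)
print(in_flight)                           # 50.0
print(per_request_tps(2000.0, in_flight))  # 40.0
```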

Service Level Objective (SLO)

A Service Level Objective is a target for a reliability or performance metric, such as "P99 latency < 2 seconds" or "TPS > 1000." SLOs define the expected experience for agent systems.

SLO Composition: Often includes a latency target (e.g., end-to-end or TTFT) and a throughput target (TPS/RPS).
Error Budget: The allowable violation of an SLO, used to govern release risk. A drop in TPS below target consumes the error budget.
Observability Requirement: Meeting SLOs requires comprehensive agent telemetry pipelines to measure TPS, latency, and errors in real-time.
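
A minimal SLO check might look like the sketch below. It uses the nearest-rank method for percentiles and treats the error budget as the share of events allowed to miss the target. The sample data, the 99% target, and the function names are hypothetical.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_budget_remaining(slo_target: float, good_events: int,
                           total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


latencies_s = [0.5] * 98 + [1.5, 3.0]
p99 = percentile(latencies_s, 99)
print(p99, p99 < 2.0)  # 1.5 True

# 99% target, 10,000 requests, 50 of them missed the latency or TPS target.
print(round(error_budget_remaining(0.99, 9950, 10000), 3))  # 0.5
```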

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on turning complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
