Throughput is the rate at which an AI system successfully processes requests, typically measured in requests per second (RPS) for agentic systems or tokens per second (TPS) for language model inference. It quantifies a system's capacity to handle concurrent load, directly impacting scalability and cost-efficiency. High throughput indicates an architecture capable of serving many users or tasks simultaneously without significant degradation in end-to-end latency.

What is Throughput?
Throughput is a fundamental metric for quantifying the processing capacity of AI systems under load.
In agentic observability, throughput is analyzed alongside latency and resource utilization to identify performance bottlenecks and define Service Level Objectives (SLOs). It is critical for capacity planning and load testing: once load exceeds a system's saturation point, latency spikes while throughput plateaus or falls. Optimizing throughput often involves techniques like continuous batching and efficient concurrency management.
Key Throughput Metrics
Throughput quantifies the processing capacity of an AI system. These are the primary metrics used to measure and benchmark the rate of successful request handling.
Requests Per Second (RPS)
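Requests Per Second (RPS) is the foundational throughput metric, measuring the number of successful client requests an AI serving endpoint can process each second. It equals the inverse of average latency only when requests are handled one at a time; with concurrent requests, throughput is approximately the number of in-flight requests divided by average latency (Little's Law). High RPS indicates a system capable of handling significant user load.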
- Calculation: RPS = (Total Successful Requests) / (Measurement Window in Seconds)
- Key Consideration: RPS must be reported alongside latency percentiles (e.g., P95) to be meaningful, as a high RPS with poor tail latency indicates an unstable system.
- Example: An agentic workflow endpoint sustaining 500 RPS with a P99 latency under 2 seconds demonstrates robust capacity for concurrent user interactions.
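The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a production collector: the `RequestRecord` layout, the helper names, and the sample data are all assumptions made for the example, and the percentile uses the simple nearest-rank method.

```python
from dataclasses import dataclass
import math


@dataclass
class RequestRecord:
    """One completed request (hypothetical record layout)."""
    latency_s: float  # end-to-end latency in seconds
    ok: bool          # True if the request succeeded


def requests_per_second(records, window_s):
    """RPS = successful requests / measurement window in seconds."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    successes = sum(1 for r in records if r.ok)
    return successes / window_s


def latency_percentile(records, pct):
    """Nearest-rank latency percentile over successful requests."""
    latencies = sorted(r.latency_s for r in records if r.ok)
    if not latencies:
        raise ValueError("no successful requests")
    rank = max(1, math.ceil(pct / 100 * len(latencies)))
    return latencies[rank - 1]


# Illustrative data: 100 requests in a 10-second window, the first 2 failed.
records = [
    RequestRecord(latency_s=0.1 * (i % 10 + 1), ok=(i >= 2))
    for i in range(100)
]
rps = requests_per_second(records, window_s=10.0)
p95 = latency_percentile(records, 95)
print(f"RPS={rps:.1f}  P95={p95:.2f}s")
```

Reporting both numbers from the same window keeps the throughput figure tied to the tail latency it was achieved at.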
Tokens Per Second (TPS)
Tokens Per Second (TPS) measures the raw text generation speed of a language model, critical for agent response times and streaming user experiences. It is a lower-level metric than RPS, focusing on the model's inference engine.
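- Components: TPS is often broken into prefill throughput (processing the input prompt) and decode throughput (generating the output tokens). Decode throughput is typically lower because output tokens are generated one at a time per sequence, whereas prompt tokens are processed in parallel.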
- Benchmarking: TPS is heavily dependent on model architecture, hardware (GPU/TPU), batch size, and sequence length. For example, a Llama 3 70B model might achieve 50 TPS on an H100 GPU with specific optimization.
- Impact on Agents: Low prefill throughput increases an agent's Time to First Token (TTFT), and low decode throughput increases End-to-End Latency; both slow down multi-turn reasoning loops.
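The components above can be separated for a single request from three timestamps. The sketch below is an approximation under stated assumptions: the function name and its parameters are hypothetical, all timestamps share one clock, and the measured TTFT may include queueing time, so the prefill figure is a lower bound on what the engine itself achieves.

```python
def token_throughput(prompt_tokens, output_tokens, t_start, t_first_token, t_end):
    """Split one request's timing into TTFT, prefill TPS, and decode TPS.

    All timestamps are in seconds on the same clock (hypothetical inputs).
    """
    ttft = t_first_token - t_start
    decode_time = t_end - t_first_token
    if ttft <= 0 or decode_time <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return {
        "ttft_s": ttft,
        # Prompt tokens processed per second before the first output token.
        "prefill_tps": prompt_tokens / ttft,
        # Output tokens after the first one, per second of decode time.
        "decode_tps": (output_tokens - 1) / decode_time,
        "end_to_end_s": t_end - t_start,
    }


# Illustrative numbers only: a 2000-token prompt and 201 output tokens.
stats = token_throughput(
    prompt_tokens=2000, output_tokens=201,
    t_start=0.0, t_first_token=0.5, t_end=4.5,
)
print(stats)
```

Tracking the two rates separately shows whether a slow response comes from a long prompt (prefill) or a long answer (decode).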
Concurrent Sessions
Concurrent Sessions measures the number of simultaneous, stateful user interactions an agent system can maintain without degradation in per-session latency or success rate. It is a capacity metric for interactive applications.
- Vs. RPS: While RPS measures request rate, Concurrent Sessions measures sustained stateful load. A chat agent may handle 1000 RPS but only support 100 concurrent sessions if each session involves long-running memory and context.
- System Design Driver: This metric dictates requirements for context caching, session memory management, and connection pooling. Exceeding the supported concurrent sessions leads to context eviction errors or timeout failures.
- Monitoring: Tracked alongside Session Duration and Agent State size to understand resource pressure.
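Peak concurrent sessions can be derived from session start and end times with a simple sweep over sorted events. This is a sketch that assumes session intervals are available as `(start_s, end_s)` pairs; the function name and sample data are illustrative.

```python
def peak_concurrent_sessions(sessions):
    """Return the maximum number of sessions open at the same time.

    `sessions` is an iterable of (start_s, end_s) pairs. A session that ends
    at time t is not counted as overlapping one that starts at t.
    """
    events = []
    for start, end in sessions:
        events.append((start, 1))   # session opens
        events.append((end, -1))    # session closes
    # Sort by time; at equal times, closes (-1) are processed before opens (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Illustrative session intervals in seconds.
sessions = [(0, 30), (10, 40), (20, 25), (30, 60), (45, 50)]
print(peak_concurrent_sessions(sessions))
```

Comparing this peak against the supported session limit gives an early signal before context eviction or timeouts appear.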
Tool Calls Per Second
Tool Calls Per Second quantifies the rate at which an AI agent successfully executes external API calls or function calls. This measures the integration throughput of the agent's action layer.
- Bottleneck Identification: This metric often reveals bottlenecks outside the LLM itself, such as slow external APIs, database latency, or authentication overhead. A low rate here can throttle overall agent throughput regardless of high TPS.
- Instrumentation: Requires detailed Tool Call Instrumentation to capture latency, success/failure status, and error types for each external dependency.
- Example: An e-commerce agent might have a TPS of 100 but a Tool Calls Per Second of 5 due to a slow inventory API, making the external service the system's bottleneck: it reaches its Saturation Point long before the model does.
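Because token rates and tool-call rates use different units, one way to compare them is to convert each stage into the request rate it can sustain. The sketch below assumes a per-request demand for each stage (50 output tokens and 4 inventory calls per request); those demand figures, the stage names, and the function name are hypothetical.

```python
def request_throughput_ceiling(stages):
    """Estimate the request-rate ceiling imposed by each stage.

    `stages` maps a stage name to (capacity_per_s, units_per_request).
    The stage with the lowest ceiling bounds the whole pipeline.
    """
    ceilings = {
        name: capacity / per_request
        for name, (capacity, per_request) in stages.items()
        if per_request > 0
    }
    bottleneck = min(ceilings, key=ceilings.get)
    return bottleneck, ceilings


stages = {
    "llm_decode": (100.0, 50.0),    # 100 tokens/s, assumed 50 tokens per request
    "inventory_api": (5.0, 4.0),    # 5 tool calls/s, assumed 4 calls per request
}
bottleneck, ceilings = request_throughput_ceiling(stages)
print(bottleneck, ceilings)
```

Under these assumptions the inventory API caps the agent at 1.25 requests per second, while the model alone could sustain 2.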
Saturation & Degradation Curves
A Saturation Curve is a graph plotting throughput (RPS/TPS) against increasing load (Concurrency Level) to identify the point where performance degrades. It is essential for capacity planning.
- Knee of the Curve: The point where latency begins to rise sharply while throughput plateaus. Operating beyond this Saturation Point is unsustainable.
- Degradation Signature: The curve shows how a system fails—gracefully (latency increases) or catastrophically (error rate spikes). Agentic systems with complex dependencies often fail catastrophically.
- Use Case: Used to define Service Level Objectives (SLOs) and Error Budgets. For instance, an SLO may state the system must maintain P99 latency under 3s up to 80% of its saturation throughput.
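A saturation curve from a load test can be scanned for its knee with a simple heuristic: find the first load level after which added concurrency yields almost no extra throughput. The sketch below assumes load-test results as `(concurrency, throughput_rps, p99_latency_s)` tuples; the 5% gain threshold, the function names, and the sample data are illustrative assumptions, not a standard definition.

```python
def find_knee(points, min_gain=0.05):
    """Return the first point whose successor adds < min_gain relative throughput.

    `points` is a list of (concurrency, throughput_rps, p99_latency_s),
    sorted by concurrency, with positive throughput values.
    """
    for prev, nxt in zip(points, points[1:]):
        gain = (nxt[1] - prev[1]) / prev[1]
        if gain < min_gain:
            return prev
    return points[-1]


def meets_slo(points, knee, p99_limit_s=3.0, fraction=0.8):
    """Check P99 stays under the limit up to `fraction` of knee throughput."""
    budget = fraction * knee[1]
    return all(
        p99 < p99_limit_s
        for conc, rps, p99 in points
        if conc <= knee[0] and rps <= budget
    )


# Illustrative load-test results.
load_test = [
    (1, 10.0, 0.4),
    (2, 20.0, 0.4),
    (4, 39.0, 0.5),
    (8, 70.0, 0.8),
    (16, 100.0, 1.6),
    (32, 102.0, 4.9),
    (64, 95.0, 12.0),
]
knee = find_knee(load_test)
print("knee:", knee, "SLO met:", meets_slo(load_test, knee))
```

In this sample the knee sits at a concurrency of 16; beyond it, throughput is flat or falling while P99 latency climbs steeply.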
Throughput vs. Latency Trade-off
The Throughput-Latency Trade-off is a fundamental engineering principle: increasing throughput (e.g., via batching) typically increases latency for individual requests, and vice-versa.
- Batching: Processing multiple requests together improves GPU utilization and Tokens Per Second (TPS) but adds queuing delay, harming Time to First Token (TTFT) for individual users.
- Optimization Strategies: Techniques like continuous batching (or iteration-level batching) aim to optimize this trade-off by dynamically grouping requests.
- Agentic Impact: For interactive agents, low latency is often prioritized over maximum throughput. The optimal operating point is where latency SLOs are met while maximizing efficient resource use, avoiding over-provisioning.
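The trade-off can be made concrete with a toy cost model in which one batch takes a fixed overhead plus a per-item cost. The model, its constants (40 ms fixed, 2 ms per item), and the function names are assumptions for illustration; it describes static batching without queueing and does not capture continuous batching.

```python
def batch_stats(batch_size, fixed_s=0.040, per_item_s=0.002):
    """Toy cost model: one batch takes fixed_s + batch_size * per_item_s.

    Returns (latency_s, items_per_s) for a request served in that batch.
    """
    batch_time = fixed_s + batch_size * per_item_s
    return batch_time, batch_size / batch_time


def largest_batch_within(latency_limit_s, candidates=(1, 2, 4, 8, 16, 32, 64)):
    """Pick the largest candidate batch size whose latency fits the limit."""
    ok = [b for b in candidates if batch_stats(b)[0] <= latency_limit_s]
    return max(ok) if ok else None


for b in (1, 8, 64):
    latency, throughput = batch_stats(b)
    print(f"batch={b:2d}  latency={latency * 1000:.0f} ms  throughput={throughput:.1f}/s")

print("largest batch under 100 ms:", largest_batch_within(0.100))
```

Larger batches raise both throughput and per-request latency, so the operating point is the largest batch that still fits the latency SLO.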
Factors Impacting Throughput
Throughput, the rate of successful request processing, is governed by a complex interplay of system architecture, resource constraints, and workload characteristics.
Throughput is primarily constrained by compute-bound operations like neural network inference and I/O-bound operations such as retrieving context from a vector database. Key hardware factors include GPU memory bandwidth, VRAM capacity for model weights and the KV cache, and CPU speed for pre/post-processing. Network latency and bandwidth between distributed microservices further limit the achievable requests per second (RPS).
Software architecture critically determines throughput efficiency. Techniques like continuous batching, which groups multiple requests for parallel execution, and optimized KV cache management dramatically improve tokens per second (TPS). The concurrency level of simultaneous requests must be balanced against system resources to avoid queuing delays that degrade throughput. Finally, the complexity of the agent's reasoning loops and the frequency of external tool calls directly increase processing time per request.
Frequently Asked Questions
Throughput is a foundational performance metric for AI systems, quantifying their capacity to handle work. These questions address its definition, measurement, optimization, and relationship to other critical observability signals.
Throughput is the rate at which an AI agent or system successfully processes and completes requests, measured over a specific time interval. It is the primary metric for quantifying a system's capacity and efficiency under load. For language models, throughput is often expressed in Tokens Per Second (TPS), indicating how many output tokens the model can generate across all concurrent requests. For agentic systems, it may be measured in Requests Per Second (RPS) or Tasks Per Second, encompassing the full agent lifecycle of planning, tool execution, and response generation. High throughput indicates a system can handle a larger volume of work, directly impacting scalability and cost-effectiveness. It is a key Service Level Indicator (SLI) for engineering leaders defining performance Service Level Objectives (SLOs).
Related Terms
Throughput is a core system performance metric, but it must be evaluated alongside other critical measurements to fully understand an agent's operational health and efficiency.
Latency
Latency is the total time delay between the initiation of a request to an AI agent and the completion of its response. It is the complementary perspective to throughput: while throughput measures volume over time, latency measures the time per individual request.
- Components: Includes processing time (inference), network transmission, and queuing delays.
- Trade-off: Often exists in tension with throughput; optimizing for one can negatively impact the other.
- Critical for: User experience, real-time applications, and interactive agent sessions.
Tokens Per Second (TPS)
Tokens Per Second is a granular throughput metric specific to language models, quantifying the number of output tokens generated per second. It is a key driver of overall agent request throughput for text-based tasks.
- Direct Measurement: Indicates the raw inference speed of the underlying language model.
- Factors: Heavily influenced by model architecture, hardware (GPU/TPU), and optimization techniques like continuous batching.
- Usage: Used for capacity planning and cost estimation of LLM-powered agents.
Concurrency Level
Concurrency Level is the number of simultaneous requests or user sessions an AI serving system is actively processing at a given moment. It is a primary determinant of achievable throughput.
- System Capacity: Defines the maximum parallel workload the system can handle.
- Load Testing: Increased concurrency is used in stress tests to find the system's saturation point.
- Architecture Impact: Efficiently handling high concurrency requires techniques like dynamic batching and non-blocking I/O.
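The link between concurrency, latency, and throughput is captured by Little's Law: in a stable system, average in-flight requests equal throughput multiplied by average latency. The sketch below applies it in both directions; the function names and numbers are illustrative, and the relation holds for long-run averages rather than for percentiles.

```python
def throughput_from_concurrency(concurrency, avg_latency_s):
    """Little's Law: throughput = in-flight requests / average latency."""
    if avg_latency_s <= 0:
        raise ValueError("avg_latency_s must be positive")
    return concurrency / avg_latency_s


def required_concurrency(target_rps, avg_latency_s):
    """In-flight requests needed to sustain target_rps at a given latency."""
    return target_rps * avg_latency_s


# 40 requests in flight at 0.5 s average latency.
print(throughput_from_concurrency(40, 0.5))
# Sustaining 200 RPS at 1.5 s average latency.
print(required_concurrency(200, 1.5))
```

This is a quick sizing check for load tests: if the computed concurrency exceeds what the serving stack can hold in flight, the target rate is not reachable at that latency.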
Saturation Point
The Saturation Point is the level of concurrent load at which an AI system's performance begins to degrade non-linearly. It represents the practical limit of throughput before quality of service collapses.
- Identification: Marked by a sharp increase in latency (e.g., tail latency P99) and/or error rates.
- Operational Guardrail: Defines the safe operating zone for production traffic.
- Bottleneck Revelation: Reaching saturation exposes the system's limiting resource, such as GPU memory or database connections.
Resource Utilization
Resource Utilization measures the percentage of available system resources—such as GPU, CPU, or memory—consumed by an AI workload. High throughput must be analyzed alongside utilization to gauge efficiency.
- Efficiency Metric: High throughput at moderate resource utilization indicates an efficient system with headroom to scale, whereas low throughput at high utilization points to wasted capacity.
- Bottleneck Identification: A resource at 100% utilization is often the system's performance bottleneck limiting further throughput gains.
- Cost Correlation: Directly ties to the infrastructure cost component of the Total Cost of Ownership (TCO).
Service Level Objective (SLO)
A Service Level Objective is a target value or range for a Service Level Indicator (SLI), such as throughput or latency, that defines the expected reliability of an AI system. Throughput SLOs ensure capacity meets business demand.
- Example: "The agent service shall sustain a throughput of at least 50 RPS while meeting its P99 latency target."
- Error Budget: The allowable deviation from the SLO, used to manage risk and schedule improvements.
- Engineering Driver: SLOs for throughput directly inform autoscaling policies and capacity procurement.
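One way to evaluate a throughput SLO is to measure RPS over fixed windows and require a target fraction of those windows to reach the target rate; the error budget is then the number of windows allowed to miss. The sketch below takes that reading as an assumption, and the function name, the 99% window target, and the sample data are illustrative.

```python
def error_budget_report(window_rps, target_rps=50.0, slo_target=0.99):
    """Summarize a throughput SLO over fixed measurement windows.

    `window_rps` holds the measured RPS for each window (e.g., one per minute).
    `slo_target` is the fraction of windows that must reach `target_rps`.
    """
    total = len(window_rps)
    if total == 0:
        raise ValueError("no measurement windows")
    bad = sum(1 for rps in window_rps if rps < target_rps)
    allowed_bad = (1 - slo_target) * total
    return {
        "compliance": (total - bad) / total,
        "bad_windows": bad,
        "allowed_bad_windows": allowed_bad,
        "budget_remaining": allowed_bad - bad,
    }


# Illustrative data: 1000 windows, 5 of which fell below the 50 RPS target.
windows = [55.0] * 995 + [42.0] * 5
print(error_budget_report(windows))
```

A shrinking `budget_remaining` is the kind of signal that can feed autoscaling or capacity decisions before the SLO is actually breached.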

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.