Inferensys

Throughput

Throughput is the rate at which an AI agent or system successfully processes requests, typically measured in requests per second (RPS) or tokens per second (TPS).
AGENT PERFORMANCE METRIC

What is Throughput?

Throughput is a fundamental metric for quantifying the processing capacity of AI systems under load.

Throughput is the rate at which an AI system successfully processes requests, typically measured in requests per second (RPS) for agentic systems or tokens per second (TPS) for language model inference. It quantifies a system's capacity to handle concurrent load, directly impacting scalability and cost-efficiency. High throughput indicates an architecture capable of serving many users or tasks simultaneously without significant degradation in end-to-end latency.

In agentic observability, throughput is analyzed alongside latency and resource utilization to identify performance bottlenecks and define Service Level Objectives (SLOs). It is critical for capacity planning and load testing: once offered load exceeds a system's saturation point, latency spikes while throughput plateaus or falls. Optimizing throughput often involves techniques like continuous batching and efficient concurrency management.

AGENT PERFORMANCE BENCHMARKING

Key Throughput Metrics

Throughput quantifies the processing capacity of an AI system. These are the primary metrics used to measure and benchmark the rate of successful request handling.

01

Requests Per Second (RPS)

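Requests Per Second (RPS) is the foundational throughput metric, measuring the number of successful client requests an AI serving endpoint can process each second. It equals the inverse of average latency only when requests are handled strictly one at a time; with concurrent requests, Little's Law gives RPS ≈ (average requests in flight) / (average latency). High RPS indicates a system capable of handling significant user load.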

  • Calculation: RPS = (Total Successful Requests) / (Measurement Window in Seconds)
  • Key Consideration: RPS must be reported alongside latency percentiles (e.g., P95) to be meaningful, as a high RPS with poor tail latency indicates an unstable system.
  • Example: An agentic workflow endpoint sustaining 500 RPS with a P99 latency under 2 seconds demonstrates robust capacity for concurrent user interactions.
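The calculation and the pairing with tail latency can be sketched as follows. This is a minimal illustration, not a monitoring implementation: the `RequestRecord` type, the `summarize` helper, and the sample data are hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_s: float  # end-to-end latency in seconds
    ok: bool          # True if the request completed successfully


def percentile(values, pct):
    """Nearest-rank percentile (pct in 1..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records, window_s):
    """RPS counts only successful requests; P95 is taken over the same set."""
    ok_latencies = [r.latency_s for r in records if r.ok]
    return {
        "rps": len(ok_latencies) / window_s,
        "p95_latency_s": percentile(ok_latencies, 95) if ok_latencies else None,
    }


# Hypothetical sample: 20 successes (0.1 s .. 2.0 s) and one failure in a 10 s window.
records = [RequestRecord(0.1 * i, True) for i in range(1, 21)]
records.append(RequestRecord(5.0, False))
summary = summarize(records, window_s=10.0)
print(summary)
```

Failed requests are excluded from both figures here; a real dashboard would usually report the error rate next to them so that a high RPS cannot hide failures.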
02

Tokens Per Second (TPS)

Tokens Per Second (TPS) measures the raw text generation speed of a language model, critical for agent response times and streaming user experiences. It is a lower-level metric than RPS, focusing on the model's inference engine.

  • Components: TPS is often broken into prefill throughput (processing the input prompt) and decode throughput (generating the output tokens). Decode throughput is typically slower because output tokens are generated one at a time per request, whereas the prompt is processed in parallel.
  • Benchmarking: TPS is heavily dependent on model architecture, hardware (GPU/TPU), batch size, and sequence length. For example, a Llama 3 70B model might achieve 50 TPS on an H100 GPU with specific optimizations such as weight quantization.
  • Impact on Agents: Low prefill throughput increases an agent's Time to First Token (TTFT), while low decode throughput stretches the time between tokens. Both raise End-to-End Latency and slow down multi-turn reasoning loops.
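As a rough sketch of how the prefill and decode components can be separated for a single streamed request, the snippet below derives both rates from three timestamps. The `GenerationTiming` fields and the numbers are hypothetical, and the prefill figure is only approximate because measured TTFT also includes any queuing delay.

```python
from dataclasses import dataclass


@dataclass
class GenerationTiming:
    prompt_tokens: int
    output_tokens: int
    start_s: float        # request accepted
    first_token_s: float  # first output token emitted
    end_s: float          # last output token emitted


def token_throughput(t):
    """Split one request's timing into prefill and decode throughput."""
    ttft_s = t.first_token_s - t.start_s
    decode_s = t.end_s - t.first_token_s
    return {
        "ttft_s": ttft_s,
        # Approximation: treats the whole TTFT as prompt processing time.
        "prefill_tps": t.prompt_tokens / ttft_s,
        # The first output token arrives at TTFT, so decode covers the rest.
        "decode_tps": (t.output_tokens - 1) / decode_s,
        "end_to_end_s": t.end_s - t.start_s,
    }


# Hypothetical request: 1000 prompt tokens, 201 output tokens.
timing = GenerationTiming(
    prompt_tokens=1000, output_tokens=201,
    start_s=0.0, first_token_s=0.5, end_s=4.5,
)
stats = token_throughput(timing)
print(stats)
```

These are per-request rates. System-level TPS sums output tokens across all concurrent requests in the measurement window, so it can be much higher than any single stream's decode rate.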
03

Concurrent Sessions

Concurrent Sessions measures the number of simultaneous, stateful user interactions an agent system can maintain without degradation in per-session latency or success rate. It is a capacity metric for interactive applications.

  • Vs. RPS: While RPS measures request rate, Concurrent Sessions measures sustained stateful load. A chat agent may handle 1000 RPS but only support 100 concurrent sessions if each session involves long-running memory and context.
  • System Design Driver: This metric dictates requirements for context caching, session memory management, and connection pooling. Exceeding the supported concurrent sessions leads to context eviction errors or timeout failures.
  • Monitoring: Tracked alongside Session Duration and Agent State size to understand resource pressure.
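One way to derive this metric from logs is to sweep over session start and end times and track how many sessions overlap. The sketch below assumes each session is recorded as a hypothetical `(start_s, end_s)` pair; the sample data is invented for illustration.

```python
def peak_concurrent_sessions(sessions):
    """Return the maximum number of sessions open at the same time.

    sessions: iterable of (start_s, end_s) pairs.
    """
    events = []
    for start, end in sessions:
        events.append((start, 1))   # session opens
        events.append((end, -1))    # session closes
    # Sort by time; at equal timestamps, closes (-1) sort before opens (+1),
    # so a session ending at t does not overlap one starting at t.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def mean_session_duration(sessions):
    """Average session length in seconds."""
    return sum(end - start for start, end in sessions) / len(sessions)


# Hypothetical session log.
sessions = [(0, 10), (2, 5), (4, 12), (10, 15)]
peak = peak_concurrent_sessions(sessions)
mean_duration = mean_session_duration(sessions)
print(peak, mean_duration)
```

Comparing the observed peak against the supported limit, together with the mean duration, gives an early signal of the resource pressure described above.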
04

Tool Calls Per Second

Tool Calls Per Second quantifies the rate at which an AI agent successfully executes external API calls or function calls. This measures the integration throughput of the agent's action layer.

  • Bottleneck Identification: This metric often reveals bottlenecks outside the LLM itself, such as slow external APIs, database latency, or authentication overhead. A low rate here can throttle overall agent throughput regardless of high TPS.
  • Instrumentation: Requires detailed Tool Call Instrumentation to capture latency, success/failure status, and error types for each external dependency.
  • Example: An e-commerce agent might have a TPS of 100 but a Tool Calls Per Second of 5 due to a slow inventory API, making the external service the bottleneck that determines where the whole system saturates.
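The bottleneck reasoning in the example above can be made concrete with a small capacity calculation: each stage caps request throughput at its capacity divided by the work one request demands from it, and the smallest cap wins. The stage names and the per-request figures (10 tokens and 2 inventory calls per request) are assumptions chosen only for illustration.

```python
def max_request_rate(capacity_per_s, units_per_request):
    """Return (bottleneck stage, RPS ceiling, per-stage ceilings).

    capacity_per_s:    units each stage can process per second
    units_per_request: units of that stage one agent request consumes
    """
    limits = {
        stage: capacity_per_s[stage] / units_per_request[stage]
        for stage in capacity_per_s
        if units_per_request.get(stage, 0) > 0
    }
    bottleneck = min(limits, key=limits.get)
    return bottleneck, limits[bottleneck], limits


# Capacities follow the example (100 TPS, 5 tool calls/s); demands are hypothetical.
capacity = {"llm_tokens": 100.0, "inventory_api_calls": 5.0}
per_request = {"llm_tokens": 10.0, "inventory_api_calls": 2.0}

bottleneck, rps_cap, limits = max_request_rate(capacity, per_request)
print(bottleneck, rps_cap, limits)
```

Under these assumed demands the LLM could serve 10 requests per second, but the inventory API caps the agent at 2.5, so speeding up inference alone would not raise end-to-end throughput.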
05

Saturation & Degradation Curves

A Saturation Curve is a graph plotting throughput (RPS/TPS) against increasing load (Concurrency Level) to identify the point where performance degrades. It is essential for capacity planning.

  • Knee of the Curve: The point where latency begins to rise sharply while throughput plateaus, because additional requests queue instead of being served. Operating beyond this Saturation Point is unsustainable.
  • Degradation Signature: The curve shows how a system fails—gracefully (latency increases) or catastrophically (error rate spikes). Agentic systems with complex dependencies often fail catastrophically.
  • Use Case: Used to define Service Level Objectives (SLOs) and Error Budgets. For instance, an SLO may state the system must maintain P99 latency under 3s up to 80% of its saturation throughput.
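A simple way to locate the knee from load-test results is to look for the first step where added concurrency no longer buys a meaningful throughput gain. The sketch below uses a 5% relative-gain threshold, which is an arbitrary assumption, as are the field names and the sample measurements; it then checks the example SLO against the points at or below 80% of saturation throughput.

```python
def find_knee(points, min_gain=0.05):
    """Return the last point before throughput gains fall below min_gain.

    points: list of dicts sorted by increasing 'concurrency'.
    """
    for prev, nxt in zip(points, points[1:]):
        relative_gain = (nxt["rps"] - prev["rps"]) / prev["rps"]
        if relative_gain < min_gain:
            return prev
    return points[-1]


def slo_met_up_to(points, knee, p99_limit_s, fraction=0.8):
    """Check the P99 limit at every point up to fraction * saturation RPS."""
    budget_rps = fraction * knee["rps"]
    in_scope = [
        p for p in points
        if p["concurrency"] <= knee["concurrency"] and p["rps"] <= budget_rps
    ]
    return all(p["p99_s"] <= p99_limit_s for p in in_scope)


# Hypothetical load-test sweep.
points = [
    {"concurrency": 1, "rps": 10, "p99_s": 0.5},
    {"concurrency": 2, "rps": 20, "p99_s": 0.5},
    {"concurrency": 4, "rps": 39, "p99_s": 0.6},
    {"concurrency": 8, "rps": 70, "p99_s": 0.9},
    {"concurrency": 16, "rps": 100, "p99_s": 1.8},
    {"concurrency": 32, "rps": 102, "p99_s": 4.0},
    {"concurrency": 64, "rps": 95, "p99_s": 9.0},
]

knee = find_knee(points)
slo_ok = slo_met_up_to(points, knee, p99_limit_s=3.0)
print(knee, slo_ok)
```

In this invented sweep, throughput stalls after concurrency 16 while P99 latency more than doubles at the next step, which is the degradation signature the curve is meant to expose.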
06

Throughput vs. Latency Trade-off

The Throughput-Latency Trade-off is a fundamental engineering principle: increasing throughput (e.g., via batching) typically increases latency for individual requests, and vice-versa.

  • Batching: Processing multiple requests together improves GPU utilization and Tokens Per Second (TPS) but adds queuing delay, harming Time to First Token (TTFT) for individual users.
  • Optimization Strategies: Techniques like continuous batching (or iteration-level batching) improve this trade-off by admitting new requests into the running batch at each generation step, rather than waiting for every request in the batch to finish.
  • Agentic Impact: For interactive agents, low latency is often prioritized over maximum throughput. The optimal operating point is where latency SLOs are met while maximizing efficient resource use, avoiding over-provisioning.
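The trade-off can be illustrated with a deliberately simplified model of static batching (not continuous batching): assume each batch costs a fixed overhead plus a per-request cost, so larger batches raise throughput but also raise the time every request in the batch waits. The cost parameters `fixed_ms` and `per_item_ms` and the 80 ms SLO are invented for the sketch, and the model ignores the time spent waiting for a batch to fill.

```python
def batch_profile(batch_size, fixed_ms=40.0, per_item_ms=2.0):
    """Latency and throughput of one batch under a linear cost model."""
    step_ms = fixed_ms + per_item_ms * batch_size
    return {
        "batch_size": batch_size,
        "latency_ms": step_ms,
        "throughput_rps": batch_size / (step_ms / 1000.0),
    }


def best_batch_size(candidates, latency_slo_ms):
    """Pick the highest-throughput batch size that still meets the SLO."""
    profiles = [batch_profile(b) for b in candidates]
    feasible = [p for p in profiles if p["latency_ms"] <= latency_slo_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p["throughput_rps"])


for b in [1, 8, 16, 32]:
    print(batch_profile(b))

choice = best_batch_size([1, 2, 4, 8, 16, 32, 64], latency_slo_ms=80.0)
print(choice)
```

Under these assumptions a batch of 32 would deliver more throughput than 16, but it breaks the latency SLO, so the operating point settles at the largest batch that still meets it.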
AGENT PERFORMANCE

Factors Impacting Throughput

Throughput, the rate of successful request processing, is governed by a complex interplay of system architecture, resource constraints, and workload characteristics.

Throughput is primarily constrained by compute-bound operations like neural network inference and I/O-bound operations such as retrieving context from a vector database. Key hardware factors include GPU memory bandwidth, VRAM capacity for model weights, and CPU speed for pre/post-processing. Network latency and bandwidth between distributed microservices further limit the achievable requests per second (RPS).

Software architecture critically determines throughput efficiency. Techniques like continuous batching, which groups multiple requests for parallel execution, and optimized KV cache management dramatically improve tokens per second (TPS). The concurrency level of simultaneous requests must be balanced against system resources: too little leaves hardware idle, while too much creates queuing delays that inflate latency and, past saturation, reduce throughput. Finally, the complexity of the agent's reasoning loops and the frequency of external tool calls directly increase processing time per request.
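Balancing concurrency against resources can start from Little's Law, which relates average requests in flight to throughput and average latency. The sketch below is a back-of-the-envelope estimate only; the target rate, mean latency, and per-replica concurrency limit are hypothetical inputs.

```python
import math


def required_concurrency(target_rps, mean_latency_s):
    """Little's Law: average requests in flight = arrival rate * mean latency."""
    return target_rps * mean_latency_s


def replicas_needed(target_rps, mean_latency_s, max_concurrency_per_replica):
    """Replicas required so no replica exceeds its concurrency limit."""
    in_flight = required_concurrency(target_rps, mean_latency_s)
    return math.ceil(in_flight / max_concurrency_per_replica)


# Hypothetical target: 500 RPS at 1.5 s mean latency, 64 concurrent requests per replica.
in_flight = required_concurrency(500, 1.5)
replicas = replicas_needed(500, 1.5, 64)
print(in_flight, replicas)
```

Because each extra reasoning step or tool call lengthens mean latency, it also raises the number of requests in flight needed to sustain the same RPS, which is why per-request complexity shows up directly in capacity plans.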

THROUGHPUT

Frequently Asked Questions

Throughput is a foundational performance metric for AI systems, quantifying their capacity to handle work. These questions address its definition, measurement, optimization, and relationship to other critical observability signals.

Throughput is the rate at which an AI agent or system successfully processes and completes requests, measured over a specific time interval. It is the primary metric for quantifying a system's capacity and efficiency under load. For language models, throughput is often expressed in Tokens Per Second (TPS), indicating how many output tokens the model can generate across all concurrent requests. For agentic systems, it may be measured in Requests Per Second (RPS) or Tasks Per Second, encompassing the full agent lifecycle of planning, tool execution, and response generation. High throughput indicates a system can handle a larger volume of work, directly impacting scalability and cost-effectiveness. It is a key Service Level Indicator (SLI) for engineering leaders defining performance Service Level Objectives (SLOs).

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.