Glossary

Throughput-Latency Curve

A throughput-latency curve is a graph that plots the relationship between a system's request throughput (e.g., queries per second) and its corresponding average or tail latency, used to identify the optimal operating point before performance degradation.


PERFORMANCE ANALYSIS

What is a Throughput-Latency Curve?

A throughput-latency curve is a fundamental graph used in system performance analysis to visualize the trade-off between a service's capacity and its responsiveness under load.

A throughput-latency curve is a graph that plots the relationship between a system's achieved throughput (e.g., queries per second) and its corresponding average or tail latency, revealing the performance envelope and optimal operating point before severe degradation. This curve is generated by applying a steadily increasing load to a system, such as a model inference server, and measuring the resulting latency at each throughput level. The characteristic shape typically shows latency remaining low and stable until a saturation point, after which queuing effects dominate and latency rises steeply and non-linearly while throughput plateaus.

For AI inference services, this curve is critical for capacity planning and setting Service Level Objectives (SLOs). It identifies the maximum sustainable throughput before violating latency targets (e.g., P99 < 200ms). The knee of the curve represents the optimal operating region, balancing resource utilization with acceptable responsiveness. Performance regressions, optimization gains, or the impact of techniques like continuous batching and model quantization are quantitatively validated by shifts in this curve, making it an essential tool for latency benchmarking and infrastructure engineering.
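The shape described above can be reproduced with a toy model. The following is a minimal sketch, not a benchmark of any real server: it assumes a single FIFO server with Poisson arrivals and exponentially distributed service times, and the names simulate_load_step and sweep, the service rate of 100 requests per second, and the load fractions are all illustrative choices.

```python
import math
import random


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def simulate_load_step(arrival_rate, service_rate, n_requests=20000, seed=0):
    """Simulate one load level on a single FIFO server (assumed model).

    Returns (achieved_throughput, mean_latency_s, p99_latency_s).
    """
    rng = random.Random(seed)
    arrival_time = 0.0     # arrival time of the current request
    server_free_at = 0.0   # time at which the server finishes queued work
    latencies = []
    for _ in range(n_requests):
        arrival_time += rng.expovariate(arrival_rate)
        start = max(arrival_time, server_free_at)   # wait if the server is busy
        server_free_at = start + rng.expovariate(service_rate)
        latencies.append(server_free_at - arrival_time)
    latencies.sort()
    throughput = n_requests / server_free_at
    mean_latency = sum(latencies) / n_requests
    return throughput, mean_latency, percentile(latencies, 99)


def sweep(service_rate=100.0, load_fractions=(0.1, 0.3, 0.5, 0.7, 0.9, 0.95)):
    """Step the offered load upward and record one curve point per step."""
    return [
        (frac,) + simulate_load_step(frac * service_rate, service_rate)
        for frac in load_fractions
    ]


if __name__ == "__main__":
    for frac, qps, mean_s, p99_s in sweep():
        print(f"load={frac:.2f}  qps={qps:6.1f}  "
              f"mean={mean_s * 1000:7.1f} ms  p99={p99_s * 1000:7.1f} ms")
```

Plotting qps against mean or P99 latency from these rows gives the familiar flat-then-steep curve. A real inference server batches and runs requests concurrently, so its curve will differ in detail.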

PERFORMANCE ANALYSIS

Key Characteristics of the Curve

A throughput-latency curve is a fundamental tool for characterizing the performance envelope of an inference serving system. It reveals the trade-off between processing capacity and responsiveness under load.

The Knee Point

The knee point (or elbow) of the curve is the critical operational region where latency begins to increase non-linearly with added throughput. It represents the maximum sustainable throughput before queuing delays dominate.

Definition: The point of maximum curvature, where the slope of the latency curve begins to increase sharply.
Operational Significance: Running near, but not beyond, the knee point maximizes resource utilization while maintaining acceptable latency SLOs.
Identification: Found by incrementally increasing request load (QPS) and observing where average or P95 latency deviates from a near-constant baseline.
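The identification step can be sketched as a simple threshold check. This is one heuristic among several, not a standard definition: the function name find_knee, the 25% tolerance, the three-point baseline, and the sample measurements are all assumptions made for illustration.

```python
def find_knee(points, tolerance=0.25, baseline_points=3):
    """Return the last (qps, latency) point before latency leaves the baseline.

    points must be sorted by increasing qps. The baseline is the mean latency
    of the first few (lightly loaded) points; the knee is taken to be the last
    point whose latency stays within baseline * (1 + tolerance).
    """
    baseline = sum(lat for _, lat in points[:baseline_points]) / baseline_points
    threshold = baseline * (1 + tolerance)
    knee = points[0]
    for qps, latency in points:
        if latency > threshold:
            break
        knee = (qps, latency)
    return knee


# Hypothetical load-test results: (QPS, P95 latency in ms).
measurements = [(10, 50), (20, 51), (30, 52), (40, 54),
                (50, 58), (60, 70), (70, 110), (80, 260)]
print(find_knee(measurements))  # -> (50, 58)
```

With these sample numbers the baseline is 51 ms, the threshold is 63.75 ms, and the last point under it is 50 QPS. A tighter or looser tolerance moves the reported knee, so the value should be chosen with the latency SLO in mind.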

Saturation & Degradation

Beyond the knee point, the system approaches saturation. Latency climbs steeply as the request arrival rate nears the system's maximum service rate, and once arrivals exceed that rate the queue grows without bound.

Queuing Theory: Modeled by M/M/1 or M/G/1 queues; latency asymptotically approaches infinity as throughput nears the theoretical maximum.
Symptoms: Rapid growth in P99 tail latency, increased error rates, and potential system instability.
Cause: Resource exhaustion (GPU, CPU, memory bandwidth) or scheduler limits, where request queuing delay becomes the primary latency component.
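For the M/M/1 case mentioned above, the mean time a request spends in the system has the closed form 1 / (service_rate - arrival_rate). The sketch below evaluates it at a few utilization levels; the service rate of 100 requests per second is an assumed figure, and real inference servers only approximate this model.

```python
def mm1_mean_latency(arrival_rate, service_rate):
    """Mean time in system (queueing + service) for an M/M/1 queue, in seconds.

    Returns infinity when the arrival rate reaches or exceeds the service
    rate, because the queue then grows without bound.
    """
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


SERVICE_RATE = 100.0  # assumed capacity in requests per second

for utilization in (0.5, 0.9, 0.99):
    latency_s = mm1_mean_latency(utilization * SERVICE_RATE, SERVICE_RATE)
    print(f"utilization={utilization:.2f}  mean latency={latency_s * 1000:.0f} ms")
```

Under these assumptions mean latency is 20 ms at 50% utilization, 100 ms at 90%, and 1000 ms at 99%. The growth is hyperbolic rather than exponential, but it is steep enough near capacity to produce the sharp upturn seen on measured curves.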

Optimal Operating Region

The optimal operating region is the plateau to the left of the knee, where latency is stable and predictable for a given range of throughput. This is the target for production SLOs.

Characteristics: Latency is minimally affected by small load fluctuations. Throughput scales nearly linearly with added concurrent requests.
Engineering Goal: Size autoscaling policies and load balancers to keep the system within this region under expected traffic patterns.
Determination: Established via load testing to map the curve, then setting throughput limits at 70-90% of the knee point value for a safety margin.

Impact of Batching

Batching strategies dramatically reshape the throughput-latency curve. Static batching creates a step function, while continuous batching (e.g., in vLLM) produces a smoother, more efficient curve.

Static Batching: A fixed batch size trades higher latency for higher throughput; the curve shows discrete performance points for each batch size.
Continuous/Dynamic Batching: New requests are added to running batches as others finish. This maximizes GPU utilization, pushing the knee point to a higher throughput for the same latency, creating a superior Pareto frontier.
Visual Cue: A family of curves, each for a different batching configuration, is used to select the optimal strategy.
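The discrete points produced by static batching can be illustrated with a toy cost model. This sketch assumes batch latency grows linearly with batch size (a fixed overhead plus a per-item cost) and ignores the time spent waiting for a batch to fill; the function name and the 20 ms and 2 ms constants are hypothetical, not measurements.

```python
def static_batch_point(batch_size, fixed_ms=20.0, per_item_ms=2.0):
    """Return (throughput_qps, latency_ms) for one static batch size.

    Assumed model: every request in the batch completes when the whole batch
    does, and batch latency = fixed_ms + per_item_ms * batch_size.
    """
    latency_ms = fixed_ms + per_item_ms * batch_size
    throughput_qps = batch_size / (latency_ms / 1000.0)
    return throughput_qps, latency_ms


for batch_size in (1, 2, 4, 8, 16, 32):
    qps, latency_ms = static_batch_point(batch_size)
    print(f"batch={batch_size:2d}  throughput={qps:6.1f} qps  latency={latency_ms:5.1f} ms")
```

Each batch size yields one point: larger batches raise throughput but also raise the latency every request in the batch pays, which is the trade-off the family of curves makes visible.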

Inference Configuration Effects

Model-serving parameters and hardware choices translate directly to shifts in the curve's position. Optimizations move the entire curve favorably.

Quantization (FP16/INT8): Reduces compute and memory bandwidth needs, often shifting the curve down along the latency axis (faster responses) and sometimes moving the knee point to a higher throughput.
KV Cache Optimization (PagedAttention): Reduces memory fragmentation, allowing larger effective batch sizes, which shifts the knee point to the right (higher throughput).
Compiler Optimizations (TensorRT): Operator fusion and kernel tuning reduce GPU kernel launch overhead, improving both latency and throughput and shifting the entire curve down and to the right.
Speculative Decoding: Reduces time per output token (TPOT), flattening the latency curve, especially for longer generations.

Practical Measurement & Use

Generating an accurate curve requires controlled load testing that mimics production traffic patterns, including variable input lengths and request distributions.

Tooling: Use load-testing frameworks (e.g., Locust, Vegeta) or specialized ML serving benchmarks (e.g., MLPerf Inference).
Procedure: Ramp up request rate (QPS) in steps, measuring average and percentile latencies (P50, P95, P99) at each step until saturation is clear.
Analysis for SLOs: The curve directly informs Service Level Objective (SLO) definition. For example, an SLO of 'P99 latency < 1s' corresponds to a maximum allowable throughput read directly from the curve.
Capacity Planning: The curve dictates the number of instances needed to handle a target peak QPS while remaining in the optimal region.
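Reading the maximum allowable throughput for an SLO off a measured curve can be sketched as a linear interpolation between the two load steps that bracket the latency target. The function name and the sample points below are hypothetical; interpolation is only an estimate, and a finer load step near the crossing gives a more trustworthy number.

```python
def max_qps_under_slo(points, slo_latency_ms):
    """Estimate the highest QPS whose latency stays within the SLO.

    points: (qps, latency_ms) pairs sorted by increasing qps, e.g. the P99
    latency measured at each load step. Interpolates linearly between the
    last compliant step and the first violating step.
    """
    if points[0][1] > slo_latency_ms:
        return 0.0  # the SLO is violated even at the lowest tested load
    for (q0, l0), (q1, l1) in zip(points, points[1:]):
        if l1 > slo_latency_ms:
            return q0 + (slo_latency_ms - l0) * (q1 - q0) / (l1 - l0)
    return float(points[-1][0])  # SLO never violated within the tested range


# Hypothetical curve: (QPS, P99 latency in ms).
curve = [(100, 200), (200, 250), (300, 400), (400, 1600)]
print(max_qps_under_slo(curve, slo_latency_ms=1000))  # -> 350.0
```

For the SLO of 'P99 latency < 1s' used as the example above, this sample curve crosses the target between 300 and 400 QPS, giving an estimated limit of 350 QPS per instance.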

OPERATIONAL REGIMES

Regions of a Throughput-Latency Curve

This table defines the distinct performance regimes observed when plotting a system's throughput against its latency, used to identify the optimal operating point and predict degradation.

Region	Throughput	Latency Behavior	System State	Primary Bottleneck
Underloaded	Low to Moderate	Constant & Low	Idle resources available; requests processed immediately.	None (Client-limited)
Elastic	Increasing Linearly	Stable, Slight Increase	Resources increasingly utilized; short queues begin to form.	GPU Compute / Batch Processing
Knee Point (Optimal)	Peak Sustainable	Latency starts rising non-linearly	Maximum efficient throughput before severe queuing.	GPU Memory Bandwidth / Scheduler
Saturated	Plateaus	Rapidly Increasing	Queue growing faster than processing rate; latency highly variable.	Request Queuing Delay
Overloaded	Degrades	Unbounded Increase	System unstable; requests may time out or fail.	CPU Scheduler / Memory Thrashing
Characteristic Metric	Queries Per Second (QPS)	P50, P95, P99 Latency	Concurrent Requests in Flight	GPU Utilization, Queue Depth

LATENCY BENCHMARKING

Practical Applications in AI Systems

The throughput-latency curve is a fundamental tool for capacity planning and performance optimization in production AI systems. The sections below detail its critical applications for infrastructure engineers and CTOs.

Capacity Planning & SLO Definition

The curve provides the empirical data needed to define Service Level Objectives (SLOs) for latency and throughput. Engineers use it to answer:

What is the maximum sustainable QPS while keeping P99 latency under 200ms?
How many replicas are required to handle a forecasted peak load of X requests per second?
Where is the 'knee' of the curve beyond which latency rises steeply? This identifies the safe operating limit before provisioning additional resources.
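The replica question above reduces to simple arithmetic once a per-replica knee has been measured. This sketch assumes throughput scales linearly across replicas and that load is balanced evenly, which real deployments only approximate; the function name, the 80% headroom factor, and the traffic figures are illustrative.

```python
import math


def replicas_needed(peak_qps, knee_qps_per_replica, headroom=0.8):
    """Replicas required to serve peak_qps while staying below the knee.

    headroom is the fraction of the knee throughput each replica is allowed
    to carry, leaving a safety margin for traffic fluctuations.
    """
    safe_qps_per_replica = knee_qps_per_replica * headroom
    return math.ceil(peak_qps / safe_qps_per_replica)


# Hypothetical inputs: 2000 QPS forecast peak, knee measured at 350 QPS.
print(replicas_needed(peak_qps=2000, knee_qps_per_replica=350))  # -> 8
```

Each replica is budgeted 280 QPS here, so 2000 QPS needs 8 replicas; running at the full knee value would need only 6 but would leave no margin.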

Cost-Performance Optimization

This application directly addresses the CTO's mandate for infrastructure cost control. The curve reveals the optimal operating point that balances throughput (cost efficiency) with acceptable latency (user experience).

Identify diminishing returns: Adding more concurrent requests may yield minimal throughput gains while drastically increasing tail latency.
Right-size instances: Determine if a system is CPU-bound, GPU-bound, or memory-bandwidth-bound, guiding hardware selection (e.g., fewer high-memory instances vs. more smaller instances).
Evaluate optimization ROI: Quantify the latency/throughput improvement from techniques like continuous batching or model quantization to justify engineering investment.

Bottleneck Identification & Profiling

The shape of the throughput-latency curve points to specific system bottlenecks. A sharp latency increase at low load suggests a serial bottleneck (e.g., single-threaded preprocessing). A gradual curve that plateaus indicates a saturated hardware resource (e.g., GPU compute or memory bandwidth).

Flat latency, then spike: Often indicates queue saturation after a fixed number of parallel workers (e.g., GPU threads) are occupied.
Compare curves: Generate separate curves for different components (e.g., prefill phase vs. decode phase) using profiling tools like PyTorch Profiler or NVIDIA Nsight to isolate the slowest stage.

Load Testing & Regression Detection

A baseline throughput-latency curve serves as a contract for system performance. It is used during canary analysis and load testing to detect regressions.

Pre-deployment validation: A new model version or server configuration must meet or improve upon the baseline curve.
Automated regression testing: Performance tests can fail if latency at target QPS exceeds the baseline by a defined margin (e.g., 10%).
Understanding autoscaling lag: The curve shows the performance penalty incurred during the delay between a traffic spike and new resource provisioning.
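An automated regression check of the kind described above can be sketched as a comparison of two curves sampled at the same QPS levels. The function name, the dictionary layout, and the latency figures are assumptions for illustration; the 10% margin comes from the example in the list.

```python
def find_regressions(baseline, candidate, margin=0.10):
    """Return the load levels where the candidate is slower than allowed.

    baseline, candidate: dicts mapping QPS -> latency in ms (e.g. P99).
    A level is a regression when candidate latency exceeds
    baseline latency * (1 + margin). Levels missing from the candidate
    are skipped.
    """
    regressions = []
    for qps, base_ms in sorted(baseline.items()):
        cand_ms = candidate.get(qps)
        if cand_ms is not None and cand_ms > base_ms * (1 + margin):
            regressions.append((qps, base_ms, cand_ms))
    return regressions


# Hypothetical P99 latencies (ms) at three target load levels.
baseline_curve = {100: 200, 200: 250, 300: 400}
candidate_curve = {100: 205, 200: 290, 300: 430}

failures = find_regressions(baseline_curve, candidate_curve)
print(failures)  # -> [(200, 250, 290)]
```

A performance test would fail the build when the returned list is non-empty. Because single load-test runs are noisy, repeating each level several times before comparing is usually worthwhile.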

Architecture Selection & Comparison

The curve is the primary metric for comparing different inference engines, hardware, and model optimizations. Engineers generate curves for each candidate to make data-driven decisions.

vLLM vs. TensorRT: Compare throughput at target latency for the same model.
Synchronous vs. Asynchronous inference: The curve differs significantly; async APIs can offer higher throughput but complicate latency measurement and client-side logic.
Quantized (INT8) vs. FP16 models: The curve shows the trade-off between potential latency reduction and any accuracy loss for a given workload.

Client-Side Performance Modeling

For applications with strict responsiveness requirements (e.g., interactive chat), the curve informs client-side logic and user experience design.

Predictive UX: If the system is operating at high load (right side of the curve), the client can anticipate longer response times and potentially show a progress indicator.
Adaptive request strategies: Clients can implement backoff or request prioritization based on known system operating regions.
Streaming optimization: Understanding the relationship between Time to First Token (TTFT) and Time Per Output Token (TPOT) under load helps design efficient streaming protocols.

THROUGHPUT-LATENCY CURVE

Frequently Asked Questions

A throughput-latency curve is a fundamental performance graph used to characterize and optimize AI inference systems. It reveals the critical trade-off between how many requests a system can handle and how fast it can process them, guiding infrastructure decisions for production deployments.

A throughput-latency curve is a graph that plots the relationship between a system's request throughput (e.g., Queries Per Second (QPS)) and its corresponding average or tail latency (P99/P95), used to identify the optimal operating point before performance degradation. It visualizes the fundamental trade-off in serving systems: as the load (throughput) increases, the time to process each request (latency) also increases, often non-linearly. The curve is essential for defining Service Level Objectives (SLOs) and capacity planning, as it shows the maximum sustainable throughput before latency becomes unacceptable.


LATENCY BENCHMARKING

Related Terms

Understanding the throughput-latency curve requires familiarity with the specific latency metrics, optimization techniques, and system behaviors that define its shape. These related terms detail the components of end-to-end delay and the methods used to manage it.

Inference Latency

The total time delay between submitting an input to a machine learning model and receiving its corresponding output. This is the core measurement plotted on the Y-axis of a throughput-latency curve. It encompasses:

Compute time for forward passes.
Data transfer between CPU and GPU memory.
Queuing delay if the system is under load.

High inference latency directly degrades user experience in real-time applications like chatbots or translation services.

Queries Per Second (QPS)

A throughput metric measuring the number of inference requests a system can successfully process each second. This is the primary variable on the X-axis of a throughput-latency curve. The relationship is non-linear:

At low QPS, latency is stable and minimal.
As QPS increases, resource contention (e.g., GPU saturation, memory bandwidth) causes latency to rise.
Beyond a saturation point, latency rises steeply and non-linearly, and the system may fail.

Engineers use the curve to find the maximum QPS that meets a target latency Service Level Objective (SLO).

Tail Latency (P95/P99)

The high-percentile response times (e.g., the 95th or 99th percentile) that represent the slowest requests in a distribution. While a throughput-latency curve often plots average latency, P99 latency is critical for understanding worst-case user experience and system stability. Under load, tail latency degrades faster than average latency due to factors like:

Resource stragglers (slower GPU cores).
Garbage collection pauses.
Network jitter.

Optimizing for tail latency often requires different strategies than optimizing for average latency.
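The gap between average and tail latency is easy to see on a small sample. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions, and a made-up set of 100 latencies in which three requests are slow.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 97 fast requests and 3 stragglers.
latencies_ms = [100] * 97 + [900, 1200, 3000]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean_ms:.0f} ms")                    # mean=148 ms
print(f"P50={percentile(latencies_ms, 50)} ms")    # P50=100 ms
print(f"P95={percentile(latencies_ms, 95)} ms")    # P95=100 ms
print(f"P99={percentile(latencies_ms, 99)} ms")    # P99=1200 ms
```

The mean, P50, and P95 all look healthy, while P99 is twelve times the median. This is why a curve plotted on average latency alone can hide the point at which the slowest requests start to breach an SLO.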

Continuous Batching

An inference optimization technique, also known as dynamic or in-flight batching, where new requests are dynamically added to a running batch as previous requests finish. This is a key method for improving the throughput-latency trade-off. Instead of waiting to fill a static batch (which increases latency), the scheduler:

Schedules at the iteration level, so requests with different output lengths do not hold one another up.
Maximizes GPU utilization by keeping hardware busy.
Reduces average latency under variable load.

Frameworks like vLLM and TGI implement continuous batching to achieve superior points on the throughput-latency curve.

Request Queuing Delay

The time an inference request spends waiting in a scheduler's queue before its execution begins. This is a major, often dominant, component of end-to-end latency under high load and a direct contributor to the shape of the throughput-latency curve. As offered load (QPS) approaches system capacity:

Queues form due to finite processing resources.
Queuing delay increases non-linearly.
This causes the characteristic 'knee' in the curve where latency spikes.

Effective load shedding and autoscaling are required to manage queuing delay.

Service Level Objective (SLO) for Latency

A target reliability goal defined for a specific latency percentile (e.g., P99 latency < 200ms). SLOs are the business constraints that define the operating region on the throughput-latency curve. Engineering teams use the curve to:

Determine the maximum sustainable throughput that does not violate the SLO.
Establish an error budget for performance regressions.
Guide capacity planning and autoscaling rules.

A curve that shows latency exceeding the SLO at low QPS indicates a fundamental performance bottleneck requiring optimization.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
