Inferensys

Glossary

Throughput-Latency Curve

A throughput-latency curve is a graph that plots the relationship between a system's request throughput (e.g., queries per second) and its corresponding average or tail latency, used to identify the optimal operating point before performance degradation.
PERFORMANCE ANALYSIS

What is a Throughput-Latency Curve?

A throughput-latency curve is a fundamental graph used in system performance analysis to visualize the trade-off between a service's capacity and its responsiveness under load.

The curve is generated by applying a steadily increasing load to a system, such as a model inference server, and measuring the resulting latency at each achieved throughput level. It reveals the system's performance envelope and its optimal operating point before severe degradation. The characteristic shape shows latency remaining low and stable until a saturation point, after which queuing effects dominate: latency rises steeply and non-linearly while throughput plateaus.
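
The shape described above can be reproduced with a toy simulation. The following is a minimal sketch, not a model of any real inference server: it assumes a single-server FIFO queue with Poisson arrivals and exponentially distributed service times, and the names `simulate_step` and `sweep` are hypothetical helpers introduced here for illustration.

```python
import random
import statistics

def simulate_step(arrival_rate, service_rate, n_requests=20000, seed=0):
    """Simulate a single-server FIFO queue at one offered load level."""
    rng = random.Random(seed)
    clock = 0.0           # arrival time of the current request
    server_free_at = 0.0  # time at which the server clears its backlog
    latencies = []
    for _ in range(n_requests):
        clock += rng.expovariate(arrival_rate)       # Poisson arrivals
        start = max(clock, server_free_at)           # wait if the server is busy
        server_free_at = start + rng.expovariate(service_rate)
        latencies.append(server_free_at - clock)     # queueing + service time
    achieved_qps = n_requests / server_free_at
    p99 = statistics.quantiles(latencies, n=100)[98]
    return achieved_qps, statistics.fmean(latencies), p99

def sweep(service_rate=100.0, loads=(0.2, 0.5, 0.8, 0.95)):
    """Return one (offered_qps, achieved_qps, mean_s, p99_s) row per load level."""
    rows = []
    for rho in loads:
        offered = rho * service_rate
        achieved, mean_s, p99_s = simulate_step(offered, service_rate)
        rows.append((offered, achieved, mean_s, p99_s))
    return rows

if __name__ == "__main__":
    for offered, achieved, mean_s, p99_s in sweep():
        print(f"offered={offered:6.1f} qps  achieved={achieved:6.1f} qps  "
              f"mean={mean_s * 1000:7.2f} ms  p99={p99_s * 1000:7.2f} ms")
```

Plotting achieved throughput against mean or P99 latency from these rows should show the flat region followed by a steep rise as the offered load approaches the service rate.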

For AI inference services, this curve is critical for capacity planning and setting Service Level Objectives (SLOs). It identifies the maximum sustainable throughput before violating latency targets (e.g., P99 < 200ms). The region just below the knee of the curve is the optimal operating region, balancing resource utilization with acceptable responsiveness. Performance regressions, optimization gains, and the impact of techniques like continuous batching and model quantization are quantitatively validated by shifts in this curve, making it an essential tool for latency benchmarking and infrastructure engineering.

Key Characteristics of the Curve

A throughput-latency curve is a fundamental tool for characterizing the performance envelope of an inference serving system. It reveals the trade-off between processing capacity and responsiveness under load.

01

The Knee Point

The knee point (or elbow) of the curve is the critical operational region where latency begins to increase non-linearly with added throughput. It represents the maximum sustainable throughput before queuing delays dominate.

  • Definition: The point of greatest curvature, where the slope of the latency curve begins to increase sharply.
  • Operational Significance: Running near, but not beyond, the knee point maximizes resource utilization while maintaining acceptable latency SLOs.
  • Identification: Found by incrementally increasing request load (QPS) and observing where average or P95 latency deviates from a near-constant baseline.
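
The identification rule above can be sketched in a few lines. This is one possible heuristic, not a standard algorithm: `find_knee`, the baseline window of three points, and the 1.5x tolerance are assumptions chosen for illustration, and the measurements are made-up numbers.

```python
def find_knee(points, baseline_n=3, tolerance=1.5):
    """Return the last (qps, latency) point before latency exceeds tolerance x baseline."""
    points = sorted(points)
    # Baseline latency: average of the lowest-load measurements.
    baseline = sum(lat for _, lat in points[:baseline_n]) / baseline_n
    knee = points[0]
    for qps, lat in points:
        if lat > tolerance * baseline:
            break
        knee = (qps, lat)
    return knee

# Illustrative (qps, latency_ms) pairs, not real measurements.
measurements = [(10, 50), (20, 51), (30, 52), (40, 55), (50, 62), (60, 90), (70, 240)]
print(find_knee(measurements))  # (50, 62)
```

A threshold on deviation from baseline is simple but sensitive to the chosen tolerance; noisy curves may need smoothing or repeated runs per load step.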
02

Saturation & Degradation

Beyond the knee point, the system enters saturation. Latency rises steeply as the request arrival rate approaches the system's maximum service rate; once arrivals exceed it, the queue grows without bound.

  • Queuing Theory: Modeled by M/M/1 or M/G/1 queues; latency asymptotically approaches infinity as throughput nears theoretical maximum.
  • Symptoms: Rapid growth in P99 tail latency, increased error rates, and potential system instability.
  • Cause: Resource exhaustion (GPU, CPU, memory bandwidth) or scheduler limits, where request queuing delay becomes the primary latency component.
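
For reference, the M/M/1 case mentioned above has a closed-form mean latency. The symbols are the standard queuing-theory ones and are not defined in the text: arrival rate λ (throughput), service rate μ (maximum throughput), and utilization ρ. This is an idealized model; real inference servers with batching and variable request sizes deviate from it.

```latex
W = \frac{1}{\mu - \lambda} = \frac{1/\mu}{1 - \rho},
\qquad \rho = \frac{\lambda}{\mu} < 1
```

Under this model, mean latency is 2x the bare service time at ρ = 0.5, 10x at ρ = 0.9, and 100x at ρ = 0.99, so the growth is hyperbolic in ρ and diverges as ρ approaches 1.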
03

Optimal Operating Region

The optimal operating region is the plateau to the left of the knee, where latency is stable and predictable for a given range of throughput. This is the target for production SLOs.

  • Characteristics: Latency is minimally affected by small load fluctuations. Throughput scales nearly linearly with added concurrent requests.
  • Engineering Goal: Size autoscaling policies and load balancers to keep the system within this region under expected traffic patterns.
  • Determination: Established via load testing to map the curve, then setting throughput limits at 70-90% of the knee point value for a safety margin.
04

Impact of Batching

Batching strategies dramatically reshape the throughput-latency curve. Static batching creates a step-function, while continuous batching (e.g., in vLLM) produces a smoother, more efficient curve.

  • Static Batching: Fixed batch size trades off higher latency for higher throughput; the curve shows discrete performance points for each batch size.
  • Continuous/Dynamic Batching: New requests are added to running batches as others finish. This maximizes GPU utilization, pushing the knee point to a higher throughput for the same latency, creating a superior Pareto frontier.
  • Visual Cue: A family of curves, each for a different batching configuration, is used to select the optimal strategy.
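
The discrete points produced by static batching can be illustrated with a toy cost model. This is a sketch under stated assumptions: each batch is assumed to cost a fixed overhead plus a per-item term, the constants are invented, and the time spent waiting for a batch to fill is ignored.

```python
def static_batch_point(batch_size, fixed_ms=20.0, per_item_ms=2.0):
    """Toy cost model: one batch takes fixed_ms + per_item_ms * batch_size."""
    batch_ms = fixed_ms + per_item_ms * batch_size
    throughput_qps = batch_size / (batch_ms / 1000.0)
    return throughput_qps, batch_ms

# Each batch size yields one (throughput, latency) point on the curve.
for b in (1, 4, 16, 64):
    qps, ms = static_batch_point(b)
    print(f"batch={b:3d}  throughput={qps:7.1f} qps  latency={ms:6.1f} ms")
```

In this model, larger batches amortize the fixed overhead, so throughput climbs with diminishing returns while per-request latency grows linearly.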
05

Inference Configuration Effects

Model-serving parameters and hardware choices translate directly to shifts in the curve's position. Optimizations move the entire curve favorably.

  • Quantization (FP16/INT8): Reduces compute and memory bandwidth needs, often lowering the curve along the latency axis (faster responses) and sometimes moving the knee point to a higher throughput.
  • KV Cache Optimization (PagedAttention): Reduces memory fragmentation, allowing larger effective batch sizes, which shifts the knee point to the right (higher throughput).
  • Compiler Optimizations (TensorRT): Operator fusion and kernel tuning reduce GPU kernel launch overhead, improving both latency and throughput and shifting the entire curve down and to the right.
  • Speculative Decoding: Reduces time per output token (TPOT), flattening the latency curve, especially for longer generations.
06

Practical Measurement & Use

Generating an accurate curve requires controlled load testing that mimics production traffic patterns, including variable input lengths and request distributions.

  • Tooling: Use load-testing frameworks (e.g., Locust, Vegeta) or specialized ML serving benchmarks (e.g., MLPerf Inference).
  • Procedure: Ramp up request rate (QPS) in steps, measuring average and percentile latencies (P50, P95, P99) at each step until saturation is clear.
  • Analysis for SLOs: The curve directly informs Service Level Objective (SLO) definition. For example, an SLO of 'P99 latency < 1s' corresponds to a maximum allowable throughput read directly from the curve.
  • Capacity Planning: The curve dictates the number of instances needed to handle a target peak QPS while remaining in the optimal region.
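
The measurement and SLO steps above can be sketched as follows. This is a minimal illustration using only the Python standard library; `summarize_step` and `max_qps_under_slo` are hypothetical helpers, and the curve values are invented rather than measured.

```python
import statistics

def summarize_step(qps, latencies_ms):
    """Reduce one load step's raw latency samples to the usual percentiles."""
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {"qps": qps, "p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def max_qps_under_slo(curve, metric="p99", limit_ms=1000.0):
    """Highest measured QPS step that stays under the limit, scanning upward.

    Stops at the first violation so a noisy later step cannot pass.
    """
    best = None
    for step in sorted(curve, key=lambda s: s["qps"]):
        if step[metric] >= limit_ms:
            break
        best = step["qps"]
    return best

# Illustrative curve: one summary per load step.
curve = [
    {"qps": 50, "p50": 120, "p95": 200, "p99": 310},
    {"qps": 100, "p50": 130, "p95": 260, "p99": 450},
    {"qps": 150, "p50": 180, "p95": 520, "p99": 900},
    {"qps": 200, "p50": 400, "p95": 1400, "p99": 2600},
]
print(max_qps_under_slo(curve, metric="p99", limit_ms=1000.0))  # 150
```

The result is only as fine-grained as the load steps tested, so smaller steps near the knee give a more precise limit.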
OPERATIONAL REGIMES

Regions of a Throughput-Latency Curve

This table defines the distinct performance regimes observed when plotting a system's throughput against its latency, used to identify the optimal operating point and predict degradation.

| Region | Throughput | Latency Behavior | System State | Primary Bottleneck |
| --- | --- | --- | --- | --- |
| Underloaded | Low to Moderate | Constant & Low | Idle resources available; requests processed immediately. | None (Client-limited) |
| Elastic | Increasing Linearly | Stable, Slight Increase | Resources fully utilized; queue begins to form. | GPU Compute / Batch Processing |
| Knee Point (Optimal) | Peak Sustainable | Latency starts rising non-linearly | Maximum efficient throughput before severe queuing. | GPU Memory Bandwidth / Scheduler |
| Saturated | Plateaus | Rapidly Increasing | Queue growing faster than processing rate; latency highly variable. | Request Queuing Delay |
| Overloaded | Degrades | Unbounded Increase | System unstable; requests may time out or fail. | CPU Scheduler / Memory Thrashing |
| Characteristic Metric | Queries Per Second (QPS) | P50, P95, P99 Latency | Concurrent Requests in Flight | GPU Utilization, Queue Depth |

LATENCY BENCHMARKING

Practical Applications in AI Systems

The throughput-latency curve is a fundamental tool for capacity planning and performance optimization in production AI systems. The sections below detail its main applications for infrastructure engineers and CTOs.

01

Capacity Planning & SLO Definition

The curve provides the empirical data needed to define Service Level Objectives (SLOs) for latency and throughput. Engineers use it to answer:

  • What is the maximum sustainable QPS while keeping P99 latency under 200ms?
  • How many replicas are required to handle a forecasted peak load of X requests per second?
  • Where is the 'knee' of the curve beyond which latency degrades sharply? This identifies the safe operating limit at which additional resources must be provisioned.
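
The replica question above reduces to simple arithmetic once the knee is known. The sketch below assumes load is spread evenly across identical replicas and uses a hypothetical `headroom` factor of 0.8; the input numbers are illustrative.

```python
import math

def replicas_needed(peak_qps, knee_qps_per_replica, headroom=0.8):
    """Replicas required so each one runs at no more than headroom x knee throughput."""
    usable_qps = knee_qps_per_replica * headroom
    return math.ceil(peak_qps / usable_qps)

# 1000 QPS peak, knee at 150 QPS per replica, run each replica at 80% of the knee.
print(replicas_needed(peak_qps=1000, knee_qps_per_replica=150))  # 9
```

Uneven load balancing or bursty traffic would call for a lower headroom value than this even-split assumption suggests.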
02

Cost-Performance Optimization

This application directly addresses the CTO's mandate for infrastructure cost control. The curve reveals the optimal operating point that balances throughput (cost efficiency) with acceptable latency (user experience).

  • Identify diminishing returns: Adding more concurrent requests may yield minimal throughput gains while drastically increasing tail latency.
  • Right-size instances: Determine if a system is CPU-bound, GPU-bound, or memory-bandwidth-bound, guiding hardware selection (e.g., fewer high-memory instances vs. more smaller instances).
  • Evaluate optimization ROI: Quantify the latency/throughput improvement from techniques like continuous batching or model quantization to justify engineering investment.
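
The diminishing-returns point above can be made concrete by attaching a cost to each operating point. This is a rough sketch: the hourly price, the operating points, and the helper name `cost_per_million_requests` are all assumptions for illustration.

```python
def cost_per_million_requests(hourly_cost_usd, sustained_qps):
    """Instance cost attributed to one million requests at a steady throughput."""
    requests_per_hour = sustained_qps * 3600
    return hourly_cost_usd / requests_per_hour * 1_000_000

# Hypothetical operating points on one curve: (qps, p99_ms).
for qps, p99_ms in [(50, 180), (100, 210), (140, 400), (150, 1500)]:
    cost = cost_per_million_requests(hourly_cost_usd=4.0, sustained_qps=qps)
    print(f"{qps:4d} qps  p99={p99_ms:5d} ms  ${cost:6.2f} per 1M requests")
```

In this made-up example, pushing from 140 to 150 QPS trims the unit cost only slightly while P99 latency nearly quadruples, which is the pattern to look for near the knee.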
03

Bottleneck Identification & Profiling

The shape of the throughput-latency curve points to specific system bottlenecks. A sharp latency increase at low load suggests a serial bottleneck (e.g., single-threaded preprocessing). A gradual curve that plateaus indicates a saturated hardware resource (e.g., GPU compute or memory bandwidth).

  • Flat latency, then spike: Often indicates queue saturation after a fixed number of parallel workers (e.g., GPU threads) are occupied.
  • Compare curves: Generate separate curves for different components (e.g., prefill phase vs. decode phase) using profiling tools like PyTorch Profiler or NVIDIA Nsight to isolate the slowest stage.
04

Load Testing & Regression Detection

A baseline throughput-latency curve serves as a contract for system performance. It is used during canary analysis and load testing to detect regressions.

  • Pre-deployment validation: A new model version or server configuration must meet or improve upon the baseline curve.
  • Automated regression testing: Performance tests can fail if latency at target QPS exceeds the baseline by a defined margin (e.g., 10%).
  • Understanding autoscaling lag: The curve shows the performance penalty incurred during the delay between a traffic spike and new resource provisioning.
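
An automated regression check of the kind described above might look like the following. It is a minimal sketch: `find_regressions` is a hypothetical helper, both curves are assumed to have been measured at the same QPS levels, and the latency values are invented.

```python
def find_regressions(baseline, candidate, margin=0.10):
    """Return QPS levels where candidate latency exceeds baseline by more than margin.

    Both curves map a tested QPS level to a latency in milliseconds.
    """
    regressions = []
    for qps, base_ms in sorted(baseline.items()):
        cand_ms = candidate.get(qps)
        if cand_ms is not None and cand_ms > base_ms * (1 + margin):
            regressions.append((qps, base_ms, cand_ms))
    return regressions

baseline = {50: 120.0, 100: 150.0, 150: 300.0}
candidate = {50: 118.0, 100: 160.0, 150: 360.0}
print(find_regressions(baseline, candidate))  # [(150, 300.0, 360.0)]
```

A CI job could fail the build whenever the returned list is non-empty, ideally after averaging several runs to damp measurement noise.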
05

Architecture Selection & Comparison

The curve is the primary metric for comparing different inference engines, hardware, and model optimizations. Engineers generate curves for each candidate to make data-driven decisions.

  • vLLM vs. TensorRT-LLM: Compare throughput at target latency for the same model.
  • Synchronous vs. Asynchronous inference: The curve differs significantly; async APIs can offer higher throughput but complicate latency measurement and client-side logic.
  • Quantized (INT8) vs. FP16 models: The curve quantifies the latency and throughput gain for a given workload, which must then be weighed against any accuracy loss measured separately.
06

Client-Side Performance Modeling

For applications with strict responsiveness requirements (e.g., interactive chat), the curve informs client-side logic and user experience design.

  • Predictive UX: If the system is operating at high load (right side of the curve), the client can anticipate longer response times and potentially show a progress indicator.
  • Adaptive request strategies: Clients can implement backoff or request prioritization based on known system operating regions.
  • Streaming optimization: Understanding the relationship between Time to First Token (TTFT) and Time Per Output Token (TPOT) under load helps design efficient streaming protocols.
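
The TTFT and TPOT relationship above is often approximated with a simple additive model. The sketch below assumes TPOT is constant across the generation, which is a simplification; under load both terms tend to grow, and the input values are illustrative.

```python
def end_to_end_latency_ms(ttft_ms, tpot_ms, output_tokens):
    """Approximate streaming latency: first token, then one TPOT per remaining token."""
    return ttft_ms + tpot_ms * max(output_tokens - 1, 0)

# 300 ms to first token, 25 ms per subsequent token, 200 output tokens.
print(end_to_end_latency_ms(ttft_ms=300, tpot_ms=25, output_tokens=200))  # 5275
```

For long generations the TPOT term dominates, so separate throughput-latency curves for TTFT and TPOT are usually more informative than a single end-to-end curve.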
THROUGHPUT-LATENCY CURVE

Frequently Asked Questions

A throughput-latency curve is a fundamental performance graph used to characterize and optimize AI inference systems. It reveals the critical trade-off between how many requests a system can handle and how fast it can process them, guiding infrastructure decisions for production deployments.

A throughput-latency curve is a graph that plots the relationship between a system's request throughput (e.g., queries per second, QPS) and its corresponding average or tail latency (P95/P99), used to identify the optimal operating point before performance degradation. It visualizes the fundamental trade-off in serving systems: as the load (throughput) increases, the time to process each request (latency) also increases, often non-linearly. The curve is essential for defining Service Level Objectives (SLOs) and for capacity planning, as it shows the maximum sustainable throughput before latency becomes unacceptable.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.