Inferensys

Queries Per Second (QPS)

Queries Per Second (QPS) is a throughput metric measuring the number of inference requests a system can successfully process each second, often evaluated against a target latency Service Level Objective (SLO).
LATENCY BENCHMARKING

What is Queries Per Second (QPS)?

Queries Per Second (QPS) is a fundamental throughput metric for evaluating the performance of AI inference serving systems.

Queries Per Second (QPS) is a throughput metric that measures the maximum number of inference requests a system can successfully process each second, typically evaluated while adhering to a defined latency Service Level Objective (SLO). It quantifies a system's capacity under load, representing the sustainable operational ceiling before performance degrades. In latency benchmarking, QPS is not measured in isolation but is intrinsically linked to response time, forming a throughput-latency curve that defines the optimal operating point for production deployment.

For infrastructure engineers and CTOs, optimizing QPS involves balancing computational efficiency against inference latency. Techniques like continuous batching and efficient KV cache management in engines like vLLM are employed to maximize QPS. The metric is critical for capacity planning, cost estimation, and ensuring a system can handle peak traffic loads without violating its latency SLO, making it a cornerstone of evaluation-driven development for production AI services.

Key Characteristics of QPS

Queries Per Second (QPS) is a throughput metric measuring the number of inference requests a system can successfully process each second. Its practical utility is defined by its relationship with latency and operational constraints.

01

Throughput vs. Latency Trade-off

QPS and latency are intrinsically linked. A system's throughput-latency curve shows that as QPS increases, average and tail latency (P95/P99) typically rise due to resource contention and request queuing delay. The optimal operating point is the highest sustainable QPS at which latency still meets the Service Level Objective (SLO). Beyond this point, queues build faster than they drain and latency rises sharply and non-linearly.
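
The shape of this curve can be illustrated with a textbook M/M/1 queue. This is a deliberate simplification (one server, Poisson arrivals, exponential service times) that real inference servers do not follow exactly; the function name and the 100 QPS service rate are assumptions chosen for illustration.

```python
def mm1_mean_latency_s(arrival_qps: float, service_qps: float) -> float:
    """Mean time in system (queueing + service) for an M/M/1 queue."""
    if arrival_qps >= service_qps:
        return float("inf")  # past saturation the queue grows without bound
    return 1.0 / (service_qps - arrival_qps)


SERVICE_QPS = 100.0  # hypothetical replica that can serve 100 requests/s

for load in (10, 50, 80, 90, 95, 99):
    latency_ms = mm1_mean_latency_s(load, SERVICE_QPS) * 1000
    print(f"{load:>3} QPS -> mean latency {latency_ms:7.1f} ms")
```

In this model, going from 50 to 99 QPS roughly doubles the load but multiplies mean latency by 50, which is why the operating point is usually set well below the saturation rate.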

02

Defined by a Latency SLO

A QPS value is meaningless without a corresponding latency target. A valid specification is: '500 QPS at P99 latency < 200ms'. This means the system can handle 500 requests per second while 99% of requests complete within 200ms. The latency SLO defines the quality of service. Reporting QPS without monitoring latency can mask a degraded user experience under load.
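
A minimal sketch of checking such a specification against recorded latencies follows. The helper names and the nearest-rank percentile definition are assumptions; monitoring systems often use interpolated or histogram-based percentiles instead.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(latencies_ms: list[float], window_s: float,
              target_qps: float, p99_limit_ms: float) -> bool:
    """True if the window sustained target_qps with P99 latency under the limit."""
    achieved_qps = len(latencies_ms) / window_s
    return achieved_qps >= target_qps and percentile(latencies_ms, 99) < p99_limit_ms


# Hypothetical 2-second window with 1,000 completed requests (500 QPS).
window = [120.0] * 990 + [450.0] * 10
print(meets_slo(window, window_s=2.0, target_qps=500, p99_limit_ms=200))
```

Both conditions are checked together: a window that reaches 500 QPS but has more than 1% of requests above 200ms does not meet the specification.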

03

Dependent on System Load & Concurrency

QPS is measured under load and is directly influenced by the number of concurrent requests the system is processing. Key factors determining achievable QPS include:

  • Hardware Capacity: GPU memory bandwidth, CPU cores, and network I/O.
  • Model Characteristics: Size, architecture, and computational graph.
  • Optimization Techniques: Use of continuous batching, model quantization (INT8/FP16), and efficient kernels via TensorRT.
  • Request Profile: Payload size, input/output token length, and complexity.
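
The link between concurrency, latency, and QPS can be made concrete with Little's Law (average requests in flight = arrival rate × average time in system), which holds for a system in steady state. The function names and figures below are assumptions for illustration.

```python
def required_concurrency(qps: float, mean_latency_s: float) -> float:
    """Average number of in-flight requests needed to sustain `qps` (Little's Law)."""
    return qps * mean_latency_s


def max_qps(concurrency_limit: int, mean_latency_s: float) -> float:
    """Upper bound on QPS when at most `concurrency_limit` requests run at once."""
    return concurrency_limit / mean_latency_s


# Hypothetical figures: 500 QPS at 200 ms mean latency, and a 64-request limit at 400 ms.
print(round(required_concurrency(500, 0.2), 3))
print(round(max_qps(64, 0.4), 3))
```

A consequence is that a fixed concurrency limit caps QPS: if mean latency doubles under load, the achievable QPS at that limit halves.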

04

A Composite, Not Isolated, Metric

QPS is the aggregate result of multiple underlying performance factors. It cannot be optimized in isolation. Improving QPS requires addressing specific bottlenecks:

  • Inference Latency: Reducing prefilling and decoding latency.
  • Memory Efficiency: Using PagedAttention (as in vLLM) to reduce KV cache waste.
  • Overhead Reduction: Minimizing GPU kernel launch overhead through operator fusion.
  • System Lag: Accounting for autoscaling lag and cold start latency during traffic fluctuations.

05

Primary Use: Capacity Planning & Scaling

QPS is the fundamental metric for infrastructure sizing and cost forecasting. Engineers use it to:

  • Determine the number of GPU instances required to meet expected traffic.
  • Set scaling rules for cloud deployments.
  • Calculate the cost per query for a deployed model.
  • Establish performance baselines and detect regressions via canary analysis.

QPS translates business demand (user requests) into technical resource requirements.
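
As a rough sketch of the sizing arithmetic, assuming a per-replica QPS that was measured at the latency SLO, the replica count and cost per query can be estimated as follows. The function names, the 25% headroom, and the prices are hypothetical.

```python
import math


def replicas_needed(peak_qps: float, qps_per_replica: float, headroom: float = 0.25) -> int:
    """Replicas required to serve peak_qps while keeping `headroom` spare capacity per replica."""
    usable_qps = qps_per_replica * (1 - headroom)
    return math.ceil(peak_qps / usable_qps)


def cost_per_1k_queries(hourly_cost_per_replica: float, replicas: int, served_qps: float) -> float:
    """Cost of 1,000 queries at a steady served_qps."""
    queries_per_hour = served_qps * 3600
    return hourly_cost_per_replica * replicas / queries_per_hour * 1000


# Hypothetical: 1,200 QPS peak, 200 QPS per replica at the SLO, 4.00 per replica-hour.
n = replicas_needed(peak_qps=1200, qps_per_replica=200)
print(n, "replicas")
print(f"{cost_per_1k_queries(4.0, n, 1200):.4f} per 1,000 queries at peak")
```

Cost per query rises when traffic falls below peak and the replica count stays fixed, which is one reason scaling rules are usually derived from the same QPS figures.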

06

Benchmarked Under Realistic Conditions

Accurate QPS measurement requires load testing that mimics production. This includes:

  • Variable request patterns (bursts, sustained load).
  • Realistic input lengths and payloads.
  • A mix of synchronous and asynchronous inference patterns.
  • Monitoring of both end-to-end latency (user perspective) and time to first token (TTFT) for streaming.

Profiling (CPU/GPU) and distributed tracing are essential to move from measuring QPS to identifying the bottlenecks that limit it.
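
One way to approximate sustained and bursty request patterns is to generate an open-loop arrival schedule up front and replay it against the service. The sketch below only builds the schedule, assuming Poisson arrivals; the function names, rates, and seeds are illustrative, and a real test would dispatch requests at these times with an asynchronous client.

```python
import random


def poisson_arrivals(rate_qps: float, duration_s: float, seed: int = 0) -> list[float]:
    """Arrival timestamps (seconds) for a Poisson process with the given mean rate."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_qps)  # exponential gap between consecutive requests
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


def bursty_arrivals(base_qps: float, burst_qps: float, burst_start_s: float,
                    burst_len_s: float, duration_s: float, seed: int = 0) -> list[float]:
    """Sustained base load with extra traffic added during one burst window."""
    base = poisson_arrivals(base_qps, duration_s, seed)
    burst = [burst_start_s + t for t in poisson_arrivals(burst_qps, burst_len_s, seed + 1)]
    return sorted(base + burst)


schedule = bursty_arrivals(base_qps=50, burst_qps=200, burst_start_s=10,
                           burst_len_s=5, duration_s=30)
in_burst = sum(1 for t in schedule if 10 <= t < 15)
print(len(schedule), "requests,", in_burst, "inside the burst window")
```

An open-loop schedule keeps sending at the planned times even when the service slows down, so queuing delay shows up in the measured latency rather than being hidden by a client that waits for each response.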

EVALUATION-DRIVEN DEVELOPMENT

The Critical Relationship: QPS vs. Latency

In production AI systems, throughput and latency are intrinsically linked performance metrics that define the capacity and responsiveness of an inference service.

Queries Per Second (QPS) is a throughput metric measuring the maximum number of successful inference requests a system can process per second. It is not measured in isolation but is evaluated against a target Service Level Objective (SLO) for latency, such as P99 < 200ms. The relationship is defined by a throughput-latency curve, where increasing QPS typically increases average and tail latency due to resource contention and request queuing delay.

Engineering the optimal operating point involves balancing QPS and latency through techniques like continuous batching and efficient KV cache management. Exceeding a system's optimal QPS causes latency to spike non-linearly, violating SLOs. Therefore, performance baselines and canary analysis are essential for establishing sustainable QPS limits under real-world concurrent request loads while meeting latency guarantees.
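
Given a measured throughput-latency curve, the sustainable QPS limit can be read off programmatically. This is a minimal sketch: the function name is an assumption and the measurements are made-up values, not benchmark results.

```python
from typing import Optional


def sustainable_qps(curve: list[tuple[float, float]], p99_slo_ms: float) -> Optional[float]:
    """Highest load step whose P99, and that of every lower step, stays under the SLO."""
    best = None
    for qps, p99_ms in sorted(curve):
        if p99_ms >= p99_slo_ms:
            break  # stop at the first violation; higher steps are past the knee
        best = qps
    return best


# Hypothetical load-test results as (offered QPS, observed P99 in ms).
measured = [(100, 80), (200, 95), (300, 130), (400, 190), (500, 420), (600, 1500)]
print(sustainable_qps(measured, p99_slo_ms=200))
```

Stopping at the first violation, rather than taking the maximum over all passing steps, avoids selecting a higher step that passed only because of measurement noise.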

Technical Factors Affecting QPS

Queries Per Second (QPS) is a throughput metric measuring the number of inference requests a system can successfully process each second. The achievable QPS is not a static value but a dynamic equilibrium point determined by the complex interplay of hardware, software, and system architecture under a defined latency Service Level Objective (SLO).

01

Hardware & Compute Resources

The raw computational capacity of the underlying hardware is the fundamental ceiling for QPS. Key factors include:

  • GPU/Accelerator Throughput: The FLOPS (floating-point operations per second) and memory bandwidth of the inference accelerator (e.g., NVIDIA H100, A100) directly limit how many model forward passes can be executed per second.
  • CPU & Memory: The host CPU speed and system RAM bandwidth affect pre/post-processing, tokenization, and orchestration overhead, which can bottleneck the accelerator.
  • Interconnect & Network: For distributed systems, the bandwidth and latency of the interconnect (e.g., NVLink between GPUs within a node, InfiniBand between nodes) govern how quickly data can be sharded or aggregated across devices.
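
A back-of-the-envelope roofline estimate shows how these two hardware limits interact during token generation. The sketch assumes each decode step reads every weight once and costs about 2 FLOPs per parameter per token, and it ignores KV cache traffic, attention cost, and software overhead. The hardware numbers are round illustrative values, not vendor specifications.

```python
def decode_tokens_per_s(params: float, bytes_per_param: float, batch_size: int,
                        mem_bw_bytes_s: float, peak_flops: float) -> float:
    """Rough upper bound on generated tokens/s for one accelerator."""
    weight_bytes = params * bytes_per_param
    memory_time_s = weight_bytes / mem_bw_bytes_s          # stream all weights once per step
    compute_time_s = 2 * params * batch_size / peak_flops  # ~2 FLOPs per parameter per token
    step_time_s = max(memory_time_s, compute_time_s)       # the slower resource sets the pace
    return batch_size / step_time_s


# Hypothetical 7B-parameter FP16 model on a device with 3 TB/s and 1 PFLOP/s.
for batch in (1, 32, 4096):
    rate = decode_tokens_per_s(7e9, 2, batch, mem_bw_bytes_s=3e12, peak_flops=1e15)
    print(f"batch {batch:>4}: ~{rate:,.0f} tokens/s")
```

Under these assumptions small batches are limited by memory bandwidth, so throughput grows almost linearly with batch size until the compute limit takes over.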

02

Model Architecture & Size

The intrinsic complexity of the model being served is a primary determinant of per-request compute cost.

  • Parameter Count & Layers: Larger models (e.g., 70B+ parameters) require more computations and memory transfers per token, reducing potential QPS compared to smaller models (e.g., 7B parameters) on the same hardware.
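  • Attention Mechanism: The quadratic complexity of standard attention with sequence length is a major bottleneck. Linear or sparse attention variants reduce that cost, while multi-query and grouped-query attention (MQA, GQA) share key/value heads to shrink the KV cache and memory traffic. Both approaches can significantly improve QPS for long contexts.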
  • Activation Memory: The size of intermediate activations during inference impacts memory bandwidth pressure and cache efficiency.
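
The memory effect of the attention layout can be estimated directly: the KV cache stores one key and one value vector per layer, per key/value head, per token. The model shape below (32 layers, 128-dimensional heads, 4,096-token context, FP16) is a hypothetical configuration, not a specific released model.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence: keys and values for every layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 1024 ** 3
# 32 KV heads = standard multi-head attention, 8 = grouped-query, 1 = multi-query.
for name, kv_heads in (("MHA", 32), ("GQA", 8), ("MQA", 1)):
    size = kv_cache_bytes(layers=32, kv_heads=kv_heads, head_dim=128, seq_len=4096)
    print(f"{name}: {size / GIB:.3f} GiB per sequence")
```

With a fixed memory budget, a smaller per-sequence cache means more sequences can be held in flight at once, which is how fewer KV heads translate into higher achievable concurrency.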

03

Inference Optimization Techniques

Software-level optimizations dramatically increase QPS by improving hardware utilization.

  • Continuous Batching: Dynamically batches incoming requests of varying lengths, keeping the GPU saturated even as individual requests finish. Depending on the workload, this can increase throughput severalfold over static batching.
  • Model Quantization: Reducing weight precision, and in some schemes activation precision, from FP32 to FP16, INT8, or INT4 (e.g., with weight-only methods such as GPTQ and AWQ) cuts memory footprint and can increase compute speed on supported hardware, boosting QPS.
  • Kernel Fusion & Graph Optimization: Compilers like TensorRT or ONNX Runtime fuse sequential operations into single, optimized GPU kernels, reducing launch overhead and memory I/O.
  • Speculative Decoding: Uses a small draft model to propose token sequences verified in parallel by the main model, reducing the number of slow autoregressive steps and improving QPS for longer outputs.
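
The benefit of continuous batching can be seen in a toy simulation that counts decode steps. It is a simplified model, not how any particular engine schedules work: it assumes one token per request per step, ignores prefill cost and memory limits, and treats every step as equally expensive. The function names and request lengths are illustrative.

```python
from collections import deque


def static_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes before the next batch starts."""
    return sum(max(output_lens[i:i + batch_size])
               for i in range(0, len(output_lens), batch_size))


def continuous_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """A finished request frees its slot, which a waiting request takes on the next step."""
    waiting = deque(output_lens)
    active: list[int] = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lens = [100, 5, 5, 5, 100, 5, 5, 5]  # hypothetical output lengths in tokens
print("static:", static_batching_steps(lens, batch_size=4))
print("continuous:", continuous_batching_steps(lens, batch_size=4))
```

The gap grows with the variance of output lengths: when every request has the same length, both policies need the same number of steps.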

04

Memory & Cache Management

Efficient memory usage determines how many concurrent requests can be handled.

  • KV Cache Efficiency: The Key-Value cache for autoregressive models can consume a large share of GPU memory. PagedAttention (as used in vLLM) largely eliminates fragmentation by managing the KV cache in fixed-size, non-contiguous blocks, allowing much higher concurrency and QPS.
  • Static vs. Dynamic Shapes: Systems that can handle variable sequence lengths without recompilation (dynamic shapes) are more flexible but may sacrifice some peak QPS compared to pre-compiled static shapes.
  • CPU-GPU Data Transfer: Minimizing the movement of data across the PCIe bus (e.g., by keeping intermediate tensors on the GPU between pipeline stages) reduces per-request overhead.
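
The effect of block-based KV cache management on memory waste can be sketched by comparing how many token slots each strategy reserves. This is a sizing illustration only, not vLLM's implementation; the 16-token block size, the 2,048-token maximum, and the sequence lengths are assumptions.

```python
import math


def contiguous_reserved_tokens(seq_lens: list[int], max_seq_len: int) -> int:
    """Pre-allocating one contiguous max-length region per request."""
    return len(seq_lens) * max_seq_len


def paged_reserved_tokens(seq_lens: list[int], block_size: int = 16) -> int:
    """Allocating fixed-size blocks on demand; only the last block of a request is partly empty."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [100, 350, 1200, 40]  # hypothetical current lengths of four requests
used = sum(seq_lens)
for name, reserved in (("contiguous", contiguous_reserved_tokens(seq_lens, 2048)),
                       ("paged", paged_reserved_tokens(seq_lens))):
    print(f"{name}: {reserved} slots reserved, {used / reserved:.1%} utilized")
```

Memory that is reserved but unused cannot hold another request's cache, so higher utilization directly raises the number of requests that fit on the device.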

05

Serving System & Scheduling

The design of the inference server and its scheduler dictates how work is parallelized.

  • Request Scheduling Policy: Policies like First-In-First-Out (FIFO) or shortest-job-first affect queuing delays and fairness, influencing the throughput-latency trade-off.
  • Concurrent Request Limit: The maximum number of requests processed simultaneously is tuned to maximize GPU utilization without causing excessive request queuing delay or memory overflow.
  • Autoscaling & Load Balancing: The speed and efficiency with which a cluster can scale replicas in response to load (autoscaling lag) determines the system's ability to maintain QPS during traffic spikes.
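
The influence of scheduling policy on queuing delay can be shown with a small example. It assumes a single worker, jobs that all arrive at time zero, and service times known in advance, which is rarely true for generative workloads; the job durations are made up.

```python
def mean_wait_s(service_times_s: list[float]) -> float:
    """Mean time jobs spend waiting to start when run one at a time in the given order."""
    waited, clock = 0.0, 0.0
    for service in service_times_s:
        waited += clock   # this job waited for everything scheduled before it
        clock += service
    return waited / len(service_times_s)


jobs = [8.0, 1.0, 2.0, 1.0]  # hypothetical service times in arrival order (seconds)
print("FIFO mean wait:", mean_wait_s(jobs))
print("SJF mean wait: ", mean_wait_s(sorted(jobs)))
```

Both orders finish all four jobs in the same 12 seconds, so throughput is identical; only the waiting time differs. Shortest-job-first lowers mean wait at the cost of delaying long requests, which is the fairness trade-off noted above.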

06

The Throughput-Latency Trade-off

QPS cannot be evaluated in isolation; it exists on a throughput-latency curve. Pushing for maximum QPS by overloading the system with concurrent requests will inevitably increase average and tail latency (P99/P95).

  • Operating Point Selection: The target QPS is chosen based on a latency Service Level Objective (SLO), e.g., P99 < 1s. The system is provisioned and tuned to operate at the QPS that meets this SLO.
  • Performance Baseline: Establishing a baseline under a target load is essential for detecting regressions. Tools like profiling (CPU/GPU) and distributed tracing are used for bottleneck identification to optimize this trade-off.
  • Canary Analysis: New model versions or configurations are tested against the baseline QPS/latency on a subset of traffic before full deployment.
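
A canary gate of this kind can be reduced to a comparison against the baseline with explicit tolerances. The sketch below is one possible shape for such a check; the class name, the 5% QPS and 10% P99 tolerances, and the sample figures are assumptions, and production canary analysis usually also accounts for statistical noise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadResult:
    qps: float
    p99_ms: float


def canary_passes(baseline: LoadResult, canary: LoadResult,
                  max_qps_drop: float = 0.05, max_p99_increase: float = 0.10) -> bool:
    """Accept the canary only if throughput and tail latency stay within tolerance."""
    qps_ok = canary.qps >= baseline.qps * (1 - max_qps_drop)
    p99_ok = canary.p99_ms <= baseline.p99_ms * (1 + max_p99_increase)
    return qps_ok and p99_ok


baseline = LoadResult(qps=400, p99_ms=180)
print(canary_passes(baseline, LoadResult(qps=395, p99_ms=190)))
print(canary_passes(baseline, LoadResult(qps=410, p99_ms=230)))
```

The second candidate has higher throughput but is rejected for its tail latency, which reflects the point that QPS gains only count when the latency target is still met.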

QUERIES PER SECOND (QPS)

Frequently Asked Questions

Queries Per Second (QPS) is a fundamental throughput metric for AI inference systems, measuring the number of requests successfully processed per second. These questions address its calculation, trade-offs, and role in production performance management.

Queries Per Second (QPS) is a throughput metric that measures the number of inference requests a system can successfully process and return within one second. It is calculated by dividing the total number of successful requests completed within a measurement window by the duration of that window in seconds. For example, if a service processes 12,000 successful requests in 60 seconds, its QPS is 200.

Key Calculation Notes:

  • Only successful requests (e.g., returning a valid HTTP 200 response) are typically counted towards QPS. Failed or errored requests are excluded.
  • The measurement window must be long enough to smooth out transient spikes (e.g., 1-5 minutes is common).
  • QPS is often reported as an average but should be monitored alongside its distribution over shorter sub-windows (e.g., the P50 and P99 of per-second request counts) to understand consistency.
  • The formula is: QPS = (Total Successful Requests) / (Measurement Window in Seconds).

QPS is a direct indicator of a system's processing capacity and is the primary metric for scaling decisions and cost-per-inference calculations.
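
The calculation above can be sketched as a small function over request records. The record format, a (completion time in seconds, HTTP status) pair, and the function name are assumptions for illustration; the synthetic data reproduces the 12,000-requests-in-60-seconds example with some failures mixed in.

```python
from typing import Iterable


def qps(records: Iterable[tuple[float, int]], window_start_s: float, window_end_s: float) -> float:
    """Successful requests completed inside the window divided by the window length."""
    duration_s = window_end_s - window_start_s
    if duration_s <= 0:
        raise ValueError("window must have positive length")
    successful = sum(1 for completed_at, status in records
                     if window_start_s <= completed_at < window_end_s and status == 200)
    return successful / duration_s


# Synthetic 60-second log: 12,300 requests, of which every 41st fails with HTTP 500.
log = [(i * 60 / 12300, 500 if i % 41 == 0 else 200) for i in range(12300)]
print(qps(log, 0, 60))
```

The 300 failed requests are excluded, so the result is 200 QPS rather than the 205 requests per second that were attempted.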

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.