Queries Per Second (QPS) is a throughput metric that measures the maximum number of inference requests a system can successfully process each second, typically evaluated while adhering to a defined latency Service Level Objective (SLO). It quantifies a system's capacity under load, representing the sustainable operational ceiling before performance degrades. In latency benchmarking, QPS is not measured in isolation but is intrinsically linked to response time, forming a throughput-latency curve that defines the optimal operating point for production deployment.
Queries Per Second (QPS)

What is Queries Per Second (QPS)?
Queries Per Second (QPS) is a fundamental throughput metric for evaluating the performance of AI inference serving systems.
For infrastructure engineers and CTOs, optimizing QPS involves balancing computational efficiency against inference latency. Techniques like continuous batching and efficient KV cache management in engines like vLLM are employed to maximize QPS. The metric is critical for capacity planning, cost estimation, and ensuring a system can handle peak traffic loads without violating its latency SLO, making it a cornerstone of evaluation-driven development for production AI services.
Key Characteristics of QPS
Queries Per Second (QPS) is a throughput metric measuring the number of inference requests a system can successfully process each second. Its practical utility is defined by its relationship with latency and operational constraints.
Throughput vs. Latency Trade-off
QPS and latency are intrinsically linked. A system's throughput-latency curve shows that as QPS increases, average and tail latency (P95/P99) typically rise due to resource contention and request queuing delay. The optimal operating point is the highest sustainable QPS before latency exceeds the Service Level Objective (SLO). Pushing QPS beyond this point causes latency to rise sharply and non-linearly as queues build.
Defined by a Latency SLO
A QPS value is meaningless without a corresponding latency target. A valid specification is: '500 QPS at P99 latency < 200ms'. This means the system can handle 500 requests per second while ensuring 99% of requests complete within 200ms. The SLO for latency defines the quality of service. Reporting QPS without monitoring latency hides the degraded user experience that appears under load.
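A specification such as '500 QPS at P99 latency < 200ms' can be checked against load-test data roughly as follows. This is a minimal sketch: the function names are hypothetical, and it assumes a nearest-rank percentile and a list of per-request latencies collected over a known window.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceiling division, 1-based rank
    return ordered[rank - 1]

def meets_slo(latencies_ms, window_s, target_qps, p99_limit_ms):
    """True if throughput reaches target_qps and P99 latency is under the limit."""
    achieved_qps = len(latencies_ms) / window_s
    return achieved_qps >= target_qps and percentile(latencies_ms, 99) < p99_limit_ms

# Illustrative data: 1000 successful requests observed over 2 seconds.
latencies = [100] * 990 + [500] * 10
print(meets_slo(latencies, window_s=2, target_qps=500, p99_limit_ms=200))
```

Both conditions have to hold at once: a run that reaches 500 QPS but lets more than 1% of requests exceed 200ms does not satisfy the specification.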
Dependent on System Load & Concurrency
QPS is a measure of load. It is directly influenced by the number of concurrent requests the system is processing. Key factors determining achievable QPS include:
- Hardware Capacity: GPU memory bandwidth, CPU cores, and network I/O.
- Model Characteristics: Size, architecture, and computational graph.
- Optimization Techniques: Use of continuous batching, model quantization (INT8/FP16), and efficient kernels via TensorRT.
- Request Profile: Payload size, input/output token length, and complexity.
A Composite, Not Isolated, Metric
QPS is the aggregate result of multiple underlying performance factors. It cannot be optimized in isolation. Improving QPS requires addressing specific bottlenecks:
- Inference Latency: Reducing prefilling and decoding latency.
- Memory Efficiency: Using PagedAttention (as in vLLM) to reduce KV cache waste.
- Overhead Reduction: Minimizing GPU kernel launch overhead through operator fusion.
- System Lag: Accounting for autoscaling lag and cold start latency during traffic fluctuations.
Primary Use: Capacity Planning & Scaling
QPS is the fundamental metric for infrastructure sizing and cost forecasting. Engineers use it to:
- Determine the number of GPU instances required to meet expected traffic.
- Set scaling rules for cloud deployments.
- Calculate the cost per query for a deployed model.
- Establish performance baselines and detect regressions via canary analysis.
QPS translates business demand (user requests) into technical resource requirements.
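The sizing and cost steps listed above reduce to simple arithmetic. The sketch below uses hypothetical function names and made-up numbers (peak traffic, per-replica QPS, hourly price), and assumes a fixed headroom fraction is kept in reserve.

```python
import math

def replicas_needed(peak_qps, qps_per_replica, headroom=0.25):
    """Replicas required for peak_qps while reserving a fraction of capacity."""
    usable_qps = qps_per_replica * (1 - headroom)
    return math.ceil(peak_qps / usable_qps)

def cost_per_1k_queries(hourly_cost_per_replica, replicas, served_qps):
    """Cost of serving 1000 queries at a steady served_qps."""
    queries_per_hour = served_qps * 3600
    return hourly_cost_per_replica * replicas / queries_per_hour * 1000

# Assumed inputs: 1200 QPS peak, 200 QPS per replica at the latency SLO, $4/hour each.
n = replicas_needed(peak_qps=1200, qps_per_replica=200)
print(n, round(cost_per_1k_queries(4.0, n, served_qps=1200), 4))
```

The per-replica QPS fed into this calculation should be the value measured at the latency SLO, not the raw peak, otherwise the fleet will be undersized.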
Benchmarked Under Realistic Conditions
Accurate QPS measurement requires load testing that mimics production. This includes:
- Variable request patterns (bursts, sustained load).
- Realistic input lengths and payloads.
- A mix of synchronous and asynchronous inference patterns.
- Monitoring of both end-to-end latency (user perspective) and time to first token (TTFT) for streaming.
Profiling (CPU/GPU) and distributed tracing are essential to move from measuring QPS to identifying the bottleneck that limits it.
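The sustained-plus-burst request pattern mentioned above can be sketched as an open-loop arrival schedule. This example assumes Poisson arrivals, which is a common modelling choice rather than something the text prescribes; the function name and rates are illustrative.

```python
import random

def poisson_arrivals(rate_qps, duration_s, seed=0, start_s=0.0):
    """Arrival timestamps (seconds) for a Poisson process at rate_qps."""
    rng = random.Random(seed)
    t, arrivals = start_s, []
    while True:
        t += rng.expovariate(rate_qps)  # exponential gap between requests
        if t >= start_s + duration_s:
            return arrivals
        arrivals.append(t)

# Sustained 100 QPS for 60 s, plus an extra 300 QPS burst from t=20 s to t=25 s.
sustained = poisson_arrivals(100, 60, seed=1)
burst = poisson_arrivals(300, 5, seed=2, start_s=20.0)
schedule = sorted(sustained + burst)
```

A load generator would send one request at each timestamp in `schedule` regardless of whether earlier requests have returned, so that queuing delay shows up in the measured latency instead of being hidden by a slowed-down client.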
The Critical Relationship: QPS vs. Latency
In production AI systems, throughput and latency are intrinsically linked performance metrics that define the capacity and responsiveness of an inference service.
Queries Per Second (QPS) is a throughput metric measuring the maximum number of successful inference requests a system can process per second. It is not measured in isolation but is evaluated against a target Service Level Objective (SLO) for latency, such as P99 < 200ms. The relationship is defined by a throughput-latency curve, where increasing QPS typically increases average and tail latency due to resource contention and request queuing delay.
Engineering the optimal operating point involves balancing QPS and latency through techniques like continuous batching and efficient KV cache management. Exceeding a system's optimal QPS causes latency to spike non-linearly, violating SLOs. Therefore, performance baselines and canary analysis are essential for establishing sustainable QPS limits under real-world concurrent request loads while meeting latency guarantees.
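Given a load sweep, the sustainable QPS limit is the highest tested load level at which the latency SLO still holds. A minimal sketch, assuming the sweep is a list of hypothetical `(offered_qps, p99_ms)` measurements:

```python
def max_sustainable_qps(sweep, p99_limit_ms):
    """Highest offered QPS whose P99, and that of every lower level, meets the SLO."""
    best = None
    for qps, p99_ms in sorted(sweep):
        if p99_ms >= p99_limit_ms:
            break  # SLO violated: stop at the previous level
        best = qps
    return best

# Illustrative sweep: latency climbs slowly, then jumps past the knee.
sweep = [(100, 80), (200, 95), (300, 130), (400, 190), (500, 420)]
print(max_sustainable_qps(sweep, p99_limit_ms=200))  # 400
```

Stopping at the first violation, instead of taking the maximum over all passing points, avoids reporting a level that only passed because of measurement noise beyond the knee.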
QPS vs. Related Performance Metrics
A comparison of Queries Per Second (QPS) with other key metrics used to evaluate the performance and efficiency of AI inference serving systems.
| Metric | Queries Per Second (QPS) | Inference Latency | Concurrent Requests | Throughput-Latency Curve |
|---|---|---|---|---|
| Core Definition | Throughput: Successful requests processed per second. | Time delay: Request submission to response receipt. | Load: Simultaneous requests being processed. | Relationship: Graph plotting QPS against latency. |
| Primary Unit | Requests/Second | Milliseconds (ms) | Count (Integer) | Graph (2D Plot) |
| Measurement Focus | System capacity under sustained load. | User-perceived responsiveness for a single request. | Instantaneous system load and resource utilization. | Optimal operating point before performance degradation. |
| Key Dependency | Heavily dependent on meeting a target latency Service Level Objective (SLO). | Independent base measurement; defines the SLO for QPS. | Direct driver of queuing delay and resource contention. | Derived from sweeping measurements of QPS and latency. |
| Optimization Goal | Maximize while adhering to latency SLO. | Minimize for a given request profile. | Manage to prevent queue overflows and SLO breaches. | Identify the 'knee' where latency rises sharply. |
| Primary Bottleneck | GPU compute throughput, memory bandwidth, batch scheduling efficiency. | Slowest component in the pipeline (CPU, GPU, network, queue). | Available hardware resources (GPU memory, compute cores). | Reveals the limiting resource (e.g., compute vs. memory bandwidth). |
| Use in SLO Definition | Target throughput at a specified latency percentile (e.g., P99 < 200ms). | The percentile target itself (e.g., P99 latency). | Often used to define scaling thresholds and queue limits. | Used to model and predict SLO compliance under load. |
| Impact of Techniques | Increased by continuous batching, quantization, and optimized kernels. | Reduced by operator fusion, faster hardware, and speculative decoding. | Increased capacity via model scaling and efficient KV cache management (e.g., PagedAttention). | Curve shifts right/up with optimizations that improve efficiency. |
Technical Factors Affecting QPS
Queries Per Second (QPS) is a throughput metric measuring the number of inference requests a system can successfully process each second. The achievable QPS is not a static value but a dynamic equilibrium point determined by the complex interplay of hardware, software, and system architecture under a defined latency Service Level Objective (SLO).
Hardware & Compute Resources
The raw computational capacity of the underlying hardware is the fundamental ceiling for QPS. Key factors include:
- GPU/Accelerator Throughput: The FLOPS (floating-point operations per second) and memory bandwidth of the inference accelerator (e.g., NVIDIA H100, A100) directly limit how many model forward passes can be executed per second.
- CPU & Memory: The host CPU speed and system RAM bandwidth affect pre/post-processing, tokenization, and orchestration overhead, which can bottleneck the accelerator.
- Interconnect & Network: For distributed systems, the bandwidth and latency of the GPU interconnect (e.g., NVLink) and the network fabric (e.g., InfiniBand) govern how quickly data can be sharded or aggregated across devices and nodes.
Model Architecture & Size
The intrinsic complexity of the model being served is a primary determinant of per-request compute cost.
- Parameter Count & Layers: Larger models (e.g., 70B+ parameters) require more computations and memory transfers per token, reducing potential QPS compared to smaller models (e.g., 7B parameters) on the same hardware.
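- Attention Mechanism: The quadratic complexity of standard attention with sequence length is a major bottleneck. Linear or sparse attention variants reduce that compute cost, while Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) shrink the KV cache by sharing key/value heads; both approaches can significantly improve QPS for long contexts.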
- Activation Memory: The size of intermediate activations during inference impacts memory bandwidth pressure and cache efficiency.
Inference Optimization Techniques
Software-level optimizations dramatically increase QPS by improving hardware utilization.
- Continuous Batching: Dynamically batches incoming requests of varying lengths, keeping the GPU saturated even as individual requests finish. Throughput gains of 5-10x over static batching are often cited, though the actual improvement depends on the workload and on how much output lengths vary.
- Model Quantization: Reducing weight and activation precision from FP32 to FP16, INT8, or INT4 (e.g., via GPTQ, AWQ) cuts memory footprint and increases compute speed on supported hardware, directly boosting QPS.
- Kernel Fusion & Graph Optimization: Compilers like TensorRT or ONNX Runtime fuse sequential operations into single, optimized GPU kernels, reducing launch overhead and memory I/O.
- Speculative Decoding: Uses a small draft model to propose token sequences verified in parallel by the main model, reducing the number of slow autoregressive steps and improving QPS for longer outputs.
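The memory effect of quantization noted above can be estimated with simple arithmetic. This sketch uses a hypothetical helper and a 7B-parameter example; it ignores the small overhead of quantization scales and zero points.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(name, weight_memory_gb(7e9, bits), "GB")
```

Memory freed by smaller weights can be given to the KV cache, which is one way quantization raises the number of concurrent requests and, through that, QPS.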
Memory & Cache Management
Efficient memory usage determines how many concurrent requests can be handled.
- KV Cache Efficiency: The Key-Value cache for autoregressive models consumes vast memory. PagedAttention (as used in vLLM) largely eliminates fragmentation by managing the KV cache in fixed-size blocks that need not be contiguous in memory, allowing much higher concurrency and QPS.
- Static vs. Dynamic Shapes: Systems that can handle variable sequence lengths without recompilation (dynamic shapes) are more flexible but may sacrifice some peak QPS compared to pre-compiled static shapes.
- CPU-GPU Data Transfer: Minimizing the movement of data across the PCIe bus (e.g., by keeping intermediate tensors resident on the GPU) reduces per-request overhead.
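To see why KV cache size bounds concurrency, it helps to estimate the cache footprint per token. The sketch below uses the standard sizing formula for a decoder-only transformer; the model dimensions, the 40 GiB cache budget, and the function names are illustrative assumptions, not figures from a specific deployment.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes stored per token (FP16 by default)."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def max_concurrent_sequences(kv_budget_bytes, tokens_per_sequence, bytes_per_token):
    """How many full-length sequences fit in the KV cache budget."""
    return kv_budget_bytes // (tokens_per_sequence * bytes_per_token)

GIB = 1024 ** 3
full_heads = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
shared_heads = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(max_concurrent_sequences(40 * GIB, 2048, full_heads))    # 40
print(max_concurrent_sequences(40 * GIB, 2048, shared_heads))  # 160
```

Under these assumptions, cutting the number of key/value heads by 4x allows 4x as many 2048-token sequences in the same memory, which is the mechanism by which cache efficiency translates into higher concurrency.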
Serving System & Scheduling
The design of the inference server and its scheduler dictates how work is parallelized.
- Request Scheduling Policy: Policies like First-In-First-Out (FIFO) or shortest-job-first affect queuing delays and fairness, influencing the throughput-latency trade-off.
- Concurrent Request Limit: The maximum number of requests processed simultaneously is tuned to maximize GPU utilization without causing excessive request queuing delay or memory overflow.
- Autoscaling & Load Balancing: The speed and efficiency with which a cluster can scale replicas in response to load (autoscaling lag) determines the system's ability to maintain QPS during traffic spikes.
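The effect of scheduling policy on queuing delay can be illustrated with a toy case: four jobs already waiting for a single worker. The service times are made up, and the helper name is hypothetical.

```python
def mean_wait(service_times):
    """Mean time jobs wait before starting when run back to back in this order."""
    waited, elapsed = 0.0, 0.0
    for s in service_times:
        waited += elapsed   # this job waited for everything ahead of it
        elapsed += s
    return waited / len(service_times)

jobs = [8.0, 1.0, 2.0, 1.0]          # arrival order, seconds of work each
print(mean_wait(jobs))               # FIFO: 7.0
print(mean_wait(sorted(jobs)))       # shortest-job-first: 1.75
```

Total work, and therefore throughput, is identical in both orders; only the latency distribution changes. Shortest-job-first lowers mean wait but can starve long requests, which is the fairness concern noted above.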
The Throughput-Latency Trade-off
QPS cannot be evaluated in isolation; it exists on a throughput-latency curve. Pushing for maximum QPS by overloading the system with concurrent requests will inevitably increase average and tail latency (P99/P95).
- Operating Point Selection: The target QPS is chosen based on a Service Level Objective (SLO) for Latency (e.g., P99 < 1s). The system is provisioned and tuned to operate at the QPS that meets this SLO.
- Performance Baseline: Establishing a baseline under a target load is essential for detecting regressions. Tools like profiling (CPU/GPU) and distributed tracing are used for bottleneck identification to optimize this trade-off.
- Canary Analysis: New model versions or configurations are tested against the baseline QPS/latency on a subset of traffic before full deployment.
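A canary gate against the baseline can be expressed as a pair of tolerance checks. This is a sketch only: the dictionary keys, the 10% latency tolerance, and the 5% throughput tolerance are assumptions chosen for illustration.

```python
def canary_passes(baseline, candidate, max_latency_regression=0.10, max_qps_drop=0.05):
    """True if the candidate stays within tolerance of the baseline on both metrics."""
    latency_ok = candidate["p99_ms"] <= baseline["p99_ms"] * (1 + max_latency_regression)
    qps_ok = candidate["qps"] >= baseline["qps"] * (1 - max_qps_drop)
    return latency_ok and qps_ok

baseline = {"qps": 400, "p99_ms": 180}
print(canary_passes(baseline, {"qps": 395, "p99_ms": 190}))  # True
print(canary_passes(baseline, {"qps": 410, "p99_ms": 230}))  # False
```

The second candidate is rejected even though its QPS improved, because throughput gained at the expense of tail latency moves the system past its operating point.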
Frequently Asked Questions
Queries Per Second (QPS) is a fundamental throughput metric for AI inference systems, measuring the number of requests successfully processed per second. These questions address its calculation, trade-offs, and role in production performance management.
Queries Per Second (QPS) is a throughput metric that measures the number of inference requests a system can successfully process and return within one second. It is calculated by dividing the total number of successful requests completed within a measurement window by the duration of that window in seconds. For example, if a service processes 12,000 successful requests in 60 seconds, its QPS is 200.
Key Calculation Notes:
- Only successful requests (e.g., returning a valid HTTP 200 response) are typically counted towards QPS. Failed or errored requests are excluded.
- The measurement window must be long enough to smooth out transient spikes (e.g., 1-5 minutes is common).
- QPS is often reported as an average over the window but should be monitored alongside its distribution over time (e.g., the P50 and P99 of per-second request counts) to understand consistency.
- The formula is:
QPS = (Total Successful Requests) / (Measurement Window in Seconds).
QPS is a direct indicator of a system's processing capacity and is the primary metric for scaling decisions and cost-per-inference calculations.
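The calculation above can be applied to a request log as follows. This sketch assumes each record is a hypothetical `(completion_time_s, status_code)` pair and treats HTTP 200 as the only success status.

```python
def measured_qps(records, window_start_s, window_end_s):
    """Successful requests completed in the window divided by its duration."""
    duration = window_end_s - window_start_s
    if duration <= 0:
        raise ValueError("window must have positive duration")
    successes = sum(
        1 for t, status in records
        if window_start_s <= t < window_end_s and status == 200
    )
    return successes / duration

# 12,000 successes spread over 60 seconds, plus 300 errors that are not counted.
records = [(i * 0.005, 200) for i in range(12000)]
records += [(i * 0.2, 500) for i in range(300)]
print(measured_qps(records, 0, 60))  # 200.0
```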
Related Terms
Queries Per Second (QPS) is a throughput metric, but its practical value is defined in relation to latency constraints. These related terms detail the components and trade-offs of a performant inference system.
Inference Latency
Inference latency is the total time delay between submitting an input to a machine learning model and receiving its output. It is the foundational constraint against which QPS is measured. A system's maximum sustainable QPS is the point where adding more queries causes latency to exceed a defined Service Level Objective (SLO).
- Components: Includes pre-processing, model forward pass, post-processing, and any network transfer.
- Primary Trade-off: Systems are typically tuned to maximize QPS while keeping latency (especially tail latency) below an acceptable threshold.
Throughput-Latency Curve
A throughput-latency curve is a graph that plots the relationship between a system's request throughput (QPS) and its corresponding average or percentile latency. It is the essential tool for capacity planning and performance optimization.
- Shape: Latency remains low and stable until a saturation point, after which it rises sharply and non-linearly as queues form.
- Use Case: Engineers use this curve to identify the optimal operating point—the highest QPS achievable before latency degrades unacceptably. This defines the practical QPS limit.
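The shape of the curve can be reproduced with the simplest queueing model. The sketch below assumes an M/M/1 queue (one server, Poisson arrivals, exponential service times), which is a textbook idealization and not a model of any particular inference server; in that model mean time in system is 1 / (service rate - arrival rate).

```python
def mm1_mean_latency_s(arrival_qps, service_qps):
    """Mean time in system for an M/M/1 queue; infinite at or past saturation."""
    if arrival_qps >= service_qps:
        return float("inf")
    return 1.0 / (service_qps - arrival_qps)

# A server that can complete 100 requests per second when fully busy.
for load in (10, 50, 90, 99):
    print(load, "QPS ->", round(mm1_mean_latency_s(load, 100) * 1000, 1), "ms")
```

Latency barely moves between 10 and 50 QPS, then grows tenfold between 90 and 99 QPS, which is why the practical limit sits below the theoretical service rate.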
Concurrent Requests
Concurrent requests are the number of client inference queries being processed simultaneously by a serving system. It is a direct driver of resource utilization and a key factor linking QPS and latency.
- Relationship to QPS: QPS = Concurrent Requests / Average Latency (Little's Law), with latency expressed in seconds and the system at steady state.
- Impact: High concurrency increases GPU utilization but can lead to request queuing delay if it exceeds the system's parallel processing capacity, directly increasing latency.
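A worked instance of the Little's Law relationship, using made-up numbers and hypothetical function names:

```python
def qps_from_concurrency(concurrent_requests, avg_latency_s):
    """Steady-state throughput implied by Little's Law."""
    return concurrent_requests / avg_latency_s

def concurrency_for_qps(target_qps, avg_latency_s):
    """In-flight requests needed to sustain target_qps at a given latency."""
    return target_qps * avg_latency_s

print(qps_from_concurrency(32, 0.25))   # 128.0
print(concurrency_for_qps(400, 0.25))   # 100.0
```

The second form is useful for sizing concurrency limits: sustaining 400 QPS at 250ms average latency requires room for about 100 requests in flight.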
Service Level Objective (SLO) for Latency
A Service Level Objective (SLO) for latency is a performance target defined for a specific latency percentile (e.g., P99 < 200ms). It is the benchmark that a system's reported QPS must satisfy.
- Defines QPS Validity: A reported QPS figure is meaningless without an associated latency SLO. A system might sustain 1000 QPS when only median (P50) latency is constrained but just 200 QPS under a strict P99 latency SLO.
- Error Budgets: SLOs create error budgets for performance, guiding decisions on autoscaling, deployment strategies like canary analysis, and infrastructure investment.
Continuous Batching
Continuous batching (also called in-flight batching or iteration-level scheduling) is a critical inference optimization technique that raises achievable QPS for a given latency target. It dynamically adds new requests to a running batch on the GPU as previous requests finish generation.
- Efficiency vs. Static Batching: Eliminates the need to wait for a fixed batch to complete, dramatically improving GPU utilization and throughput.
- Latency Benefit: Reduces average latency for individual requests compared to waiting for a full static batch, allowing higher QPS within latency SLOs. Engines like vLLM implement this.
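The difference from static batching can be shown with a toy step counter. This is a simplified sketch, not how vLLM or any other engine is implemented: it assumes every decode step costs the same regardless of how full the batch is, ignores prefill, and takes output lengths as known positive integers.

```python
from collections import deque

def static_batching_steps(output_lens, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    return sum(
        max(output_lens[i:i + batch_size])
        for i in range(0, len(output_lens), batch_size)
    )

def continuous_batching_steps(output_lens, batch_size):
    """Slots freed by finished requests are refilled before the next decode step."""
    queue = deque(output_lens)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step emits one token per active request
        active = [r - 1 for r in active if r > 1]
    return steps

lens = [10, 2, 2, 2, 10, 2, 2, 2]
print(static_batching_steps(lens, batch_size=2))      # 24
print(continuous_batching_steps(lens, batch_size=2))  # 16
```

In this example the short requests no longer wait for the long one sharing their batch, so the same work finishes in two thirds of the decode steps. The gain grows with the variance in output lengths and vanishes when all requests are the same length.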
Tail Latency (P99/P95)
Tail latency refers to the high-percentile response times (e.g., P95, P99) that represent the slowest requests in a distribution. Managing tail latency is paramount for defining a realistic, user-experience-focused QPS.
- Importance: While average latency might be low, a high P99 latency means 1% of requests are slow enough to give users a poor experience. QPS must be capped to keep tail latency in check.
- Causes: Often caused by variance in request queuing delay, cold start latency, garbage collection, or network interference. Optimization focuses on reducing variance, not just average speed.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.