Inferensys

Tokens per Second (TPS)

Tokens per Second (TPS) is a throughput metric that quantifies the number of output tokens a large language model inference system can generate in one second.
LLM PERFORMANCE MONITORING

What is Tokens per Second (TPS)?

A core throughput metric for evaluating the operational efficiency of large language model inference systems.

Tokens per Second (TPS) is a throughput metric that quantifies the number of output tokens an LLM inference system can generate in one second, typically reported as a peak or sustained rate under specific hardware, batching, and sequence length configurations. It is a critical Key Performance Indicator (KPI) for production deployments, directly correlating with infrastructure cost and user experience for streaming applications. High TPS indicates efficient utilization of GPU or NPU resources, often achieved through techniques like continuous batching and optimized KV cache management.

Monitoring TPS alongside latency metrics like Time to First Token (TTFT) and inter-token latency provides a complete view of system performance. Engineers track TPS to benchmark hardware, optimize inference parameters, and validate that deployments meet Service Level Objectives (SLOs). It is influenced by model architecture size, quantization level, and the efficiency of the underlying serving framework, making it a fundamental measure for LLM performance monitoring and capacity planning.
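As a rough illustration of how these metrics relate, the sketch below derives TTFT, mean inter-token latency, and two TPS figures from the arrival timestamps of one streamed response. The function name, the `StreamMetrics` fields, and the sample timestamps are hypothetical and are not taken from any particular serving framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamMetrics:
    ttft_s: float          # time to first token
    mean_itl_s: float      # mean inter-token latency
    decode_tps: float      # output tokens per second after the first token
    end_to_end_tps: float  # output tokens per second including TTFT


def stream_metrics(request_start_s: float, token_times_s: List[float]) -> StreamMetrics:
    """Derive latency and throughput figures from per-token arrival timestamps."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    n = len(token_times_s)
    ttft = token_times_s[0] - request_start_s
    total = token_times_s[-1] - request_start_s
    # Gaps between consecutive tokens; empty when only one token was produced.
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    decode_time = token_times_s[-1] - token_times_s[0]
    decode_tps = (n - 1) / decode_time if decode_time > 0 else 0.0
    end_to_end_tps = n / total if total > 0 else 0.0
    return StreamMetrics(ttft, mean_itl, decode_tps, end_to_end_tps)


# Hypothetical trace: request sent at t = 0.0 s, five tokens arrive afterwards.
m = stream_metrics(0.0, [0.50, 0.55, 0.60, 0.65, 0.70])
print(m)
```

In this trace the decode-only rate is about 20 tokens per second, while the end-to-end rate is closer to 7 because the half-second TTFT is included. This is one reason a TPS figure should state which interval it covers.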

PERFORMANCE OPTIMIZATION

Key Factors Influencing TPS

Tokens per Second (TPS) is a critical throughput metric for LLM inference. Its value is not intrinsic to a model but is determined by a complex interplay of hardware capabilities, software optimizations, and request characteristics.

01

Hardware & Compute

The foundational layer determining peak TPS. Key components include:

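  • GPU/Accelerator: The type (e.g., H100, A100), memory bandwidth, and number of devices. Higher compute throughput (FLOPS) shortens the compute-bound prefill phase, while higher memory bandwidth speeds up the decode phase, which is typically memory-bandwidth-bound.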
  • Memory Capacity: The size of VRAM dictates the maximum model size and batch size that can be loaded. Insufficient memory forces costly swapping to CPU RAM, crippling TPS.
  • Interconnect: The speed of links between GPUs (e.g., NVLink, InfiniBand) is crucial for multi-GPU inference, affecting how efficiently workloads are distributed.
02

Model Architecture & Size

The model's own design imposes fundamental constraints.

  • Parameter Count: Larger models (e.g., 70B vs 7B parameters) require more computations per token, reducing TPS for a given hardware setup.
  • Attention Mechanism: The attention variant (e.g., standard multi-head vs. grouped-query or multi-query attention) determines how many key/value heads must be cached and read at each decode step. Variants that share key/value heads shrink the KV cache and its memory traffic, which can substantially improve TPS.
  • Precision: Using lower numerical precision (e.g., FP16, BF16, INT8, or INT4 quantization) reduces memory footprint and memory traffic and, where the hardware supports the format natively, increases compute efficiency, often doubling or tripling TPS compared to FP32.
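The effect of parameter count and precision can be approximated with a back-of-the-envelope bound. The sketch below assumes single-request decoding that is limited purely by memory bandwidth, so every generated token requires reading all weights once; it ignores KV cache reads, batching, and kernel overhead. The bandwidth figure is a hypothetical round number, not the specification of any particular accelerator.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_bytes(n_params: float, precision: str) -> float:
    """Approximate size of the model weights in bytes."""
    return n_params * BYTES_PER_PARAM[precision]


def decode_tps_upper_bound(n_params: float, precision: str,
                           mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-size-1 decode speed if reading weights is the only cost."""
    return mem_bandwidth_bytes_per_s / weight_bytes(n_params, precision)


HYPOTHETICAL_BANDWIDTH = 2e12  # 2 TB/s, an assumed value for illustration

for n_params in (7e9, 70e9):
    for precision in ("fp16", "int8", "int4"):
        bound = decode_tps_upper_bound(n_params, precision, HYPOTHETICAL_BANDWIDTH)
        print(f"{n_params / 1e9:.0f}B {precision}: <= {bound:.0f} tokens/s")
```

Under these assumptions the bound scales inversely with both parameter count and bytes per parameter, which is why a 70B model decodes roughly ten times slower than a 7B model on the same device, and why halving precision roughly doubles the ceiling.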
03

Inference Serving & Batching

Software orchestration of requests is paramount for achieving high aggregate TPS.

  • Static vs. Continuous Batching: Static batching processes a fixed set of requests together. Continuous batching (or iteration-level batching) dynamically adds new requests to the batch as others finish, dramatically improving GPU utilization and overall system TPS.
  • KV Cache Management: Efficient caching of previous tokens' Key and Value vectors in the attention mechanism avoids redundant computation. The size and management strategy of this cache directly affect memory usage and decode speed.
  • Serving Framework: Specialized frameworks like vLLM, TensorRT-LLM, or TGI implement these optimizations, offering vastly higher TPS than naive model serving.
04

Request Characteristics

The nature of the user's input and desired output determines the workload for each request.

  • Input (Prompt) Length: Longer prompts increase the time and memory required for the initial prefill phase, impacting Time to First Token (TTFT) and the efficiency of the initial batch computation.
  • Output (Generation) Length: The number of tokens to be generated defines the duration of the decode phase. Longer generations increase total request latency but, with efficient batching, can improve overall system TPS by keeping GPUs saturated.
  • Decoding Strategy: Greedy decoding (selecting the highest probability token) is fastest. Sampling methods (top-k, top-p, temperature) introduce minor overhead. Complex strategies like beam search multiply the computational cost, severely reducing TPS.
05

System & Network Overhead

Infrastructure factors outside the core model execution can become bottlenecks.

  • Host CPU & I/O: The CPU must be powerful enough to handle tokenization, detokenization, and framework overhead without stalling the GPU. Slow disk I/O can delay model loading.
  • Network Latency: In distributed or client-server setups, the time to transmit prompts and stream tokens adds to total response time, though it doesn't reduce the TPS of the inference engine itself.
  • Orchestration & Queuing: Load balancers, API gateways, and request queues add minimal but non-zero latency. Under high load, queuing delays can dominate the user-perceived latency, masking the actual TPS of the inference engine.
06

Measurement Context

TPS is a variable metric, and its reported value must be interpreted with context.

  • Peak vs. Sustained TPS: Peak TPS is measured under ideal, saturated conditions (large batch, optimal input/output lengths). Sustained TPS reflects performance under a realistic, variable production workload and is typically lower.
  • Benchmark Conditions: Always ask: Batch size? Input/Output length? Hardware spec? Model precision? Without this context, a TPS number is meaningless for comparison.
  • Trade-off with Latency: Maximizing TPS often involves large batch sizes, which can increase latency for individual requests (higher TTFT, queue time). Engineering must balance throughput with latency Service Level Objectives (SLOs).
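The throughput-latency trade-off can be made concrete with a toy cost model. The sketch below assumes each decode step takes a fixed overhead plus a small per-sequence cost, so larger batches raise aggregate TPS while also lengthening every step. The constants (20 ms fixed, 1 ms per sequence) are invented for illustration; real values must be measured on the target system.

```python
def step_time_ms(batch: int, fixed_ms: int = 20, per_seq_ms: int = 1) -> int:
    """Toy model: duration of one decode step for a given batch size."""
    return fixed_ms + per_seq_ms * batch


def aggregate_tps(batch: int, fixed_ms: int = 20, per_seq_ms: int = 1) -> float:
    """Each step emits one token per sequence, so TPS = batch / step time."""
    return batch * 1000 / step_time_ms(batch, fixed_ms, per_seq_ms)


def largest_batch_within_slo(itl_slo_ms: int, max_batch: int = 256,
                             fixed_ms: int = 20, per_seq_ms: int = 1) -> int:
    """Largest batch whose step time (the inter-token latency) meets the SLO."""
    best = 0
    for batch in range(1, max_batch + 1):
        if step_time_ms(batch, fixed_ms, per_seq_ms) <= itl_slo_ms:
            best = batch
    return best


for batch in (1, 8, 30, 128):
    print(batch, step_time_ms(batch), round(aggregate_tps(batch), 1))

print(largest_batch_within_slo(itl_slo_ms=50))  # 30 under this toy model
```

In this model, moving from batch size 1 to 30 raises aggregate TPS from about 48 to 600 while inter-token latency grows from 21 ms to 50 ms. A latency SLO therefore caps the usable batch size and, with it, the achievable TPS.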
METRIC COMPARISON

TPS vs. Other Key LLM Performance Metrics

A comparison of Tokens per Second (TPS) with other critical metrics used to monitor and evaluate the performance, efficiency, and user experience of LLM inference systems.

| Metric | Primary Focus | Key Driver | User Experience Impact | Optimization Target |
| --- | --- | --- | --- | --- |
| Tokens per Second (TPS) | System Throughput | Hardware FLOPS, Batch Size, Decode Efficiency | High throughput enables faster completion for batch jobs. | GPU Utilization, Continuous Batching, Kernel Optimization |
| Time to First Token (TTFT) | Response Responsiveness | Prefill Computation, Model Size, Request Queueing | Perceived startup delay for a new request. | Prefill Optimization, Model Quantization |
| Inter-Token Latency | Streaming Fluency | Autoregressive Decode Cost, Memory Bandwidth | Perceived speed of text generation in a stream. | KV Cache Optimization, Attention Kernels, Memory I/O, Speculative Decoding |
| Latency Percentiles (P90, P99) | Tail Latency & Consistency | System Noise, Garbage Collection, Multi-tenancy Interference | Worst-case experience for a fraction of users. | Resource Isolation, Load Shedding, Request Prioritization |
| Model Accuracy (e.g., Perplexity) | Output Quality | Model Architecture, Training Data, Fine-Tuning | Correctness and usefulness of the generated content. | Model Selection, Fine-Tuning, RAG Augmentation |
| Error Rate | System Reliability | Infrastructure Failures, Model Loading, Input Validation | Frequency of failed requests or erroneous outputs. | Health Checks, Graceful Degradation, Retry Logic |
| Cost per Token | Economic Efficiency | Hardware Cost, Cloud Pricing, Model Efficiency | Directly impacts operational budget and scalability. | Instance Selection, Spot Usage, Model Compression |
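Two of the metrics compared above reduce to simple arithmetic once TPS and latency samples are available. The sketch below shows one way to compute cost per million output tokens from a sustained TPS figure, and a nearest-rank latency percentile. The hourly price and latency samples are made-up inputs, and nearest-rank is only one of several percentile conventions.

```python
from typing import List


def cost_per_million_tokens(hourly_cost_usd: float, sustained_tps: float) -> float:
    """Cost of generating one million output tokens at a sustained throughput."""
    tokens_per_hour = sustained_tps * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


def percentile_nearest_rank(samples: List[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples or not 1 <= p <= 100:
        raise ValueError("need samples and 1 <= p <= 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[rank - 1]


# Hypothetical inputs: a $4.00/hour instance sustaining 1,000 tokens per second.
print(round(cost_per_million_tokens(4.00, 1000.0), 4))

# Hypothetical request latencies in milliseconds.
latencies_ms = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(percentile_nearest_rank(latencies_ms, 90), percentile_nearest_rank(latencies_ms, 99))
```

Because cost per token is inversely proportional to sustained TPS, doubling throughput on the same instance halves the cost, provided the latency percentiles still meet their targets.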

INFERENCE OPTIMIZATION

Common Techniques for Optimizing TPS

Achieving high Tokens per Second requires a multi-faceted approach targeting computational efficiency, memory bandwidth, and system architecture. These core techniques are fundamental to reducing inference cost and latency.

01

Continuous Batching

Continuous batching dynamically groups incoming inference requests into a single computational batch, adding new requests as others finish generation. This maximizes GPU utilization by eliminating idle time between static batches.

  • Key Benefit: Dramatically improves hardware throughput compared to static batching.
  • Implementation: Used in serving frameworks like vLLM and TGI (Text Generation Inference).
  • Impact: Can increase effective TPS by 5-10x for workloads with variable request lengths.
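The difference between the two batching strategies can be seen in a small scheduling simulation. The sketch below is a toy model, not the scheduler of vLLM or TGI: it assumes every decode step costs one time unit regardless of how many slots are occupied, and that each request is described only by its output length in tokens.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Requests run in fixed groups; each group lasts as long as its longest member."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    active: List[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1  # one decode step emits one token per active request
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


# Hypothetical workload: mostly short outputs with a few long ones.
lengths = [10, 2, 2, 2, 10, 2, 2, 2]
total_tokens = sum(lengths)
static_steps = static_batching_steps(lengths, slots=4)          # 20
continuous_steps = continuous_batching_steps(lengths, slots=4)  # 12
print(total_tokens / static_steps, total_tokens / continuous_steps)
```

In this workload the short requests finish early, and continuous batching reuses their slots immediately instead of waiting for the 10-token request, so the same 32 tokens complete in 12 steps rather than 20. The size of the gain depends on how much request lengths vary.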
02

KV Cache Optimization

The Key-Value (KV) Cache stores computed attention key and value vectors for previously generated tokens, preventing redundant computation during autoregressive decoding. Optimizing its management is critical for TPS.

  • Memory Bottleneck: KV cache size grows linearly with batch size and sequence length, often becoming the limiting factor.
  • Techniques: Include paged attention (vLLM) to eliminate memory fragmentation and quantized KV caches (e.g., FP8) to reduce memory bandwidth pressure.
  • Result: Efficient KV cache management directly reduces decode latency, boosting TPS.
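The linear growth of the KV cache follows from a simple size formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below evaluates that formula for a hypothetical model configuration; the layer count, head counts, and head dimension are illustrative values, not those of a specific released model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache size: 2 (K and V) x layers x KV heads x head dim x tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


GIB = 1024 ** 3

# Hypothetical config: 32 layers, head_dim 128, FP16 cache, 4096 tokens, batch of 8.
full_mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                          seq_len=4096, batch_size=8)
grouped = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                         seq_len=4096, batch_size=8)

print(full_mha / GIB)  # 16.0 GiB with 32 KV heads
print(grouped / GIB)   # 4.0 GiB with 8 shared KV heads
```

Doubling either the batch size or the sequence length doubles the result, which is why the cache often limits batch size before compute does. Storing the cache at one byte per element (for example FP8) would halve these figures.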
03

Model Quantization

Quantization reduces the numerical precision of a model's weights and activations (e.g., from 16-bit to 8-bit or 4-bit). This decreases memory footprint and increases compute efficiency.

  • Types: Post-Training Quantization (PTQ) applies compression after training; Quantization-Aware Training (QAT) fine-tunes the model to compensate for precision loss.
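  • Methods and Libraries: GPTQ and AWQ are common post-training quantization methods; bitsandbytes is a widely used quantization library.
  • TPS Gain: Can enable 2-4x higher batch sizes on the same hardware, which raises TPS as long as the larger batches keep the accelerator saturated; the gain is not always proportional.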
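To show what reducing precision means in practice, the sketch below applies symmetric absolute-maximum INT8 quantization to a short list of weights. This is a deliberately simple scheme for illustration; it is not the algorithm used by GPTQ, AWQ, or bitsandbytes, which use more elaborate calibration and grouping.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]  # hypothetical weight values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

print(codes)
print(max(abs(w - r) for w, r in zip(weights, restored)))  # rounding error
```

Each weight now occupies one byte instead of two or four, at the cost of a rounding error of at most half the scale factor per value. That saving is what frees memory for larger batches or longer sequences.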
04

Flash Attention & Kernel Fusion

Flash Attention is an optimized algorithm that computes exact attention with significantly fewer memory reads/writes (I/O operations) by fusing computation kernels. This reduces the time spent on the attention mechanism, which is often the computational bottleneck.

  • Principle: Leverages hardware memory hierarchy (SRAM vs. HBM) for efficient data movement.
  • Impact: Reduces prefill latency (improving TTFT) and increases overall compute throughput, contributing to higher TPS, especially for long contexts.
  • Extension: FlashAttention-2 and related kernel fusions for other operations (e.g., MLP layers) provide further gains.
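The core idea that lets attention be computed block by block without materializing the full score vector is a running (online) softmax. The sketch below demonstrates that idea for a single query in plain Python and checks it against the direct computation. It is not the FlashAttention kernel itself: it omits the 1/sqrt(d) scaling, masking, and all GPU memory management, and the function names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_full(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_tiled(q: Vector, keys: List[Vector], values: List[Vector],
                    block: int = 2) -> Vector:
    """Process keys/values in blocks, keeping only a running max, sum, and output."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (exp(-inf) is 0.0).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.1, -0.3]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_full(q, keys, values))
print(attention_tiled(q, keys, values, block=2))
```

Both functions produce the same output up to floating-point rounding, but the tiled version only ever holds one block of scores at a time. On a GPU, that property is what allows each block to stay in fast on-chip memory instead of being written to and re-read from HBM.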
05

Speculative Decoding

Speculative decoding uses a small, fast draft model to predict a sequence of several future tokens. These are then verified in a single forward pass by the larger target model, rejecting mismatches.

  • Mechanism: Amortizes the cost of the large model's forward pass over multiple tokens.
  • Requirement: The draft model must have high predictive accuracy (e.g., a distilled version of the target).
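  • Outcome: Can achieve a 2-3x latency reduction for the same model, effectively doubling or tripling per-request TPS in supported scenarios. Gains are largest at small batch sizes, where the target model is memory-bound and has spare compute for verification.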
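How much a draft model helps can be estimated with a standard simplification: assume each drafted token is accepted independently with the same probability. The sketch below computes the expected number of tokens produced per target-model pass and a rough speedup under that assumption. The acceptance rate, draft length, and relative draft cost are hypothetical inputs that would need to be measured for a real model pair.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target forward pass, assuming i.i.d. acceptance.

    Equals 1 + a + a^2 + ... + a^k, i.e. (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= acceptance_rate <= 1.0 or draft_len < 0:
        raise ValueError("acceptance_rate must be in [0, 1] and draft_len >= 0")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


def estimated_speedup(acceptance_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Speedup over plain decoding; draft_cost_ratio is draft step cost / target step cost."""
    cost_per_pass = draft_len * draft_cost_ratio + 1.0
    return expected_tokens_per_pass(acceptance_rate, draft_len) / cost_per_pass


# Hypothetical inputs: 80% acceptance, 4 drafted tokens, draft costs 5% of the target.
print(round(expected_tokens_per_pass(0.8, 4), 4))
print(round(estimated_speedup(0.8, 4, 0.05), 4))
```

With these inputs the target model emits a little over three tokens per pass on average. A low acceptance rate or an expensive draft model shrinks the estimate quickly, which is why draft quality matters.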
06

Hardware-Specific Optimizations

Maximizing TPS requires tailoring the inference stack to the underlying accelerator architecture (GPU, NPU, etc.).

  • Kernel Optimization: Writing custom CUDA/HIP kernels or using compiler frameworks like OpenAI Triton for optimal hardware utilization.
  • Model Compilation: Using NVIDIA TensorRT or OpenXLA to compile the model graph into an optimized executable with fused kernels.
  • Memory Allocation: Employing efficient, pre-allocated memory pools to reduce overhead from dynamic allocation during inference.
  • Result: These low-level optimizations provide the final performance layer, often yielding 20-50% TPS improvements over generic implementations.
LLM PERFORMANCE MONITORING

Frequently Asked Questions

Essential questions and answers about Tokens per Second (TPS), a critical throughput metric for evaluating and optimizing the performance of large language model inference systems.

What is Tokens per Second (TPS) and how is it measured?

Tokens per Second (TPS) is a throughput metric that quantifies the number of output tokens a large language model (LLM) inference system can generate in one second. It is measured by benchmarking the system under a specific load, typically reporting either a peak rate (maximum achievable under ideal conditions) or a sustained rate (average over a prolonged period). Measurement requires a controlled environment with defined parameters: hardware (e.g., GPU type and count), software stack (inference server, kernels), batch size, sequence lengths, and the specific model being served. The metric is calculated as (Total Output Tokens Generated) / (Total Wall-Clock Time) for a given benchmark run.
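The calculation described above can be sketched as a small helper that aggregates per-request records from a benchmark run. The `RequestRecord` type and the sample values are hypothetical; a real harness would collect these timestamps from the load generator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestRecord:
    start_s: float      # when the request was sent
    end_s: float        # when its last token arrived
    output_tokens: int  # tokens generated for this request


def benchmark_tps(records: List[RequestRecord]) -> float:
    """Total output tokens divided by the wall-clock span of the whole run."""
    if not records:
        raise ValueError("no records")
    wall_clock = max(r.end_s for r in records) - min(r.start_s for r in records)
    if wall_clock <= 0:
        raise ValueError("run has no duration")
    return sum(r.output_tokens for r in records) / wall_clock


# Hypothetical run: three overlapping requests over a 10-second window.
run = [
    RequestRecord(start_s=0.0, end_s=4.0, output_tokens=200),
    RequestRecord(start_s=1.0, end_s=5.0, output_tokens=300),
    RequestRecord(start_s=2.0, end_s=10.0, output_tokens=500),
]
print(benchmark_tps(run))  # 100.0
```

Dividing by the span of the whole run, rather than summing per-request durations, is what makes this an aggregate system figure: overlapping requests are counted once in wall-clock time.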

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.