Inferensys

Tokens Per Second (TPS)

Tokens Per Second (TPS) is a throughput metric that quantifies the number of output tokens a language model or AI agent can generate per second, indicating raw inference speed and system capacity.
AGENT PERFORMANCE METRIC

What is Tokens Per Second (TPS)?

A core throughput metric for generative AI systems.

Tokens Per Second (TPS) is a throughput metric that quantifies the raw inference speed of a language model or AI agent by measuring the number of output tokens it can generate per second. It indicates how efficiently a model and its serving stack use the available compute, and it is a key factor in the end-to-end latency and perceived responsiveness of an AI application. For agentic observability, TPS is a fundamental telemetry signal used to monitor system health and detect performance bottlenecks in production.

In practical terms, TPS is influenced by hardware (e.g., GPU type), model architecture, and inference optimization techniques like continuous batching. While high TPS indicates fast raw generation, it must be analyzed alongside Time to First Token (TTFT) and tail latency to fully understand user experience. For agent performance benchmarking, TPS is critical for capacity planning, for estimating cost per thousand tokens, and for ensuring systems meet defined Service Level Objectives (SLOs) for throughput.
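The capacity-planning and cost arithmetic mentioned above can be sketched as follows. This is a minimal illustration, not a pricing model: the function names, the 30% headroom default, and the example figures are assumptions chosen for the sketch.

```python
import math


def cost_per_thousand_tokens(gpu_hourly_cost_usd: float, aggregate_tps: float) -> float:
    """USD cost to generate 1,000 output tokens at a sustained aggregate TPS."""
    if aggregate_tps <= 0:
        raise ValueError("aggregate_tps must be positive")
    tokens_per_hour = aggregate_tps * 3600
    return gpu_hourly_cost_usd / tokens_per_hour * 1000


def replicas_needed(target_tps: float, per_replica_tps: float, headroom: float = 0.3) -> int:
    """Replicas required to serve target_tps while keeping `headroom` spare capacity."""
    usable_tps = per_replica_tps * (1 - headroom)  # assumed safety margin
    return math.ceil(target_tps / usable_tps)


# Hypothetical figures: a $3.60/hour accelerator sustaining 1,000 tokens/s.
print(cost_per_thousand_tokens(3.6, 1000.0))
print(replicas_needed(target_tps=5000.0, per_replica_tps=1000.0))
```

The sustained aggregate TPS, not a peak figure, is the more appropriate input here, since cost accrues over the whole period the hardware is provisioned.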

SYSTEM ARCHITECTURE

Key Factors Influencing TPS

Tokens Per Second (TPS) is a critical throughput metric for AI inference. Its value is not intrinsic to a model but is determined by a complex interplay of hardware, software, and system design choices.

01

Model Architecture & Size

The fundamental design of the language model is a primary determinant of TPS.

  • Parameter Count: Larger models (e.g., 70B+ parameters) require more computations per token, reducing TPS compared to smaller models (e.g., 7B parameters) on identical hardware.
  • Architecture Family: Architectures like Mixture of Experts (MoE) can achieve higher TPS than dense models of equivalent parameter count by activating only a subset of weights per token.
  • Context Window Length: Processing very long sequences increases memory bandwidth pressure for the Key-Value (KV) Cache, which can bottleneck TPS.
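To make the KV cache pressure concrete, the cache size for a standard transformer can be estimated from its shape. The sketch below assumes one key and one value vector per layer, per KV head, per token; the model dimensions in the example are hypothetical and not taken from any specific model.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes for FP16/BF16
) -> int:
    """Approximate KV cache size in bytes for a decoder-only transformer."""
    # The leading factor of 2 covers keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical configuration: 32 layers, 8 KV heads, head_dim 128, 8,192-token context.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(size / 2**30, "GiB per sequence")
```

Because the cache grows linearly with both sequence length and batch size, long contexts reduce how many requests fit in memory at once, which in turn limits aggregate TPS.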
02

Hardware & Compute

The physical infrastructure executing the model defines the upper bound for TPS.

  • GPU/Accelerator Type: Modern AI accelerators (e.g., NVIDIA H100, Google TPU v5e) with dedicated matrix-multiplication hardware provide far higher TPS than general-purpose CPUs.
  • Memory Bandwidth: The speed at which model weights can be read from GPU VRAM (High Bandwidth Memory) is often the limiting factor for TPS, a phenomenon known as being memory-bound.
  • Batching Strategy: Continuous batching dynamically groups requests, maximizing GPU utilization and significantly improving aggregate TPS over static batching.
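The memory-bound regime described above suggests a rough back-of-the-envelope ceiling: each decode step has to read every weight once, so the step rate cannot exceed memory bandwidth divided by model size in bytes. The sketch below is a simplification under stated assumptions. It ignores KV cache reads, activation traffic, and compute limits, and the linear batch scaling only holds while the workload stays memory-bound. The bandwidth figure in the example is hypothetical.

```python
def memory_bound_decode_tps(
    param_count: float,
    bytes_per_param: float,
    mem_bandwidth_bytes_per_s: float,
    batch_size: int = 1,
) -> float:
    """Rough upper bound on decode TPS when weight reads dominate."""
    weight_bytes = param_count * bytes_per_param
    steps_per_second = mem_bandwidth_bytes_per_s / weight_bytes
    # Each step yields one token per sequence in the batch.
    return steps_per_second * batch_size


# Hypothetical: 7B parameters in FP16 on an accelerator with 2 TB/s of bandwidth.
print(round(memory_bound_decode_tps(7e9, 2, 2e12), 1), "tokens/s at batch size 1")
print(round(memory_bound_decode_tps(7e9, 2, 2e12, batch_size=16), 1), "tokens/s at batch size 16")
```

This also shows why batching raises aggregate TPS: the same weight read is shared across every sequence in the batch.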
03

Inference Optimization

Software-level optimizations are essential for achieving peak hardware performance.

  • Quantization: Reducing weight precision from 16-bit (FP16) to 8-bit (INT8) or 4-bit (e.g., NF4) cuts the bytes read per weight to one half or one quarter, which raises TPS when decoding is memory-bound. The accuracy impact is often small but depends on the model and the quantization method.
  • Kernel Optimization: Highly optimized low-level compute kernels (e.g., FlashAttention) reduce the memory traffic and overhead of attention mechanisms.
  • Compilers and Serving Engines: Frameworks like TensorRT-LLM and vLLM apply techniques such as graph optimization, kernel fusion, and efficient KV-cache memory management to maximize TPS.
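As an illustration of the quantization idea, the sketch below applies simple symmetric absmax quantization to 8-bit integers in plain Python. It is only one basic scheme, chosen for clarity; it is not NF4 and is not how any particular framework implements quantization. The function names are made up for this example.

```python
def quantize_absmax_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now occupies one byte instead of two, at the cost of a rounding error of at most half the scale per weight.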
04

System & Network Overhead

The surrounding serving infrastructure introduces latency that impacts effective TPS.

  • Pre/Post-Processing: Tokenization, detokenization, and output formatting add latency to every request, reducing overall system TPS.
  • Network Latency: In distributed systems, communication between orchestrators, model servers, and tokenizers adds delay.
  • Queuing & Scheduling: Under high load, request queuing and scheduler overhead can become the bottleneck, capping realized TPS below the hardware's theoretical maximum.
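The effect of these overheads on realized throughput can be approximated by adding them to the generation time. A minimal sketch, assuming all non-generation costs (queuing, network, pre- and post-processing) are lumped into a single `overhead_s` value, which is a simplification made for this example:

```python
def effective_tps(output_tokens: int, decode_tps: float, overhead_s: float) -> float:
    """Per-request TPS once serving overhead is added to generation time."""
    if decode_tps <= 0:
        raise ValueError("decode_tps must be positive")
    total_time_s = overhead_s + output_tokens / decode_tps
    return output_tokens / total_time_s


# 200 tokens generated at 100 tokens/s with 0.5 s of combined overhead.
print(effective_tps(output_tokens=200, decode_tps=100.0, overhead_s=0.5))
```

Short responses are hit hardest, because the overhead is amortized over fewer tokens.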
05

Decoding Strategy

The algorithm used to select each token affects how much computation every generation step requires.

  • Greedy Decoding: Selects the single most probable token after each forward pass. It is the fastest option (highest TPS) but can lead to repetitive or low-quality output.
  • Sampling (Top-k, Top-p): Introduces randomness by sampling from the probability distribution, improving output diversity at a negligible TPS cost compared with greedy decoding.
  • Beam Search: Tracks k candidate sequences in parallel (where k is the beam width), requiring roughly k times the computation per step and substantially reducing TPS.
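The difference between greedy decoding and top-k sampling lies only in how the next token is chosen from the model's output scores, which is why their TPS is close. The sketch below operates on a hypothetical list of logits for a four-token vocabulary. It is an illustration of the selection step alone, not a full decoding loop.

```python
import math
import random


def softmax(logits: list[float]) -> list[float]:
    """Convert logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits: list[float]) -> int:
    """Return the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_pick(logits: list[float], k: int, rng: random.Random) -> int:
    """Sample one token index from the k highest-scoring candidates."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5]  # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)
print(greedy_pick(logits))
print(top_k_pick(logits, k=2, rng=rng))
```

Both functions run once per forward pass. Beam search differs in that it would carry k partial sequences through every step, multiplying the per-step work.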
06

Related Metric: Time to First Token (TTFT)

While TPS measures steady-state throughput, Time to First Token (TTFT) measures the latency to start streaming. They are often in tension.

  • High TPS, High TTFT: Systems optimized for large batch processing may have high TPS but also high TTFT as they wait to fill a batch.
  • Low TPS, Low TTFT: Systems prioritizing responsiveness may process requests immediately (low TTFT) but sacrifice aggregate TPS due to poor GPU utilization.
  • Optimization Goal: The ideal system balances both, using techniques like continuous batching and separating the prefill phase from the decode phase to minimize TTFT while maximizing TPS.
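Both metrics can be derived from the same streamed response by recording when each token arrives. The sketch below assumes timestamps in seconds from a monotonic clock; the function name and the example timings are hypothetical.

```python
def stream_metrics(request_sent_s: float, token_times_s: list[float]) -> tuple[float, float, float]:
    """Return (ttft_s, decode_tps, end_to_end_s) for one streamed response."""
    if not token_times_s:
        raise ValueError("at least one token timestamp is required")
    ttft_s = token_times_s[0] - request_sent_s
    end_to_end_s = token_times_s[-1] - request_sent_s
    # Decode TPS covers the tokens after the first, excluding prefill time.
    decode_window_s = token_times_s[-1] - token_times_s[0]
    if decode_window_s > 0:
        decode_tps = (len(token_times_s) - 1) / decode_window_s
    else:
        decode_tps = float("nan")  # undefined for a single-token response
    return ttft_s, decode_tps, end_to_end_s


# Request sent at t=0; five tokens arrive every 0.25 s starting at t=0.5.
print(stream_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5]))
```

Separating the two values makes the tension visible: a scheduling change can raise decode TPS while also lengthening the wait before the first token.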
AGENT PERFORMANCE METRICS

TPS vs. Latency Metrics: A Comparison

A comparison of throughput and latency metrics used to evaluate the performance of AI agents and language models, highlighting their distinct roles in benchmarking.

| Metric | Tokens Per Second (TPS) | End-to-End Latency | Time to First Token (TTFT) |
| --- | --- | --- | --- |
| Primary Measurement | Throughput (output rate) | Total request duration | Initial response delay |
| Key Performance Indicator For | Inference server efficiency, hardware utilization | Overall user experience, task completion time | Perceived responsiveness, streaming applications |
| Typical Unit | Tokens/second | Milliseconds (ms) or seconds (s) | Milliseconds (ms) |
| Impacted By | Batch size, model architecture, GPU memory bandwidth | Network latency, model compute time, external API calls, queuing | Model prefill computation, context length, cold starts |
| Relationship to Concurrency | Often increases with higher batch sizes up to a saturation point | Generally increases with higher concurrency due to queuing | Less directly affected by concurrency than total latency |
| Optimization Target | Maximize throughput for cost-efficient batch processing | Minimize latency for interactive, real-time applications | Minimize delay to start of stream for conversational agents |
| Use in SLOs/SLIs | For cost and capacity planning (e.g., min TPS under load) | For user experience guarantees (e.g., P99 latency < 2s) | For streaming quality (e.g., TTFT < 500ms) |
| Directly Measures | Raw computational speed of token generation | Holistic system performance from request to final output | Time to begin delivering the output stream |
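The SLO examples in the comparison (P99 latency under 2 s, TTFT under 500 ms, a minimum TPS under load) could be checked against collected samples roughly as follows. This is a sketch under stated assumptions: it uses the nearest-rank percentile definition, applies P99 to TTFT as well, and the 1,000 tokens/s throughput floor is a hypothetical default.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_slos(
    latencies_s: list[float],
    ttfts_s: list[float],
    tps_samples: list[float],
    max_p99_latency_s: float = 2.0,
    max_p99_ttft_s: float = 0.5,
    min_tps: float = 1000.0,  # hypothetical throughput floor
) -> dict[str, bool]:
    """Report which of the three example SLOs are met."""
    return {
        "latency": percentile(latencies_s, 99) < max_p99_latency_s,
        "ttft": percentile(ttfts_s, 99) < max_p99_ttft_s,
        "throughput": min(tps_samples) >= min_tps,
    }


print(check_slos([1.0] * 99 + [3.0], [0.2] * 100, [1200.0, 1500.0]))
```

Evaluating the three together avoids declaring a system healthy on throughput alone while its tail latency is out of bounds.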

AGENT PERFORMANCE BENCHMARKING

Frequently Asked Questions

Essential questions and answers about Tokens Per Second (TPS), a core throughput metric for measuring the raw inference speed of language models and AI agents in production.

Tokens Per Second (TPS) is a throughput metric that quantifies the number of output tokens a language model or AI agent can generate per second, indicating its raw inference speed. It is measured by dividing the total number of tokens in a generated output sequence by the wall-clock time taken to produce that sequence, excluding initial prompt processing and network overhead. High-throughput inference engines use techniques like continuous batching to aggregate multiple requests, maximizing GPU utilization and TPS. This metric is distinct from end-user perceived latency (like Time to First Token) and is a critical benchmark for infrastructure cost and scalability.

Key Measurement Contexts:

  • Peak TPS: Maximum throughput under optimal, saturated load.
  • Sustained TPS: Average throughput over a prolonged period, accounting for system variability.
  • Measurement Point: TPS is typically measured server-side, on the inference hardware.
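One possible way to derive both figures from server-side telemetry is sketched below. It assumes the server exports a count of generated tokens for each one-second interval, treats sustained TPS as the mean over the whole period, and treats peak TPS as the highest average over any window of a chosen length. These definitions are assumptions made for the example, not a standard.

```python
def peak_and_sustained_tps(per_second_counts: list[int], window: int = 1) -> tuple[float, float]:
    """Return (peak_tps, sustained_tps) from per-second token counts."""
    n = len(per_second_counts)
    if n == 0 or window < 1 or window > n:
        raise ValueError("window must be between 1 and the number of samples")
    sustained = sum(per_second_counts) / n
    # Peak is the best average across all contiguous windows of `window` seconds.
    peak = max(
        sum(per_second_counts[i:i + window]) / window
        for i in range(n - window + 1)
    )
    return peak, sustained


counts = [100, 200, 300, 200, 100]  # hypothetical tokens generated in each second
print(peak_and_sustained_tps(counts))
print(peak_and_sustained_tps(counts, window=2))
```

A wider window smooths out momentary bursts, so the reported peak moves closer to the sustained value.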

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.