Prometheus is an open-source systems monitoring and alerting toolkit that collects and stores time-series metrics using a pull model over HTTP. It is designed for reliability: each Prometheus server runs standalone on local storage, with no reliance on distributed storage, and it offers a multi-dimensional data model and a flexible query language called PromQL. For LLM operations, it is commonly the primary system for scraping metrics from model servers, inference engines, and auxiliary services to track latency, throughput, error rates, and resource utilization.
Prometheus

What is Prometheus?
Prometheus is the de facto open-source standard for monitoring and alerting on time-series metrics, forming the core telemetry backbone for modern LLM serving infrastructure.
In an LLM serving stack, the Prometheus server scrapes metrics endpoints exposed by components like vLLM or Triton Inference Server, either natively or through exporters. Engineers use PromQL to query this data, create alerts for SLO violations, and visualize trends in tools like Grafana. Its pull-based architecture, where the monitoring server scrapes targets, contrasts with push-based systems and is complemented by the Pushgateway for short-lived jobs that may exit before they can be scraped. This makes Prometheus well suited to observing continuous batching efficiency, KV cache usage, and token generation performance in production.
Core Components & Architecture
Prometheus is an open-source systems monitoring and alerting toolkit that uses a pull model over HTTP to collect time-series metrics, widely used for monitoring the health and performance of LLM serving infrastructure and applications.
Pull-Based Metric Collection
Prometheus operates on a pull model, where its server actively scrapes metrics from configured HTTP endpoints exposed by applications. This contrasts with a push model, in which applications send their metrics to a central collector. For LLM services, this involves instrumenting the application code or using exporters to expose key metrics like:
- Request latency (TTFT, inter-token latency)
- Throughput (tokens per second)
- Error rates and HTTP status codes
- GPU utilization and memory usage
The server then stores these scraped metrics as time-series data in its local database.
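What a scrape target returns can be illustrated with a minimal sketch that renders samples in the Prometheus text exposition format. It uses only the standard library; `render_metrics`, the sample values, and the metric `llm_gpu_memory_used_bytes` are hypothetical names chosen for illustration, and label-value escaping is deliberately omitted. A real service would normally rely on an official client library rather than hand-rendering.

```python
def render_metrics(samples):
    """Render samples as Prometheus text exposition lines.

    `samples` is a list of (name, metric_type, help_text, labels, value)
    tuples. Samples sharing a name are assumed to be adjacent in the list.
    """
    lines = []
    seen = set()
    for name, metric_type, help_text, labels, value in samples:
        if name not in seen:
            # HELP and TYPE metadata are emitted once per metric name.
            lines.append(f"# HELP {name} {help_text}")
            lines.append(f"# TYPE {name} {metric_type}")
            seen.add(name)
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical samples for an LLM inference service.
SAMPLES = [
    ("llm_requests_total", "counter", "Total LLM requests.",
     {"model": "llama-3-70b"}, 1027),
    ("llm_gpu_memory_used_bytes", "gauge", "GPU memory in use.", {}, 42000000000),
]

print(render_metrics(SAMPLES))
```

The Prometheus server would fetch text like this from the target over HTTP at each scrape interval and append the values to the matching time series.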
Multi-Dimensional Data Model
Metrics in Prometheus are identified by a metric name and a set of key-value pairs called labels. This multi-dimensional data model allows for powerful filtering and aggregation. For LLM monitoring, labels can segment traffic by:
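- Model version (model="gpt-4")
- Deployment endpoint (endpoint="/v1/completions")
- User or tenant ID for multi-tenancy (keep the set of values bounded, since every distinct label value creates a separate time series)
- Request characteristics (mode="streaming")
A sample metric for LLM latency might be: llm_request_duration_seconds{model="llama-3-70b", endpoint="/chat", quantile="0.99"}. The quantile label is how a summary-type metric exposes precomputed percentiles.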
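The idea that a series is identified by a name plus its labels can be sketched in a few lines. This is an illustrative model only, not how Prometheus stores data internally; `select`, `matches`, and the `SERIES` values are hypothetical, and only equality matchers are handled.

```python
def matches(labels, matchers):
    """Return True if every equality matcher is satisfied by the labels."""
    return all(labels.get(key) == value for key, value in matchers.items())


def select(series, name, **matchers):
    """Pick the series with the given metric name and matching labels."""
    return [s for s in series
            if s["name"] == name and matches(s["labels"], matchers)]


# Hypothetical series; each distinct label combination is its own series.
SERIES = [
    {"name": "llm_request_duration_seconds",
     "labels": {"model": "llama-3-70b", "endpoint": "/chat"}, "value": 1.8},
    {"name": "llm_request_duration_seconds",
     "labels": {"model": "llama-3-70b", "endpoint": "/v1/completions"}, "value": 2.4},
    {"name": "llm_request_duration_seconds",
     "labels": {"model": "gpt-4", "endpoint": "/chat"}, "value": 0.9},
]

chat_llama = select(SERIES, "llm_request_duration_seconds",
                    model="llama-3-70b", endpoint="/chat")
print(chat_llama)
```

Selecting with fewer matchers returns more series, which is what makes aggregation across a label (for example, all endpoints of one model) possible.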
PromQL Query Language
PromQL is Prometheus's functional query language used to select and aggregate time-series data in real-time. It is essential for creating dashboards, alerts, and ad-hoc analysis. Key operations for LLM monitoring include:
- Rate calculation: rate(llm_requests_total[5m])
- Aggregation across labels: sum by (model) (rate(llm_tokens_generated_total[5m]))
- Percentile analysis: histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m]))
- Comparative operators: For alerting when error rates spike.
PromQL enables SREs to calculate Service Level Indicators (SLIs) like error rate and latency percentiles directly from raw metrics.
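To make the rate calculation concrete, here is a simplified sketch of a per-second rate over counter samples that tolerates counter resets. It is an approximation under stated assumptions: the real rate() function also extrapolates to the window boundaries, which this version skips, and `simple_rate` and the sample data are hypothetical.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    `samples` is a time-ordered list of (timestamp_seconds, value) pairs
    taken from one window. Returns None if a rate cannot be computed.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter restarted, so count the new value from zero.
        increase += value if value < previous else value - previous
        previous = value
    return increase / elapsed


# Hypothetical scrapes every 60s; the process restarts before t=180.
WINDOW = [(0, 100), (60, 160), (120, 220), (180, 30), (240, 90)]
print(simple_rate(WINDOW))  # 210 requests over 240 seconds
```

Handling resets is the reason rates are computed from counters rather than by differencing gauges: a restart does not appear as a large negative spike.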
Alertmanager for Notification Routing
The Alertmanager service handles alerts sent by the Prometheus server. It is responsible for deduplication, grouping, silencing, inhibition, and routing of alerts to various receivers like email, Slack, or PagerDuty. For LLM SLOs, Alertmanager can be configured to:
- Group all alerts related to a specific model deployment.
- Throttle notifications to prevent alert fatigue during ongoing incidents.
- Route critical alerts (e.g., SLO breach, P99 latency spike) to an on-call engineer while sending informational alerts (e.g., elevated token rate) to a chat channel.
It works in conjunction with Prometheus alerting rules, which are written as PromQL expressions and evaluated by the Prometheus server before firing alerts are sent to Alertmanager.
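The grouping and routing behaviour described above can be sketched as plain functions. This is a conceptual illustration, not Alertmanager's implementation: real deployments express the same idea in the Alertmanager configuration (for example with a route's group_by setting), and the receiver names and alert payloads below are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts that share the same values for the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def pick_receiver(labels):
    """Route critical alerts to on-call, everything else to chat."""
    if labels.get("severity") == "critical":
        return "pagerduty-oncall"   # hypothetical receiver name
    return "slack-llm-ops"          # hypothetical receiver name


ALERTS = [
    {"labels": {"alertname": "P99LatencyHigh", "model": "llama-3-70b",
                "severity": "critical"}},
    {"labels": {"alertname": "ErrorRateHigh", "model": "llama-3-70b",
                "severity": "critical"}},
    {"labels": {"alertname": "TokenRateElevated", "model": "gpt-4",
                "severity": "info"}},
]

for key, members in group_alerts(ALERTS, ["model"]).items():
    print(key, [a["labels"]["alertname"] for a in members])
```

Grouping by model means an incident affecting one deployment produces a single notification listing both alerts instead of two separate pages.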
Exporters and Service Discovery
Prometheus uses exporters to collect metrics from systems that do not natively expose a Prometheus format. Key exporters for LLM infrastructure include:
- Node Exporter for host-level metrics (CPU, memory, disk).
- NVIDIA DCGM Exporter (often deployed through the NVIDIA GPU Operator) for detailed GPU metrics.
- cAdvisor for container resource usage.
Service discovery automates the discovery of scrape targets in dynamic environments like Kubernetes. Prometheus can query the Kubernetes API to automatically find all running LLM inference pods and begin scraping their metrics, so targets do not have to be updated by hand as pods scale.
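The target-selection step of service discovery can be sketched as a filter over discovered pods. In Prometheus itself this is configured with kubernetes_sd_configs and relabeling rather than written as code; the pod dictionaries below are a hypothetical shape, the prometheus.io/scrape and prometheus.io/port annotations are a common convention rather than built-in behaviour, and the default port is an assumption.

```python
def scrape_targets(pods, default_port="8000"):
    """Return host:port targets for running pods that opt in to scraping."""
    targets = []
    for pod in pods:
        annotations = pod.get("annotations", {})
        if pod.get("phase") != "Running":
            continue  # skip pending or terminated pods
        if annotations.get("prometheus.io/scrape") != "true":
            continue  # pod has not opted in
        port = annotations.get("prometheus.io/port", default_port)
        targets.append(f'{pod["ip"]}:{port}')
    return targets


# Hypothetical discovery result for an inference deployment.
PODS = [
    {"ip": "10.0.0.11", "phase": "Running",
     "annotations": {"prometheus.io/scrape": "true", "prometheus.io/port": "9400"}},
    {"ip": "10.0.0.12", "phase": "Pending",
     "annotations": {"prometheus.io/scrape": "true"}},
    {"ip": "10.0.0.13", "phase": "Running", "annotations": {}},
    {"ip": "10.0.0.14", "phase": "Running",
     "annotations": {"prometheus.io/scrape": "true"}},
]

print(scrape_targets(PODS))
```

Because the list is recomputed from the current pod set, new replicas are picked up and removed ones disappear without editing a static target list.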
Integration with Grafana & OpenTelemetry
Prometheus is a core data source for Grafana, which provides visualization dashboards for LLM metrics. Dashboards display real-time charts for:
- Latency percentiles (P50, P90, P99)
- Request rate and error budget burn
- GPU utilization and token throughput
While Prometheus excels at metrics, it is often used alongside OpenTelemetry (OTel) for a complete observability picture. OTel handles distributed tracing for request flows across microservices (e.g., from API gateway to LLM to post-processing). Metrics from OTel can be exported to Prometheus, and traces can be correlated with metric alerts for faster root cause analysis (RCA).
How Prometheus is Used for LLM & AI Monitoring
Prometheus is the de facto open-source standard for collecting and alerting on time-series metrics, providing the foundational telemetry layer for monitoring the health, performance, and cost of LLM serving infrastructure.
Prometheus implements a pull-based model, scraping metrics from instrumented LLM endpoints like inference servers, load balancers, and GPU nodes over HTTP. It stores this data as time-series identified by metric name and key-value labels (e.g., model_version, endpoint), enabling precise querying for metrics such as Tokens per Second, request latency, and error rates. This data forms the core dataset for performance dashboards and Service Level Objective compliance tracking.
For LLM-specific observability, Prometheus scrapes the metrics that model-serving frameworks such as vLLM or TGI expose in its format, and custom exporters or application instrumentation cover framework metrics and application-level events that are not exposed natively. Coupled with Grafana for visualization and Alertmanager for notifications, it creates a complete monitoring stack that alerts on anomalies, output drift, or infrastructure degradation, so engineers can maintain reliable, performant LLM services.
Common LLM & AI Metrics Tracked with Prometheus
A comparison of key operational, performance, and quality metrics for LLM services, detailing their purpose, typical measurement, and common alerting thresholds.
| Metric Name & Purpose | Prometheus Metric Type | Typical Measurement / Formula | Common Alerting Threshold (Example) |
|---|---|---|---|
| Request Rate (Throughput) | Counter | rate(llm_requests_total[5m]) | Alert on sudden drop > 50% |
| Request Duration (Latency) | Histogram | histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m])) | P99 > 10s |
| Token Generation Rate | Counter | rate(llm_tokens_generated_total[2m]) | Tokens/sec < 10 |
| Error Rate | Counter | rate(llm_request_errors_total[5m]) / rate(llm_requests_total[5m]) | Error ratio > 0.01 |
| Model Cache Hit Ratio | Gauge | llm_cache_hits / (llm_cache_hits + llm_cache_misses) | Ratio < 0.85 |
| GPU Memory Utilization | Gauge | DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) | Utilization > 90% |
| GPU Utilization | Gauge | DCGM_FI_DEV_GPU_UTIL | Utilization > 95% for 5m |
| Queue Length / Wait Time | Gauge | llm_request_queue_length | Queue length > 100 |
| Output Token Count (per request) | Histogram | llm_output_tokens_count | Count > 4096 (context window limit) |
| Input Token Count (per request) | Histogram | llm_input_tokens_count | Count > 8192 (max supported) |
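The latency row relies on histogram_quantile, which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below follows that idea under simplifying assumptions: buckets are given as (upper_bound, cumulative_count) pairs ending with a +Inf bucket, the lowest bucket is assumed to start at zero, and the function name and bucket data are illustrative rather than taken from Prometheus source.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets in seconds.
BUCKETS = [(0.5, 50), (1.0, 80), (2.5, 95), (5.0, 99), (math.inf, 100)]
print(histogram_quantile(0.5, BUCKETS))
print(histogram_quantile(0.9, BUCKETS))
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries: a P99 that falls in a wide bucket can only be located to within that bucket's range.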
Frequently Asked Questions
Prometheus is the de facto standard for monitoring cloud-native applications and infrastructure. For LLM operations, it provides the foundational telemetry layer for tracking model performance, infrastructure health, and business metrics.
Prometheus is an open-source systems monitoring and alerting toolkit that collects and stores time-series metrics using a multi-dimensional data model and a powerful query language (PromQL). It operates on a pull model, where the Prometheus server scrapes HTTP endpoints (conventionally served at the /metrics path) exposed by monitored targets at configured intervals. Collected metrics are stored locally in a custom, efficient time-series database. Its core architectural components include the main Prometheus server for scraping and storage, service discovery for dynamically finding targets, the Alertmanager for handling alerts, and various exporters for bridging third-party systems.
For LLM serving, Prometheus scrapes metrics from the inference server (e.g., vLLM, TGI), the application backend, and the underlying infrastructure (GPU, memory, network). This provides a unified view of Tokens per Second (TPS), latency percentiles (P99), error rates, and GPU utilization.
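Once metrics are stored, they can be read back through the Prometheus HTTP API's instant-query endpoint, /api/v1/query. The sketch below builds such a request URL and parses the JSON shape the endpoint returns for a vector result; the server address and the canned response body are assumptions for illustration, and no network call is made.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_instant_vector(body):
    """Return (labels, value) pairs from an instant-vector response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1]))
            for item in payload["data"]["result"]]


# Hypothetical server address and a canned response for illustration.
URL = build_query_url("http://prometheus.example:9090/",
                      "sum by (model) (rate(llm_tokens_generated_total[5m]))")
CANNED = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"model": "llama-3-70b"}, "value": [1700000000, "412.5"]},
    ]},
})

print(URL)
print(parse_instant_vector(CANNED))
```

Sample values arrive as strings in the response, so they are converted to floats before use.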
Related Terms
Prometheus operates within a broader ecosystem of observability tools and methodologies essential for monitoring LLM performance. The following terms represent key concepts and complementary technologies used to build a complete monitoring stack.
Distributed Tracing
A method of profiling requests as they flow through a distributed LLM application stack by recording timing and metadata for individual operations (spans) across service boundaries. While Prometheus aggregates metrics, distributed tracing provides a detailed, request-level view. It is critical for diagnosing latency issues in complex pipelines involving multiple microservices, databases, and external API calls. Implemented using standards like OpenTelemetry Trace, it helps answer questions like:
- Which service or model stage is causing high P99 latency?
- What is the breakdown of time spent in prefill vs. decode phases?
- How does a retrieval-augmented generation (RAG) pipeline spend its time between retrieval and generation?
Service Level Objective (SLO)
A target value or range for a Service Level Indicator that defines the acceptable performance and reliability of an LLM-powered service. SLOs are central to site reliability engineering (SRE); unlike an SLA, an SLO is an internal target rather than a contract. Prometheus metrics are used to measure SLI compliance and calculate the error budget, which is the allowable amount of unreliability. For LLM services, common SLOs are defined for:
- Latency: "99% of requests must have a Time to First Token < 500ms."
- Availability: "The model API must be available 99.95% of the time."
- Quality: "Less than 1% of responses may contain a detected hallucination."
Violating an SLO consumes the error budget, which guides the pace and risk of new deployments.
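The error-budget arithmetic can be shown with a short sketch. The function names and the request counts are hypothetical; in practice the totals would come from PromQL queries over counters such as a request total and an error total.

```python
def error_budget(slo_target, total_requests, bad_requests):
    """Summarize error-budget usage for a request-based SLO."""
    allowed_bad = (1 - slo_target) * total_requests
    return {
        "allowed_bad": allowed_bad,
        "remaining": allowed_bad - bad_requests,
        "consumed_fraction": bad_requests / allowed_bad if allowed_bad else float("inf"),
    }


def burn_rate(observed_error_ratio, slo_target):
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_ratio / (1 - slo_target)


# Hypothetical month: 99% SLO, one million requests, 2,500 failures.
budget = error_budget(0.99, 1_000_000, 2_500)
print(budget)
print(burn_rate(0.02, 0.99))
```

With a 99% target, roughly 10,000 of the million requests may fail, so 2,500 failures consume about a quarter of the budget. A burn rate near 2 means the budget would be exhausted in about half the SLO period if the current error ratio persisted.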
Canary & Shadow Deployment
Release strategies for safely deploying new LLM models or application versions by exposing them to a controlled subset of traffic. These strategies rely heavily on Prometheus for comparative metric analysis.
- Canary Deployment: The new version serves a small percentage of live traffic (e.g., 5%). Prometheus metrics for the canary cohort (latency, error rate, TPS) are compared against the baseline to validate performance before a full rollout.
- Shadow Deployment: The new version processes all live requests in parallel, but its outputs are discarded. This allows for full-scale performance and correctness testing (e.g., comparing output embeddings for drift) with zero user risk. Prometheus monitors resource consumption of the shadow model.
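The comparative analysis behind a canary decision can be sketched as a simple gate. The thresholds, dictionary keys, and metric values here are assumptions for illustration; real rollouts typically read both cohorts from Prometheus using a label that separates canary from baseline traffic.

```python
def canary_verdict(baseline, canary, max_error_delta=0.005, max_latency_ratio=1.2):
    """Compare canary metrics against baseline and decide promote or rollback."""
    reasons = []
    if canary["error_rate"] - baseline["error_rate"] > max_error_delta:
        reasons.append("error rate regression")
    if canary["p99_latency_s"] > baseline["p99_latency_s"] * max_latency_ratio:
        reasons.append("P99 latency regression")
    return ("rollback" if reasons else "promote", reasons)


# Hypothetical cohort metrics gathered over the same time window.
BASELINE = {"error_rate": 0.004, "p99_latency_s": 2.0}
CANARY = {"error_rate": 0.006, "p99_latency_s": 2.9}

print(canary_verdict(BASELINE, CANARY))
```

In this example the error rate stays within tolerance, but the canary's P99 latency exceeds 1.2 times the baseline, so the gate recommends a rollback.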
Statistical Process Control (SPC)
A method of quality control using statistical methods, like control charts, to monitor and control a process. In LLM operations, SPC is applied to Prometheus metrics to distinguish normal variance from significant anomalies that require intervention. Key concepts include:
- Control Limits: Statistically derived bounds (e.g., 3-sigma) within which metric variation is considered common-cause.
- Mean Time to Recovery (MTTR): An incident-response metric often tracked alongside SPC, measuring the average time to restore service after a control limit breach.
- Root Cause Analysis (RCA): The systematic process triggered by an SPC alert to find the fundamental cause of a metric anomaly, such as a model regression, infrastructure failure, or data drift.
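Control limits can be derived from a baseline window with a few lines of standard-library code. This is a minimal sketch assuming the metric is roughly stable during the baseline period; the function names and the throughput values are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) control limits from a baseline sample."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(values, limits):
    """Return the values that fall outside the control limits."""
    lower, upper = limits
    return [v for v in values if v < lower or v > upper]


# Hypothetical tokens-per-second readings from a stable period.
BASELINE_TPS = [100, 102, 98, 101, 99, 100, 103, 97]
LIMITS = control_limits(BASELINE_TPS)

print(LIMITS)
print(out_of_control([101, 95, 110], LIMITS))
```

Here the baseline mean is 100 with a sample standard deviation of 2, giving limits of 94 and 106, so only the reading of 110 is flagged as special-cause variation.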

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.