Glossary

Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit that collects time-series metrics via a pull model over HTTP, widely used for monitoring LLM serving infrastructure and application performance.

LLM PERFORMANCE MONITORING

What is Prometheus?

Prometheus is the de facto open-source standard for monitoring and alerting on time-series metrics, forming the core telemetry backbone for modern LLM serving infrastructure.

Prometheus is an open-source systems monitoring and alerting toolkit that collects and stores time-series metrics using a pull model over HTTP. It is designed for reliability: each Prometheus server is standalone, keeps its data on local storage, and does not depend on distributed storage. It features a multi-dimensional data model and a flexible query language called PromQL. For LLM operations, it is commonly the primary system for scraping metrics from model servers, inference engines, and auxiliary services to track latency, throughput, error rates, and resource utilization.

In an LLM serving stack, the Prometheus server scrapes metrics endpoints exposed by components like vLLM or Triton Inference Server, either natively or through exporters. Engineers use PromQL to query this data, create alerts for SLO violations, and visualize trends in tools like Grafana. Its pull-based architecture, where the monitoring server scrapes targets, contrasts with push-based systems and is complemented by the Pushgateway for short-lived jobs that may exit before they can be scraped. This makes Prometheus well suited to observing continuous batching efficiency, KV cache usage, and token generation performance in production.

PROMETHEUS

Core Components & Architecture

Prometheus is an open-source systems monitoring and alerting toolkit that uses a pull model over HTTP to collect time-series metrics, widely used for monitoring the health and performance of LLM serving infrastructure and applications.

Pull-Based Metric Collection

Prometheus operates on a pull model, where its server actively scrapes metrics from configured HTTP endpoints exposed by applications. This contrasts with a push model. For LLM services, this involves instrumenting the application code or using exporters to expose key metrics like:

Request latency (TTFT, inter-token latency)
Throughput (tokens per second)
Error rates and HTTP status codes
GPU utilization and memory usage

The server then stores these scraped metrics as time-series data in its local database.
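As a minimal sketch of what a scraped endpoint returns, the snippet below renders one metric family in the Prometheus text exposition format using only the standard library. The helper names (`escape_label_value`, `render_metric`) and the sample value are illustrative assumptions; in practice an official client library would normally generate this output.

```python
def escape_label_value(value: str) -> str:
    # The text format escapes backslash, double quote, and newline in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family as exposition text.

    samples is a list of (labels_dict, value) pairs. Hypothetical helper,
    not part of any Prometheus library.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Example payload a /metrics handler could return (the value is made up).
body = render_metric(
    "llm_requests_total",
    "counter",
    "Total LLM requests handled.",
    [({"model": "llama-3-70b", "endpoint": "/chat"}, 1027)],
)
print(body)
```

Serving `body` over HTTP at a path such as /metrics is all a target needs to do; Prometheus handles scheduling, scraping, and storage.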

Multi-Dimensional Data Model

Metrics in Prometheus are identified by a metric name and a set of key-value pairs called labels. This multi-dimensional data model allows for powerful filtering and aggregation. For LLM monitoring, labels can segment traffic by:

Model version (model="gpt-4")
Deployment endpoint (endpoint="/v1/completions")
Tenant ID for multi-tenancy (avoid per-user or per-request IDs, because every distinct label value creates a separate time series)
Request characteristics (mode="streaming")

A sample metric for LLM latency might be: llm_request_duration_seconds{model="llama-3-70b", endpoint="/chat", quantile="0.99"}.
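To illustrate how labels enable filtering and aggregation, here is a small sketch that models each series as a label set plus a value. The functions `select` and `sum_by` are hypothetical stand-ins for a PromQL label matcher and `sum by (...)`, and the sample values are invented. The metric name is stored under `__name__`, which is how Prometheus represents it internally.

```python
# Each series is (labels, current value); the values are made up for illustration.
series = [
    ({"__name__": "llm_requests_total", "model": "llama-3-70b", "mode": "streaming"}, 120.0),
    ({"__name__": "llm_requests_total", "model": "llama-3-70b", "mode": "batch"}, 30.0),
    ({"__name__": "llm_requests_total", "model": "llama-3-8b", "mode": "streaming"}, 50.0),
]


def select(all_series, **matchers):
    """Keep series whose labels equal every matcher, like {model="..."}."""
    return [
        (labels, value)
        for labels, value in all_series
        if all(labels.get(key) == wanted for key, wanted in matchers.items())
    ]


def sum_by(all_series, label):
    """Sum values per distinct value of one label, like sum by (label)."""
    totals = {}
    for labels, value in all_series:
        key = labels.get(label, "")
        totals[key] = totals.get(key, 0.0) + value
    return totals


streaming = select(series, mode="streaming")
per_model = sum_by(series, "model")
print(len(streaming), per_model)
```

Because every distinct label combination is its own series, the number of series grows multiplicatively with the number of label values, which is why unbounded values make poor labels.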

PromQL Query Language

PromQL is Prometheus's functional query language used to select and aggregate time-series data in real-time. It is essential for creating dashboards, alerts, and ad-hoc analysis. Key operations for LLM monitoring include:

Rate calculation: rate(llm_requests_total[5m])
Aggregation across labels: sum by (model) (rate(llm_tokens_generated_total[5m]))
Percentile analysis: histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m]))
Comparison operators: used in alerting rules to fire when a value such as the error rate crosses a threshold.

PromQL enables SREs to calculate Service Level Indicators (SLIs) like error rate and latency percentiles directly from raw metrics.
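The following sketch shows, in plain Python, roughly what `rate()` and `histogram_quantile()` compute. It is a simplification under stated assumptions: `simple_rate` handles counter resets but does not extrapolate to the window edges as the real `rate()` does, and `histogram_quantile` here assumes at least one finite bucket, buckets sorted by upper bound, and a final `+Inf` bucket. Sample data is invented.

```python
import math


def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, value), oldest first.
    A drop in value is treated as a counter reset (the counter restarted at 0).
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets [(upper_bound, count), ...]."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for index, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[index - 1][0]
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


latency_buckets = [(0.5, 50), (1.0, 90), (2.0, 99), (math.inf, 100)]
print(simple_rate([(0, 100), (60, 160), (120, 220)]))
print(histogram_quantile(0.95, latency_buckets))
```

The interpolation step is why quantiles from histograms are estimates: accuracy depends on how closely the bucket boundaries bracket the latencies of interest.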

Alertmanager for Notification Routing

The Alertmanager service handles alerts sent by the Prometheus server. It is responsible for deduplication, grouping, silencing, inhibition, and routing of alerts to various receivers like email, Slack, or PagerDuty. For LLM SLOs, Alertmanager can be configured to:

Group all alerts related to a specific model deployment.
Throttle notifications to prevent alert fatigue during ongoing incidents.
Route critical alerts (e.g., SLO breach, P99 latency spike) to an on-call engineer while sending informational alerts (e.g., elevated token rate) to a chat channel.

Alertmanager works in conjunction with alerting rules, which are written as PromQL expressions and evaluated by the Prometheus server.
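The grouping and routing behavior described above can be sketched as follows. This is a toy model, not Alertmanager's implementation: the receiver names, the `ROUTES` table, and the alert names are all hypothetical, and real routing trees also support nesting, timing parameters, silences, and inhibition.

```python
from collections import defaultdict

# Hypothetical routing table: the first route whose matchers all match wins.
ROUTES = [
    ({"severity": "critical"}, "pagerduty-oncall"),
    ({"severity": "info"}, "slack-llm-ops"),
]
DEFAULT_RECEIVER = "email-team"


def pick_receiver(labels):
    """Return the receiver for an alert based on its labels."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


def group_alerts(alerts, group_by=("alertname", "model")):
    """Bundle alerts that share a receiver and group_by label values.

    Each group would be delivered as a single notification.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = (pick_receiver(alert), tuple(alert.get(name, "") for name in group_by))
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "LLMHighP99Latency", "model": "llama-3-70b", "severity": "critical", "instance": "pod-a"},
    {"alertname": "LLMHighP99Latency", "model": "llama-3-70b", "severity": "critical", "instance": "pod-b"},
    {"alertname": "LLMElevatedTokenRate", "model": "llama-3-70b", "severity": "info", "instance": "pod-a"},
]
for (receiver, group_key), members in group_alerts(alerts).items():
    print(receiver, group_key, len(members))
```

Grouping by model rather than by instance means a fleet-wide latency incident produces one page instead of one per pod.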

Exporters and Service Discovery

Prometheus uses exporters to collect metrics from systems that do not natively expose a Prometheus format. Key exporters for LLM infrastructure include:

Node Exporter for host-level metrics (CPU, memory, disk).
NVIDIA DCGM Exporter (which can be deployed through the NVIDIA GPU Operator) for detailed GPU metrics.
cAdvisor for container resource usage.

Service discovery automates the discovery of scrape targets in dynamic environments like Kubernetes. Prometheus can query the Kubernetes API to automatically find all running LLM inference pods and begin scraping their metrics, eliminating manual configuration as pods scale.
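A rough sketch of the target-selection logic behind Kubernetes service discovery is shown below. It assumes the common `prometheus.io/scrape`, `prometheus.io/port`, and `prometheus.io/path` pod annotations, which are a convention applied through relabeling rules rather than built-in Prometheus behavior. The pod dictionaries and the fallback port 8000 are hypothetical.

```python
def scrape_targets(pods):
    """Build scrape URLs for running pods that opt in via annotations."""
    targets = []
    for pod in pods:
        annotations = pod.get("annotations", {})
        if pod.get("phase") != "Running":
            continue  # Pods that are not running yet cannot be scraped.
        if annotations.get("prometheus.io/scrape") != "true":
            continue  # Pod has not opted in to scraping.
        port = annotations.get("prometheus.io/port", "8000")  # assumed default
        path = annotations.get("prometheus.io/path", "/metrics")
        targets.append(f"http://{pod['ip']}:{port}{path}")
    return targets


# Simplified pod records; a real setup would read these from the Kubernetes API.
pods = [
    {"name": "vllm-0", "ip": "10.0.0.5", "phase": "Running",
     "annotations": {"prometheus.io/scrape": "true", "prometheus.io/port": "8000"}},
    {"name": "vllm-1", "ip": "10.0.0.6", "phase": "Pending",
     "annotations": {"prometheus.io/scrape": "true"}},
    {"name": "batch-job", "ip": "10.0.0.7", "phase": "Running", "annotations": {}},
]
print(scrape_targets(pods))
```

Re-running this selection whenever the pod list changes is what lets the target set follow autoscaling without manual edits.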

Integration with Grafana & OpenTelemetry

Prometheus is a core data source for Grafana, which provides visualization dashboards for LLM metrics. Dashboards display real-time charts for:

Latency percentiles (P50, P90, P99)
Request rate and error budget burn
GPU utilization and token throughput

While Prometheus excels at metrics, it is often used alongside OpenTelemetry (OTel) for a complete observability picture. OTel handles distributed tracing for request flows across microservices (e.g., from API gateway to LLM to post-processing). Metrics from OTel can be exported to Prometheus, and traces can be correlated with metric alerts for faster root cause analysis (RCA).

INFRASTRUCTURE OBSERVABILITY

How Prometheus is Used for LLM & AI Monitoring

Prometheus is the de facto open-source standard for collecting and alerting on time-series metrics, providing the foundational telemetry layer for monitoring the health, performance, and cost of LLM serving infrastructure.

Prometheus implements a pull-based model, scraping metrics from instrumented LLM endpoints like inference servers, load balancers, and GPU nodes over HTTP. It stores this data as time-series identified by metric name and key-value labels (e.g., model_version, endpoint), enabling precise querying for metrics such as Tokens per Second, request latency, and error rates. This data forms the core dataset for performance dashboards and Service Level Objective compliance tracking.

For LLM-specific observability, Prometheus is extended with custom exporters that translate model-serving framework metrics (e.g., from vLLM or TGI) and application-level events into its format. Coupled with Grafana for visualization and Alertmanager for notifications, it creates a complete monitoring stack that alerts on anomalies, output drift, or infrastructure degradation, ensuring engineers can maintain reliable, performant LLM services.

MONITORING CATEGORIES

Common LLM & AI Metrics Tracked with Prometheus

A comparison of key operational, performance, and quality metrics for LLM services, detailing their purpose, typical measurement, and common alerting thresholds.

Metric Name & Purpose	Prometheus Metric Type	Typical Measurement / Formula	Common Alerting Threshold (Example)
Request Rate (Throughput)	Counter	rate(llm_requests_total[5m])	Alert on sudden drop > 50%
Request Duration (Latency)	Histogram	histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m]))	P99 > 10s
Token Generation Rate	Counter	rate(llm_tokens_generated_total[2m])	Tokens/sec < 10
Error Rate	Counter	rate(llm_request_errors_total[5m]) / rate(llm_requests_total[5m])	Error ratio > 0.01
Model Cache Hit Ratio	Gauge	llm_cache_hits / (llm_cache_hits + llm_cache_misses)	Ratio < 0.85
GPU Memory Utilization	Gauge	DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)	Utilization > 90%
GPU Utilization	Gauge	DCGM_FI_DEV_GPU_UTIL	Utilization > 95% for 5m
Queue Length / Wait Time	Gauge	llm_request_queue_length	Queue length > 100
Output Token Count (per request)	Histogram	llm_output_tokens_count	Count > 4096 (context window limit)
Input Token Count (per request)	Histogram	llm_input_tokens_count	Count > 8192 (max supported)
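Thresholds such as "Utilization > 95% for 5m" depend on the condition holding continuously, which alerting rules express with a `for` duration. The sketch below is a simplified model of that behavior, assuming one sample per rule evaluation; the function name `alert_state` and the sample values are illustrative.

```python
def alert_state(samples, threshold, for_seconds):
    """Return 'inactive', 'pending', or 'firing' as of the last sample.

    samples: list of (timestamp_seconds, value), oldest first.
    The alert fires only after the value has stayed above the threshold
    for at least for_seconds without interruption.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # Start of the current breach.
        else:
            breach_start = None  # Any dip resets the timer.
    if breach_start is None:
        return "inactive"
    held_for = samples[-1][0] - breach_start
    return "firing" if held_for >= for_seconds else "pending"


sustained = [(0, 96), (60, 97), (120, 98), (180, 99), (240, 97), (300, 96)]
interrupted = [(0, 96), (60, 97), (120, 90), (180, 99), (240, 97), (300, 96)]
print(alert_state(sustained, threshold=95, for_seconds=300))
print(alert_state(interrupted, threshold=95, for_seconds=300))
```

The `for` duration trades detection speed for fewer false pages: a brief GPU utilization spike stays pending and never notifies anyone.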

PROMETHEUS

Frequently Asked Questions

Prometheus is the de facto standard for monitoring cloud-native applications and infrastructure. For LLM operations, it provides the foundational telemetry layer for tracking model performance, infrastructure health, and business metrics.

Prometheus is an open-source systems monitoring and alerting toolkit that collects and stores time-series metrics using a multi-dimensional data model and a powerful query language (PromQL). It operates on a pull model, where the Prometheus server scrapes HTTP endpoints (by default at the /metrics path) exposed by monitored targets at configured intervals. Collected metrics are stored locally in a custom, efficient time-series database. Its core architectural components include the main Prometheus server for scraping and storage, service discovery for dynamically finding targets, the Alertmanager for handling alerts, and various exporters for bridging third-party systems.

For LLM serving, Prometheus scrapes metrics from the inference server (e.g., vLLM, TGI), the application backend, and the underlying infrastructure (GPU, memory, network). This provides a unified view of Tokens per Second (TPS), latency percentiles (P99), error rates, and GPU utilization.

LLM PERFORMANCE MONITORING

Related Terms

Prometheus operates within a broader ecosystem of observability tools and methodologies essential for monitoring LLM performance. The following terms represent key concepts and complementary technologies used to build a complete monitoring stack.

OpenTelemetry (OTel)

A vendor-neutral, open-source observability framework for generating, collecting, and exporting telemetry data—traces, metrics, and logs—from LLM applications. While Prometheus excels at metrics, OpenTelemetry provides a unified standard for all three pillars of observability. It is often used to instrument LLM applications, with metrics then exported to Prometheus for storage and alerting. Key components include:

OTel Collector: A vendor-agnostic proxy for receiving, processing, and exporting telemetry.
Instrumentation Libraries: For auto-instrumenting frameworks and manual code instrumentation.
Semantic Conventions: Standardized attribute names for consistent data across services.

Grafana Dashboards

Customizable visualization interfaces that query and display time-series metrics from data sources like Prometheus. Grafana is the primary tool for creating operational dashboards to monitor LLM performance indicators in real-time. Typical panels for LLM monitoring include:

Latency Percentiles (P50, P90, P99): Visualizing tail latency for Time to First Token and Inter-Token Latency.
Throughput: Tokens per Second and requests per second.
Error Rates: 4xx/5xx HTTP status codes and model inference errors.
Resource Utilization: GPU memory, compute utilization, and KV cache efficiency.
Business Metrics: Token consumption per user or cost per request.

Distributed Tracing

A method of profiling requests as they flow through a distributed LLM application stack by recording timing and metadata for individual operations (spans) across service boundaries. While Prometheus aggregates metrics, distributed tracing provides a detailed, request-level view. It is critical for diagnosing latency issues in complex pipelines involving multiple microservices, databases, and external API calls. Implemented using standards like OpenTelemetry Trace, it helps answer questions like:

Which service or model stage is causing high P99 latency?
What is the breakdown of time spent in prefill vs. decode phases?
How does a retrieval-augmented generation (RAG) pipeline spend its time between retrieval and generation?

Service Level Objective (SLO)

A target value or range for a Service Level Indicator that defines the acceptable performance and reliability of an LLM-powered service. SLOs are central to site reliability engineering (SRE); unlike a Service Level Agreement (SLA), an SLO is an internal target rather than a contract. Prometheus metrics are used to measure SLI compliance and calculate the error budget, which is the allowable amount of unreliability. For LLM services, common SLOs are defined for:

Latency: "99% of requests must have a Time to First Token < 500ms."
Availability: "The model API must be available 99.95% of the time."
Quality: "Less than 1% of responses may contain a detected hallucination."

Violating an SLO consumes the error budget, which guides the pace and risk of new deployments.
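The error budget arithmetic can be worked through in a few lines. This sketch assumes a 30-day window and hypothetical request counts; the function names are illustrative.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent.

    Negative values mean the budget is overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


# A 99.95% availability target over 30 days allows about 21.6 minutes of downtime.
print(round(error_budget_minutes(0.9995), 1))
# With a 99% success target, 2,500 failures out of 1,000,000 requests
# spend a quarter of the budget, leaving about 0.75.
print(round(budget_remaining(0.99, 1_000_000, 2_500), 2))
```

In practice the request counts would come from PromQL queries over counters such as llm_requests_total and llm_request_errors_total.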

Canary & Shadow Deployment

Release strategies for safely deploying new LLM models or application versions by exposing them to a controlled subset of traffic. These strategies rely heavily on Prometheus for comparative metric analysis.

Canary Deployment: The new version serves a small percentage of live traffic (e.g., 5%). Prometheus metrics for the canary cohort (latency, error rate, TPS) are compared against the baseline to validate performance before a full rollout.
Shadow Deployment: The new version processes all live requests in parallel, but its outputs are discarded. This allows for full-scale performance and correctness testing (e.g., comparing output embeddings for drift) with zero user risk. Prometheus monitors resource consumption of the shadow model.
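A canary comparison can be reduced to a small decision function once the metrics for each cohort have been queried. The sketch below is one possible shape for that check; the thresholds (`max_error_delta`, `max_latency_ratio`), the dictionary keys, and the sample numbers are assumptions, not recommended values.

```python
def canary_verdict(baseline, canary, max_error_delta=0.005, max_latency_ratio=1.2):
    """Compare canary metrics against the baseline cohort.

    Both arguments are dicts with 'error_ratio' and 'p99_seconds', for example
    the results of PromQL queries filtered by a version label.
    Returns ('promote' | 'rollback', list_of_reasons).
    """
    reasons = []
    if canary["error_ratio"] - baseline["error_ratio"] > max_error_delta:
        reasons.append("error ratio regressed")
    if canary["p99_seconds"] > baseline["p99_seconds"] * max_latency_ratio:
        reasons.append("p99 latency regressed")
    return ("rollback" if reasons else "promote"), reasons


baseline = {"error_ratio": 0.004, "p99_seconds": 2.0}
healthy_canary = {"error_ratio": 0.005, "p99_seconds": 2.2}
slow_canary = {"error_ratio": 0.020, "p99_seconds": 3.0}

print(canary_verdict(baseline, healthy_canary))
print(canary_verdict(baseline, slow_canary))
```

With only a small share of traffic, the canary cohort has far fewer requests than the baseline, so short comparison windows can be noisy; longer windows or a minimum request count help avoid false rollbacks.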

Statistical Process Control (SPC)

A method of quality control using statistical methods, like control charts, to monitor and control a process. In LLM operations, SPC is applied to Prometheus metrics to distinguish normal variance from significant anomalies that require intervention. Key concepts include:

Control Limits: Statistically derived bounds (e.g., 3-sigma) within which metric variation is considered common-cause.
Mean Time to Recovery (MTTR): A reliability metric often tracked alongside SPC alerts, measuring the average time to restore service after a control limit breach.
Root Cause Analysis (RCA): The systematic process triggered by an SPC alert to find the fundamental cause of a metric anomaly, such as a model regression, infrastructure failure, or data drift.
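As a minimal sketch of 3-sigma control limits, the code below derives bounds from a baseline window and flags new points that fall outside them. It uses the sample standard deviation for simplicity, whereas formal control charts often estimate sigma from moving ranges. The tokens-per-second values are invented.

```python
import statistics


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) control limits from a baseline sample."""
    mean = statistics.fmean(baseline)
    spread = statistics.stdev(baseline)  # sample standard deviation
    return mean - sigmas * spread, mean + sigmas * spread


def out_of_control(baseline, new_points, sigmas=3.0):
    """Return the new points that fall outside the control limits."""
    lower, upper = control_limits(baseline, sigmas)
    return [point for point in new_points if point < lower or point > upper]


# Hypothetical tokens-per-second readings from a stable period.
baseline_tps = [100, 102, 98, 101, 99, 100, 103, 97]
lower, upper = control_limits(baseline_tps)
print(lower, upper)
print(out_of_control(baseline_tps, [101, 107, 93, 105]))
```

Points inside the limits are treated as common-cause variation and need no action; points outside them are candidates for root cause analysis.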

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
