Inferensys

Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit that collects time-series metrics via a pull model over HTTP, widely used for monitoring LLM serving infrastructure and application performance.
LLM PERFORMANCE MONITORING

What is Prometheus?

Prometheus is the de facto open-source standard for monitoring and alerting on time-series metrics, forming the core telemetry backbone for modern LLM serving infrastructure.

Prometheus is an open-source systems monitoring and alerting toolkit that collects and stores time-series metrics using a pull model over HTTP. It is designed for reliability: each Prometheus server is standalone and keeps its data on local storage, so it does not depend on distributed storage or other remote services to keep working. It also provides a multi-dimensional data model and a flexible query language called PromQL. For LLM operations, it is the primary system for scraping metrics from model servers, inference engines, and auxiliary services to track latency, throughput, error rates, and resource utilization.

In an LLM serving stack, the Prometheus server scrapes metrics endpoints exposed by components such as vLLM or Triton Inference Server, with exporters covering systems that do not expose Prometheus-format metrics themselves. Engineers use PromQL to query this data, create alerts for SLO violations, and visualize trends in tools like Grafana. Its pull-based architecture, where the monitoring server scrapes targets, contrasts with push-based systems and is complemented by the Pushgateway for short-lived jobs that may exit before they can be scraped. This makes Prometheus well suited to observing continuous batching efficiency, KV cache usage, and token generation performance in production.

PROMETHEUS

Core Components & Architecture

Prometheus is an open-source systems monitoring and alerting toolkit that uses a pull model over HTTP to collect time-series metrics, widely used for monitoring the health and performance of LLM serving infrastructure and applications.

01

Pull-Based Metric Collection

Prometheus operates on a pull model: its server actively scrapes metrics from configured HTTP endpoints exposed by applications, rather than waiting for applications to push data to it. For LLM services, this involves instrumenting the application code or using exporters to expose key metrics like:

  • Request latency (TTFT, inter-token latency)
  • Throughput (tokens per second)
  • Error rates and HTTP status codes
  • GPU utilization and memory usage

The server then stores these scraped metrics as time-series data in its local database.
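As a rough illustration of what a scrape target looks like, the sketch below serves two metrics in the Prometheus text exposition format using only the Python standard library. The metric values, the help strings, and port 8000 are assumptions made for the example; a production service would more likely use an official client library than hand-render the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process metric state: name -> (type, help text, value).
# A real service would update these values as requests are handled.
METRICS = {
    "llm_requests_total": ("counter", "Total LLM requests received.", 1027),
    "llm_request_queue_length": ("gauge", "Requests waiting to be scheduled.", 4),
}


def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (mtype, help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics, the path Prometheus scrapes by default."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and pointing a Prometheus scrape job at the host on port 8000 would let the server pull these values at each scrape interval.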
02

Multi-Dimensional Data Model

Metrics in Prometheus are identified by a metric name and a set of key-value pairs called labels. This multi-dimensional data model allows for powerful filtering and aggregation. For LLM monitoring, labels can segment traffic by:

  • Model version (model="gpt-4")
  • Deployment endpoint (endpoint="/v1/completions")
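  • Tenant ID for multi-tenancy (keep label values bounded; every distinct value creates a new time series, so per-user IDs can become expensive)
  • Request characteristics (mode="streaming")

A sample metric for LLM latency might be: llm_request_duration_seconds{model="llama-3-70b", endpoint="/chat", quantile="0.99"}. The quantile label here is how a summary metric exposes a precomputed percentile; a histogram exposes le-labeled buckets instead.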
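To make the name-plus-labels model concrete, the following sketch treats each series as a metric name with a label dictionary, then filters and aggregates by label in plain Python. The sample values are invented, and the helper names (series_id, select, sum_by) are hypothetical; they only mimic what label matchers and sum by (...) do in PromQL.

```python
from collections import defaultdict

# Each sample: (metric name, label set, value). Values are hypothetical.
SAMPLES = [
    ("llm_tokens_generated_total", {"model": "llama-3-70b", "endpoint": "/chat"}, 1200.0),
    ("llm_tokens_generated_total", {"model": "llama-3-70b", "endpoint": "/v1/completions"}, 300.0),
    ("llm_tokens_generated_total", {"model": "gpt-4", "endpoint": "/chat"}, 500.0),
]


def series_id(name, labels):
    """Format a series identity as name{k="v",...} with labels sorted by key."""
    body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{body}}}"


def select(samples, name, **matchers):
    """Keep samples whose name matches and whose labels equal every matcher."""
    return [
        s for s in samples
        if s[0] == name and all(s[1].get(k) == v for k, v in matchers.items())
    ]


def sum_by(samples, label):
    """Sum sample values grouped by one label, similar in spirit to sum by (label)."""
    totals = defaultdict(float)
    for _, labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)
```

For instance, sum_by(SAMPLES, "model") collapses the endpoint dimension and leaves one total per model.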
03

PromQL Query Language

PromQL is Prometheus's functional query language, used to select and aggregate time-series data in real time. It is essential for creating dashboards, alerts, and ad-hoc analysis. Key operations for LLM monitoring include:

  • Rate calculation: rate(llm_requests_total[5m])
  • Aggregation across labels: sum by (model) (rate(llm_tokens_generated_total[5m]))
  • Percentile analysis: histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m]))
  • Comparison operators: for alerting when a value crosses a threshold, such as an error ratio above 0.01

PromQL enables SREs to calculate Service Level Indicators (SLIs) like error rate and latency percentiles directly from raw metrics.
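The percentile estimate behind histogram_quantile can be sketched as linear interpolation over cumulative buckets. The function below is a simplified approximation of that idea, not Prometheus's own implementation: it assumes non-negative bucket bounds and takes plain (upper bound, cumulative count) pairs, whereas PromQL normally applies the calculation to per-second bucket rates.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    Bucket bounds are assumed to be non-negative.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]

    if upper == math.inf:
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan

    lower, below = buckets[index - 1] if index > 0 else (0.0, 0.0)
    if count == below:
        return upper
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)
```

With buckets [(0.5, 50), (1.0, 90), (2.0, 99), (math.inf, 100)], the 0.95 quantile falls in the 1.0 to 2.0 second bucket and is interpolated between those bounds, which is why bucket boundaries placed near the SLO threshold give more useful estimates.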
04

Alertmanager for Notification Routing

The Alertmanager service handles alerts sent by the Prometheus server. It is responsible for deduplication, grouping, silencing, inhibition, and routing of alerts to various receivers like email, Slack, or PagerDuty. For LLM SLOs, Alertmanager can be configured to:

  • Group all alerts related to a specific model deployment.
  • Throttle notifications to prevent alert fatigue during ongoing incidents.
  • Route critical alerts (e.g., SLO breach, P99 latency spike) to an on-call engineer while sending informational alerts (e.g., elevated token rate) to a chat channel.

It works in conjunction with Prometheus's alerting rules defined in PromQL.
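The deduplication, grouping, and routing behavior described above can be pictured with a small in-memory model. This is only a conceptual sketch: the receiver names, the severity label values, and the group_alerts helper are hypothetical, and the real Alertmanager is configured declaratively rather than through code like this.

```python
from collections import defaultdict

# Hypothetical mapping from a severity label to a receiver name.
ROUTES = {"critical": "pagerduty-oncall", "info": "slack-llm-ops"}
DEFAULT_RECEIVER = "slack-llm-ops"


def route(alert):
    """Pick a receiver from the alert's severity label."""
    return ROUTES.get(alert["labels"].get("severity"), DEFAULT_RECEIVER)


def group_alerts(alerts, group_by=("alertname", "model")):
    """Drop alerts with identical label sets, then bucket by (receiver, group key)."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        fingerprint = tuple(sorted(alert["labels"].items()))
        if fingerprint in seen:
            continue  # same label set already recorded: deduplicate
        seen.add(fingerprint)
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[(route(alert), key)].append(alert)
    return dict(groups)
```

Grouping by alert name and model means that several replicas of one deployment breaching the same latency rule produce a single notification instead of one per replica.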
05

Exporters and Service Discovery

Prometheus uses exporters to collect metrics from systems that do not natively expose a Prometheus format. Key exporters for LLM infrastructure include:

  • Node Exporter for host-level metrics (CPU, memory, disk).
  • NVIDIA DCGM Exporter (which the NVIDIA GPU Operator can deploy on Kubernetes) for detailed GPU metrics.
  • cAdvisor for container resource usage.

Service discovery automates the discovery of scrape targets in dynamic environments like Kubernetes. Prometheus can query the Kubernetes API to automatically find all running LLM inference pods and begin scraping their metrics, eliminating manual configuration as pods scale.
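The effect of service discovery can be sketched as a filter over discovered pods that yields a list of scrape targets. Everything in this example is an assumption for illustration: the pod dictionaries, the app label value, and port 8000 are made up, and Prometheus performs this step internally through its Kubernetes service discovery and relabeling configuration rather than through user code.

```python
def discover_targets(pods, app_label="llm-inference", port=8000):
    """Return sorted 'ip:port' scrape targets for running pods with the app label.

    `pods` is a hypothetical list of dicts standing in for Kubernetes API results.
    """
    return sorted(
        f"{pod['ip']}:{port}"
        for pod in pods
        if pod.get("phase") == "Running"
        and pod.get("labels", {}).get("app") == app_label
        and pod.get("ip")
    )
```

Re-running the filter after each change in the pod list is what keeps the target set in step with autoscaling, with no hand-edited host lists.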
06

Integration with Grafana & OpenTelemetry

Prometheus is a core data source for Grafana, which provides visualization dashboards for LLM metrics. Dashboards display real-time charts for:

  • Latency percentiles (P50, P90, P99)
  • Request rate and error budget burn
  • GPU utilization and token throughput

While Prometheus excels at metrics, it is often used alongside OpenTelemetry (OTel) for a complete observability picture. OTel handles distributed tracing for request flows across microservices (e.g., from API gateway to LLM to post-processing). Metrics from OTel can be exported to Prometheus, and traces can be correlated with metric alerts for faster root cause analysis (RCA).
INFRASTRUCTURE OBSERVABILITY

How Prometheus is Used for LLM & AI Monitoring

Prometheus is the de facto open-source standard for collecting and alerting on time-series metrics, providing the foundational telemetry layer for monitoring the health, performance, and cost of LLM serving infrastructure.

Prometheus implements a pull-based model, scraping metrics from instrumented LLM endpoints like inference servers, load balancers, and GPU nodes over HTTP. It stores this data as time-series identified by metric name and key-value labels (e.g., model_version, endpoint), enabling precise querying for metrics such as Tokens per Second, request latency, and error rates. This data forms the core dataset for performance dashboards and Service Level Objective compliance tracking.
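Dashboards and scripts typically read this data back through the Prometheus HTTP API. The sketch below builds an instant-query URL for the /api/v1/query endpoint and parses a vector result using only the standard library. The base URL is an assumption (9090 is the default Prometheus port), and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_vector(payload):
    """Turn an instant-vector response into {sorted label tuple: float value}."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each result carries its labels and a [timestamp, "value"] pair.
    return {
        tuple(sorted(item["metric"].items())): float(item["value"][1])
        for item in data["result"]
    }


def instant_query(base_url, promql, timeout=5):
    """Run one instant query against a reachable Prometheus server."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vector(json.load(resp))
```

A call such as instant_query("http://localhost:9090", "sum by (model) (rate(llm_tokens_generated_total[5m]))") would return one tokens-per-second value per model, assuming a server is running at that address and the metric exists.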

For LLM-specific observability, serving frameworks such as vLLM and TGI expose Prometheus-format metrics directly, while custom exporters or application instrumentation cover application-level events and anything the framework does not report. Coupled with Grafana for visualization and Alertmanager for notifications, Prometheus forms a complete monitoring stack that alerts on anomalies, output drift (where the application publishes it as a metric), or infrastructure degradation, helping engineers maintain reliable, performant LLM services.

MONITORING CATEGORIES

Common LLM & AI Metrics Tracked with Prometheus

A comparison of key operational, performance, and quality metrics for LLM services, detailing their purpose, typical measurement, and common alerting thresholds.

| Metric Name & Purpose | Prometheus Metric Type | Typical Measurement / Formula | Common Alerting Threshold (Example) |
| --- | --- | --- | --- |
| Request Rate (Throughput) | Counter | rate(llm_requests_total[5m]) | Alert on sudden drop > 50% |
| Request Duration (Latency) | Histogram | histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m])) | P99 > 10s |
| Token Generation Rate | Counter | rate(llm_tokens_generated_total[2m]) | Tokens/sec < 10 |
| Error Rate | Counter | rate(llm_request_errors_total[5m]) / rate(llm_requests_total[5m]) | Error ratio > 0.01 |
| Model Cache Hit Ratio | Gauge | llm_cache_hits / (llm_cache_hits + llm_cache_misses) | Ratio < 0.85 |
| GPU Memory Utilization | Gauge | DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) | Utilization > 90% |
| GPU Utilization | Gauge | DCGM_FI_DEV_GPU_UTIL | Utilization > 95% for 5m |
| Queue Length / Wait Time | Gauge | llm_request_queue_length | Queue length > 100 |
| Output Token Count (per request) | Histogram | llm_output_tokens_count | Count > 4096 (context window limit) |
| Input Token Count (per request) | Histogram | llm_input_tokens_count | Count > 8192 (max supported) |
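The rate() and error-ratio formulas listed above can be approximated in a few lines to show the arithmetic involved. This is a simplified sketch: it handles counter resets by treating any decrease as a restart from zero, but it omits the window-boundary extrapolation that PromQL's rate() applies, and the function names are hypothetical.

```python
def increase(samples):
    """Total counter increase over (timestamp, value) samples.

    Any drop in value is treated as a counter reset to zero.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples):
    """Per-second rate across the sampled window (no boundary extrapolation)."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase(samples) / elapsed


def error_ratio(error_samples, request_samples):
    """Ratio of error rate to request rate, or None when there is no traffic."""
    errors = simple_rate(error_samples)
    requests = simple_rate(request_samples)
    if not requests:
        return None
    return (errors or 0.0) / requests
```

Comparing the result of error_ratio against 0.01 mirrors the example alerting threshold in the Error Rate row.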

PROMETHEUS

Frequently Asked Questions

Prometheus is the de facto standard for monitoring cloud-native applications and infrastructure. For LLM operations, it provides the foundational telemetry layer for tracking model performance, infrastructure health, and business metrics.

Prometheus is an open-source systems monitoring and alerting toolkit that collects and stores time-series metrics using a multi-dimensional data model and a powerful query language (PromQL). It operates on a pull model, where the Prometheus server scrapes HTTP endpoints (served at the /metrics path by default) exposed by monitored targets at configured intervals. Collected metrics are stored locally in a custom, efficient time-series database. Its core architectural components include the main Prometheus server for scraping and storage, service discovery for dynamically finding targets, the Alertmanager for handling alerts, and various exporters for bridging third-party systems.

For LLM serving, Prometheus scrapes metrics from the inference server (e.g., vLLM, TGI), the application backend, and the underlying infrastructure (GPU, memory, network). This provides a unified view of Tokens per Second (TPS), latency percentiles (P99), error rates, and GPU utilization.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.