Inferensys

Real User Monitoring (RUM)

Real User Monitoring (RUM) is a performance monitoring technique that collects and analyzes metrics from actual user interactions with a live application to understand real-world experience.
PRODUCTION CANARY ANALYSIS

What is Real User Monitoring (RUM)?

Real User Monitoring (RUM) is a passive performance monitoring technique that instruments a web or mobile application to collect telemetry from actual user sessions. It captures frontend metrics such as page load time, First Contentful Paint (FCP), and JavaScript error rates directly from the user's browser or device. The result is a ground-truth view of end-user experience (EUX) across geographies, devices, and network conditions, in contrast to synthetic monitoring, which relies on simulated transactions.

Within Evaluation-Driven Development, RUM is critical for production canary analysis. Teams compare RUM metrics, such as Core Web Vitals and Apdex scores, between a baseline (control) and a new model or feature release (canary), and use the difference to reach a data-driven deployment verdict. This real-world feedback loop checks that a change does not degrade user-perceived performance before a progressive rollout, which directly supports Service Level Objective (SLO) compliance for AI-powered services.
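To make the Apdex comparison concrete, here is a minimal Python sketch that scores latency samples from each group. Apdex counts samples at or below a target threshold T as satisfied, samples between T and 4T as tolerating (half weight), and the rest as frustrated. The 500 ms threshold, the sample values, and the function name are assumptions chosen for illustration, not values from any particular RUM product.

```python
def apdex(latencies_ms, threshold_ms):
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    satisfied = sum(1 for t in latencies_ms if t <= threshold_ms)
    tolerating = sum(
        1 for t in latencies_ms if threshold_ms < t <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(latencies_ms)


# Hypothetical RUM latency samples (milliseconds) for each group.
baseline_ms = [120, 300, 450, 800, 2500]
canary_ms = [150, 480, 900, 1900, 2600]

baseline_score = apdex(baseline_ms, threshold_ms=500)
canary_score = apdex(canary_ms, threshold_ms=500)
print(f"baseline Apdex={baseline_score:.2f}, canary Apdex={canary_score:.2f}")
```

A lower canary score on comparable traffic suggests that more users fell into the tolerating or frustrated bands; how large a drop justifies a rollback is a team-specific choice.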

Core RUM Metrics for AI/ML Systems

Real User Monitoring (RUM) provides the ground-truth telemetry for evaluating live AI systems. These metrics are critical for validating canary deployments and ensuring new models meet user-facing performance and quality standards.

01. Inference Latency (P50, P95, P99)

The time elapsed from a user's request to the delivery of the model's final output, measured as percentiles. This is the primary user-perceived performance metric for interactive AI features.

  • P50 (Median): Represents the typical user experience.
  • P95/P99 (Tail Latency): Critical for understanding worst-case scenarios, which often correlate with user abandonment. A spike in P99 latency during a canary is a strong rollback signal.
  • Example: A chatbot's end-to-end response time, a streaming language model's time-to-first-token, or an image generation model's time to a finished image.
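The percentile definitions above can be sketched in a few lines of Python. This example uses the nearest-rank method (one of several common percentile definitions) and a hypothetical 10% tolerance for flagging a tail-latency regression; the sample values, function names, and tolerance are assumptions for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def tail_regressed(baseline, canary, p=99, tolerance=0.10):
    """True if the canary's p-th percentile exceeds the baseline's by more than `tolerance`."""
    return percentile(canary, p) > percentile(baseline, p) * (1 + tolerance)


# Hypothetical inference latencies in milliseconds.
baseline_ms = [210, 180, 250, 3200, 190, 230, 205, 940, 220, 240]
canary_ms = [215, 185, 260, 5200, 200, 235, 210, 980, 225, 250]

for p in (50, 95, 99):
    print(p, percentile(baseline_ms, p), percentile(canary_ms, p))
print("P99 regression:", tail_regressed(baseline_ms, canary_ms))
```

In this sample the medians are nearly identical while the canary's P99 is much higher, which is exactly the pattern a median-only comparison would miss. With so few samples the tail percentiles collapse to the maximum, so real comparisons need far more traffic per group.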

02. Model Error Rate & Fallback Rate

The percentage of user requests where the model fails to produce a valid, usable response, triggering either an error or a fallback to a default/heuristic system.

  • HTTP 5xx Errors: Indicate infrastructure failures (e.g., GPU OOM, container crashes).
  • Application Errors: Include malformed outputs, serialization failures, or context window overflows.
  • Fallback Rate: Tracks how often a safety net or less-capable model is invoked. A rising fallback rate in a canary suggests the new model is less reliable than the champion.
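As a rough sketch of how these rates might be computed from per-request outcomes, the Python below tallies outcome labels for each group. The label names ("ok", "http_5xx", "app_error", "fallback") are hypothetical; a real pipeline would map its own status codes and application signals onto categories like these.

```python
from collections import Counter


def outcome_rates(outcomes):
    """Return error and fallback rates for a list of per-request outcome labels."""
    if not outcomes:
        raise ValueError("need at least one request outcome")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        # Infrastructure and application failures are reported together here.
        "error_rate": (counts["http_5xx"] + counts["app_error"]) / total,
        # Requests served by the safety net instead of the primary model.
        "fallback_rate": counts["fallback"] / total,
    }


# Hypothetical outcomes for 100 requests per group.
baseline = ["ok"] * 96 + ["http_5xx"] + ["app_error"] + ["fallback"] * 2
canary = ["ok"] * 90 + ["http_5xx"] + ["app_error"] * 2 + ["fallback"] * 7

print("baseline:", outcome_rates(baseline))
print("canary:  ", outcome_rates(canary))
```

Keeping the fallback rate separate from the error rate matters because a fallback still returns something to the user, so it can hide a reliability problem that an error-only dashboard would not show.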

03. Business & Quality KPIs

Domain-specific success metrics tied directly to user satisfaction and business outcomes. These are the ultimate determinants of a model's value.

  • For a RAG System: Click-through rate on cited sources, session length.
  • For a Chatbot: Conversation completion rate, user satisfaction score (post-interaction survey).
  • For a Recommendation Model: Conversion rate, add-to-cart rate.
  • For Code Generation: Acceptance rate of suggested code, developer edit distance.
  • A successful canary must show non-inferiority or improvement in these KPIs.
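One way to check non-inferiority for a rate-style KPI such as conversion rate is a one-sided confidence bound on the difference between the two groups. The sketch below uses a normal approximation, which is only reasonable for large samples; the margin of one percentage point, the 5% significance level, and the counts are assumptions for illustration.

```python
import math
from statistics import NormalDist


def non_inferior(base_success, base_total, canary_success, canary_total,
                 margin=0.01, alpha=0.05):
    """True if the canary's rate is non-inferior to the baseline's within `margin`.

    Computes a one-sided lower confidence bound on (canary rate - baseline rate)
    with a normal approximation and checks that it stays above -margin.
    """
    p_base = base_success / base_total
    p_canary = canary_success / canary_total
    std_err = math.sqrt(
        p_base * (1 - p_base) / base_total
        + p_canary * (1 - p_canary) / canary_total
    )
    z = NormalDist().inv_cdf(1 - alpha)
    lower_bound = (p_canary - p_base) - z * std_err
    return lower_bound > -margin


# Hypothetical conversion counts: (conversions, sessions) per group.
print(non_inferior(500, 10_000, 495, 10_000))  # small dip, within the margin
print(non_inferior(500, 10_000, 400, 10_000))  # one-point drop, outside the margin
```

The margin encodes how much KPI loss the team is willing to accept in exchange for the new model's other benefits, so it should be agreed on before the canary starts rather than tuned afterwards.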

04. Token Usage & Throughput

Measures of computational resource consumption and system capacity derived from real user traffic patterns.

  • Tokens per Request: Directly correlates with cost for LLM APIs (e.g., OpenAI, Anthropic). A canary model generating longer outputs can significantly increase operational expenses.
  • Requests per Second (RPS): Indicates the load pattern and helps validate autoscaling configurations.
  • Concurrent User Sessions: Gauges the system's ability to handle stateful, multi-turn interactions under load.
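To illustrate the cost effect of longer outputs, this short Python sketch projects daily spend from the mean tokens per request observed in each group. The price per 1,000 tokens, the request volume, and the token counts are made-up placeholders, not any provider's actual pricing.

```python
from statistics import mean


def projected_daily_cost(tokens_per_request, requests_per_day, usd_per_1k_tokens):
    """Estimate daily spend assuming a flat per-token price (a simplification)."""
    return tokens_per_request * requests_per_day * usd_per_1k_tokens / 1000


# Hypothetical total tokens per request sampled from real user traffic.
baseline_tokens = [380, 420, 400]
canary_tokens = [500, 540, 520]

REQUESTS_PER_DAY = 100_000   # assumed traffic volume
USD_PER_1K_TOKENS = 0.002    # assumed flat price, for illustration only

baseline_cost = projected_daily_cost(mean(baseline_tokens), REQUESTS_PER_DAY, USD_PER_1K_TOKENS)
canary_cost = projected_daily_cost(mean(canary_tokens), REQUESTS_PER_DAY, USD_PER_1K_TOKENS)
print(f"baseline ${baseline_cost:.2f}/day, canary ${canary_cost:.2f}/day")
```

Real pricing usually distinguishes input from output tokens, so a production estimate would track the two separately.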

05. Client-Side Stability Metrics

Metrics capturing failures or degradations in the user's browser or mobile application when interacting with the AI service.

  • JavaScript Error Rate: Errors from the frontend SDK or widget integrating the model.
  • Web Vitals for AI Features: Largest Contentful Paint (LCP) for AI-generated content, Interaction to Next Paint (INP) for chat interfaces.
  • Mobile App Crashes: Crashes attributed to the native SDK handling model responses.
  • These metrics are essential for full-stack canary analysis, as a model change can inadvertently break client-side integrations.

06. Geographic & Demographic Performance

The segmentation of core RUM metrics by user location, device type, or other relevant attributes to ensure equitable performance.

  • Regional Latency: Model inference latency for users in Europe vs. Asia-Pacific, which may be routed to different data centers.
  • Device Performance: Latency and error rate on low-end mobile devices versus desktop computers.
  • Key Use: Detecting performance regression bias where a new model performs well for one user cohort but poorly for another, which would be masked by global averages.
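The masking effect described above can be shown with a small Python sketch that compares medians globally and per region. The record fields ("variant", "region", "latency_ms"), the sample values, and the 5% tolerance are assumptions for illustration, and the helper expects every region to have samples from both variants.

```python
from collections import defaultdict
from statistics import median


def median_by_group(records, key):
    """Group latency samples by key(record) and return the median per group."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record["latency_ms"])
    return {group: median(values) for group, values in groups.items()}


def regressed_segments(records, tolerance=0.05):
    """Regions where the canary median exceeds the baseline median by more than `tolerance`."""
    per_segment = median_by_group(records, key=lambda r: (r["region"], r["variant"]))
    regions = sorted({region for region, _ in per_segment})
    return [
        region for region in regions
        if per_segment[(region, "canary")]
        > per_segment[(region, "baseline")] * (1 + tolerance)
    ]


def make(variant, region, values):
    return [{"variant": variant, "region": region, "latency_ms": v} for v in values]


# Hypothetical samples: the canary is faster in "eu" but slower in "apac".
records = (
    make("baseline", "eu", [200, 210, 220]) + make("canary", "eu", [150, 160, 170])
    + make("baseline", "apac", [300, 310, 320]) + make("canary", "apac", [350, 360, 370])
)

print("global medians:", median_by_group(records, key=lambda r: r["variant"]))
print("regressed regions:", regressed_segments(records))
```

In this sample the global medians for baseline and canary are identical, yet the per-region view flags "apac" as regressed, which is why segmentation belongs in the canary verdict rather than in a follow-up investigation.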

How Real User Monitoring Works

Real User Monitoring (RUM) works by embedding a lightweight agent in the application: typically a JavaScript snippet in a web page, or a native SDK in a mobile app. The agent passively collects performance metrics such as page load time, Interaction to Next Paint (INP, which replaced First Input Delay as a Core Web Vital in March 2024), and JavaScript error rates directly from the user's browser or device. It sends this data to a collection endpoint, where it is aggregated and analyzed into a performance profile segmented by real user geography, device type, and network conditions, giving a ground-truth view of application health.

Within Production Canary Analysis, RUM data is critical for comparing the new canary version against the stable baseline. By segmenting RUM metrics by deployment version, engineers can detect if the new release introduces regressions in Core Web Vitals or increased error rates for the exposed user subset. This real-user feedback complements synthetic monitoring and system metrics, enabling a data-driven deployment verdict to promote or roll back based on actual user impact.
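A promote-or-roll-back decision based on version-segmented RUM data could look like the following Python sketch. The two metrics, the thresholds, and the names are assumptions for illustration; a real verdict would usually combine more signals and account for sample size.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    """Aggregated RUM metrics for one deployment version (hypothetical fields)."""
    p95_lcp_ms: float
    js_error_rate: float


def verdict(baseline, canary, max_lcp_increase=0.10, max_error_increase=0.002):
    """Return ("promote" | "rollback", reasons) by comparing canary to baseline."""
    reasons = []
    # Relative budget for the Core Web Vitals regression check.
    if canary.p95_lcp_ms > baseline.p95_lcp_ms * (1 + max_lcp_increase):
        reasons.append("p95 LCP regression")
    # Absolute budget for the error-rate check.
    if canary.js_error_rate > baseline.js_error_rate + max_error_increase:
        reasons.append("JS error rate increase")
    return ("rollback" if reasons else "promote", reasons)


baseline = VersionStats(p95_lcp_ms=2400, js_error_rate=0.010)
print(verdict(baseline, VersionStats(p95_lcp_ms=2450, js_error_rate=0.011)))
print(verdict(baseline, VersionStats(p95_lcp_ms=2900, js_error_rate=0.011)))
```

Returning the reasons alongside the decision keeps the verdict auditable, so whoever reviews a rollback can see which budget was exceeded.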

MONITORING METHODOLOGIES

RUM vs. Synthetic Monitoring for AI Systems

A comparison of two primary monitoring approaches for evaluating AI system performance in production, highlighting their distinct roles in the canary analysis workflow.

| Monitoring Dimension | Real User Monitoring (RUM) | Synthetic Monitoring |
| --- | --- | --- |
| Data Source | Actual, anonymized user sessions and interactions. | Scripted, simulated transactions from predefined locations. |
| Primary Objective | Measure real-world user experience (UX) and business impact. | Proactively verify system availability, functionality, and performance under controlled conditions. |
| Detection Capability | End-to-end latency, JavaScript errors, Core Web Vitals, region-specific slowdowns, and unexpected user behavior patterns. | Uptime/downtime, API response correctness, baseline performance SLIs, and geographic latency from test points. |
| Context for AI/ML | Measures actual inference latency, model output quality (via downstream user actions), and drift impact on real user journeys. | Validates model endpoint health, performs scheduled regression tests on new model versions, and establishes performance baselines. |
| Use in Canary Analysis | Critical for the final verdict. Compares real user KPIs (e.g., conversion rate, session duration) between control and canary groups. | Used for initial smoke tests and pre-deployment validation. Ensures the canary is functionally operational before receiving live traffic. |
| Coverage | Limited to areas with actual user traffic. New features or low-traffic paths may have sparse data. | Provides consistent, global coverage for critical user paths and APIs, regardless of live traffic volume. |
| Alerting Nature | Reactive and historical. Alerts on degradations that have already affected users. | Proactive and predictive. Alerts on failures before significant user impact occurs. |
| Implementation Complexity | High. Requires instrumentation across the client-side application and careful data sampling. | Moderate. Relies on external probe scripts or internal synthetic agents with defined test scenarios. |

REAL USER MONITORING (RUM)

Frequently Asked Questions

Real User Monitoring (RUM) is a critical technique in Production Canary Analysis for evaluating AI systems. It provides the ground-truth data on how new models perform for actual users, enabling data-driven deployment decisions.

Real User Monitoring (RUM) is a performance monitoring technique that collects and analyzes metrics from actual user interactions with a live application to understand real-world experience. A lightweight JavaScript agent embedded in the web page, or an SDK bundled with the mobile app, passively captures performance data as users navigate and interact. The agent records key metrics such as page load time, Interaction to Next Paint (INP, which replaced First Input Delay as a Core Web Vital in March 2024), Cumulative Layout Shift (CLS), and JavaScript errors, then sends this telemetry to a backend analytics platform for aggregation and visualization. Unlike synthetic monitoring, which uses simulated traffic, RUM reflects the performance that real users experience across different devices, networks, and geographies, although in practice the data is often sampled rather than collected from every session.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on turning complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.