Inferensys

Glossary

Critical User Journey (CUJ)

A Critical User Journey (CUJ) is a specific, high-value sequence of user interactions with a service that is essential to the user's success and forms the basis for defining user-centric Service Level Objectives (SLOs).
SLO/SLI DEFINITION FOR AI

What is a Critical User Journey (CUJ)?

A Critical User Journey (CUJ) is a defined, end-to-end sequence of user interactions that is essential to the core value proposition of a service. It represents a complete, high-value task from the user's perspective, such as 'successfully completing a purchase' or 'receiving a correct answer from a chatbot.' Unlike isolated metrics, a CUJ provides the holistic context needed to establish meaningful Service Level Objectives (SLOs) that directly reflect user satisfaction and business outcomes. This user-centric framing shifts reliability engineering from component-level monitoring to outcome-based guarantees.

In AI systems, defining CUJs is paramount for Evaluation-Driven Development. For a Retrieval-Augmented Generation (RAG) service, a CUJ could be 'user submits a query and receives a factually grounded answer.' This journey encompasses multiple technical steps—query understanding, retrieval, generation, and formatting—each with its own Service Level Indicator (SLI) like retrieval latency or answer faithfulness. By instrumenting and setting SLOs for the entire CUJ, teams ensure the system's technical performance is aligned with delivering reliable user value, enabling precise error budget management and prioritization.
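
As a minimal sketch of what instrumenting the whole journey could look like, the snippet below times each step of the RAG journey and records a single journey-level outcome. The `JourneyTrace` class and the four step functions are hypothetical stand-ins invented for illustration; a real service would call its own retriever and model and would likely export the timings to a tracing or metrics backend.

```python
import time
from dataclasses import dataclass, field


@dataclass
class JourneyTrace:
    """Per-step timings and an overall outcome for one CUJ attempt."""
    name: str
    step_seconds: dict = field(default_factory=dict)
    succeeded: bool = True

    def run_step(self, step_name, fn, *args):
        """Run one step, record its duration, and mark the journey failed on error."""
        start = time.perf_counter()
        try:
            return fn(*args)
        except Exception:
            self.succeeded = False
            raise
        finally:
            self.step_seconds[step_name] = time.perf_counter() - start

    @property
    def total_seconds(self):
        return sum(self.step_seconds.values())


# Hypothetical stand-ins for the real pipeline stages.
def understand_query(raw_query):
    return raw_query.strip().lower()


def retrieve(query):
    return ["doc about " + query]


def generate(query, docs):
    return f"Answer to '{query}' based on {len(docs)} document(s)."


def format_answer(draft):
    return draft.strip()


def answer_query(raw_query):
    """Execute the full journey and return the answer with its trace."""
    trace = JourneyTrace("rag_grounded_answer")
    query = trace.run_step("query_understanding", understand_query, raw_query)
    docs = trace.run_step("retrieval", retrieve, query)
    draft = trace.run_step("generation", generate, query, docs)
    answer = trace.run_step("formatting", format_answer, draft)
    return answer, trace
```

Calling `answer_query("What is a CUJ?")` returns the answer together with a trace whose `step_seconds` supplies the per-step SLIs, while `total_seconds` and `succeeded` describe the journey as the user experienced it.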

EVALUATION-DRIVEN DEVELOPMENT

Key Characteristics of a Critical User Journey

A Critical User Journey (CUJ) is a specific, high-value sequence of user interactions essential to user success, forming the basis for user-centric Service Level Objectives (SLOs). These journeys are defined by several core characteristics that distinguish them from general user flows.

01

User-Centric & Business-Critical

A CUJ is defined from the user's perspective, not the system's architecture. It represents a complete, end-to-end task that delivers core value. For an e-commerce site, this is "search for product, add to cart, complete checkout." For an AI chatbot, it's "ask a complex question, receive a grounded, accurate answer." The journey's success is directly tied to key business outcomes like revenue, conversion, or user retention. Defining SLOs on CUJs ensures engineering efforts protect what matters most to users and the business.

02

Measurable with SLIs

Every step in a CUJ must be quantifiable using Service Level Indicators (SLIs). These are the raw metrics that indicate health and performance. For AI services, relevant SLIs include:

  • Model Inference Latency: Total time for request-to-response.
  • Time To First Token (TTFT): Responsiveness for streaming outputs.
  • Error Rate: Percentage of failed requests (e.g., 5xx errors, model crashes).
  • Task Success Rate: For agents, the percentage of journeys completed without intervention.
  • Answer Faithfulness Score: For RAG, the proportion of the answer supported by source context.

These SLIs provide the data to evaluate whether the CUJ is meeting its Service Level Objective (SLO).

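
To make these indicators concrete, here is a rough sketch that derives a few of them from a batch of journey records. The `JourneyRecord` fields, the nearest-rank percentile method, and the sample values are all assumptions chosen for illustration, not a prescribed schema.

```python
import math
from dataclasses import dataclass


@dataclass
class JourneyRecord:
    latency_s: float            # total request-to-response time
    ttft_s: float               # time to first streamed token
    failed: bool                # e.g. 5xx error or model crash
    needed_intervention: bool   # a human had to step in


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def compute_slis(records):
    """Aggregate raw journey records into journey-level SLIs."""
    if not records:
        raise ValueError("need at least one journey record")
    n = len(records)
    completed = sum(
        1 for r in records if not r.failed and not r.needed_intervention
    )
    return {
        "p95_latency_s": percentile([r.latency_s for r in records], 95),
        "p95_ttft_s": percentile([r.ttft_s for r in records], 95),
        "error_rate": sum(r.failed for r in records) / n,
        "task_success_rate": completed / n,
    }


# Illustrative sample: ten journeys, one failure, one human intervention.
LATENCIES = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 2.5, 4.0]
SAMPLE = [
    JourneyRecord(
        latency_s=lat,
        ttft_s=lat / 4,
        failed=(i == 9),
        needed_intervention=(i == 8),
    )
    for i, lat in enumerate(LATENCIES)
]
slis = compute_slis(SAMPLE)
```

For this sample, `slis["error_rate"]` is 0.1 and `slis["task_success_rate"]` is 0.8, since one journey failed outright and another needed intervention.
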
03

Defines SLOs & Error Budgets

The CUJ's SLIs are used to set a Service Level Objective (SLO), a target level of reliability over a time window (e.g., "99.9% of checkout journeys complete with every page loading in under 2 seconds, measured over 30 days"). The difference between 100% and the SLO is the Error Budget, the allowable unreliability. This budget becomes a central management tool:

  • Burn Rate: Measures how quickly the budget is consumed, relative to the rate that would exactly exhaust it by the end of the window.
  • Release Gating: New features can be deployed only while enough budget remains to absorb the risk they carry.
  • Priority Setting: Focuses engineering effort on fixes that protect the CUJ.

This creates a feedback loop where user experience directly governs engineering priorities.

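
The arithmetic behind the error budget, burn rate, and a release gate can be sketched in a few lines. The function names and the 20% minimum-remaining-budget gate are illustrative assumptions; teams choose their own gating policy.

```python
def error_budget(slo_target):
    """Allowed fraction of bad events, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad_events, total_events, slo_target):
    """Observed error ratio divided by the budget.

    A value of 1.0 means the budget would be spent exactly at the end of
    the SLO window; higher values exhaust it sooner.
    """
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / error_budget(slo_target)


def budget_remaining(bad_events_so_far, expected_events_in_window, slo_target):
    """Fraction of the window's error budget still unspent (may go negative)."""
    allowed_bad = expected_events_in_window * error_budget(slo_target)
    return 1.0 - bad_events_so_far / allowed_bad


def release_allowed(remaining_fraction, min_remaining=0.2):
    """Hypothetical gate: ship only while at least 20% of the budget is left."""
    return remaining_fraction >= min_remaining
```

With a 99.9% SLO and an expected 1,000,000 checkout journeys in the window, about 1,000 bad journeys are allowed. After 400 bad journeys roughly 60% of the budget remains, so `release_allowed` passes; a stretch with 50 bad journeys out of 10,000 is burning budget at about 5 times the sustainable rate.
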
04

Composed of Dependencies

A single CUJ typically depends on multiple backend services, databases, and third-party APIs. For an AI-powered journey like "get a summarized answer from a document," dependencies include:

  • Vector Database for semantic retrieval.
  • Inference Endpoint for the LLM.
  • Authentication Service.
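  • External Data APIs.

The overall CUJ SLO is a Composite SLO derived from the individual SLOs of its components. When every dependency must succeed for the journey to succeed, the journey can be no more available than its least available dependency, and if failures are independent its availability is roughly the product of theirs. Composition also exposes systemic fragility through Tail Latency Amplification: the slowest dependency on the critical path (e.g., a p99 database query) sets a floor on the user's tail latency, and the more dependencies a journey touches, the more likely any given request is to hit at least one of them in its slow tail.
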
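
A small worked example may help show why a journey is usually less reliable than any one of its dependencies. The sketch assumes that every dependency is required and that failures and slow responses are statistically independent, which real systems only approximate; the function names are illustrative.

```python
import math


def serial_availability(dependency_availabilities):
    """Availability of a journey that needs every dependency to succeed,
    assuming independent failures."""
    return math.prod(dependency_availabilities)


def prob_any_dependency_slow(n_dependencies, percentile=0.99):
    """Chance that a request touching n dependencies sees at least one
    response slower than that dependency's own percentile latency,
    assuming independent latencies."""
    return 1.0 - percentile ** n_dependencies


# Four dependencies (vector DB, LLM endpoint, auth, external API), each at 99.9%.
journey_availability = serial_availability([0.999] * 4)

# Share of journeys that hit at least one dependency's p99 tail.
slow_journey_fraction = prob_any_dependency_slow(4)
```

Under these assumptions four dependencies at 99.9% each give the journey roughly 99.6% availability, and about 3.9% of journeys, not 1%, encounter at least one p99-slow dependency.
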
05

Enables Proactive Observability

Instrumenting CUJs transforms monitoring from reactive to proactive. Instead of watching server CPU, teams monitor the Golden Signals (latency, traffic, errors, saturation) for the journey itself. This enables:

  • Multi-Window Alerting: Triggering alerts only when the SLO burn rate is elevated over both a short window (e.g., 1 hour) and a longer one (e.g., 6 hours or more), measured against the 30-day SLO period, to distinguish blips from sustained degradation.
  • Canary Analysis: Deploying new models or code to a fraction of CUJ traffic to validate SLO compliance before full rollout.
  • Graceful Degradation: Designing fallback mechanisms (e.g., returning a cached answer) to protect the CUJ's core SLO when a non-critical dependency fails.

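
Multi-window alerting can be sketched as a check that fires only when the burn rate is high in two windows at once. The function names are hypothetical, and the example threshold of 14.4 is one commonly cited choice (it corresponds to spending 2% of a 30-day budget within one hour) rather than a required value.

```python
def burn_rate(error_ratio, slo_target):
    """Observed error ratio expressed as a multiple of the error budget."""
    return error_ratio / (1.0 - slo_target)


def burn_rate_threshold(budget_fraction, slo_period_hours, alert_window_hours):
    """Burn rate at which `budget_fraction` of the whole budget is spent
    within one alert window."""
    return budget_fraction * slo_period_hours / alert_window_hours


def should_alert(long_window_error_ratio, short_window_error_ratio,
                 slo_target, threshold):
    """Fire only if both windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert resets soon after recovery.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


# 2% of a 30-day budget consumed within a 1-hour window.
THRESHOLD = burn_rate_threshold(0.02, slo_period_hours=30 * 24, alert_window_hours=1)
```

With a 99.9% SLO, a sustained 2% error ratio in both windows fires the alert, a brief spike confined to the short window does not, and neither does an incident that has already subsided in the short window.
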
06

AI-Specific Quality Dimensions

For AI services, CUJs must account for non-traditional quality metrics beyond latency and errors. These require specialized SLIs and SLOs:

  • Hallucination Rate: Percentage of generations that contain factually incorrect or unsupported claims, with the SLO setting the maximum acceptable level.
  • Retrieval Precision@K: Relevance of top documents fetched for a RAG system.
  • Instruction Following Accuracy: Adherence to prompt constraints.
  • Data Drift Detection: Monitoring for statistical shifts in input data that degrade model performance.
  • Cost per Inference: Balancing quality with infrastructure expenditure via a Cost Efficiency SLO.

Defining CUJs forces explicit quantification of these probabilistic quality aspects, moving AI from art to engineered service.

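
Two of these quality indicators reduce to simple ratios once labeled data exists. The sketch below assumes relevance labels and per-answer hallucination verdicts are already available from human review or an automated judge; producing those labels is the hard part and is outside its scope.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant.

    Divides by k even when fewer than k documents were returned, so
    short result lists are penalized.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


def hallucination_rate(is_hallucinated):
    """Fraction of generations judged to contain unsupported claims.

    `is_hallucinated` holds one boolean verdict per generation.
    """
    if not is_hallucinated:
        raise ValueError("need at least one judged generation")
    return sum(is_hallucinated) / len(is_hallucinated)


# Illustrative inputs: two of the top three documents are relevant,
# and one of four judged answers was flagged.
p_at_3 = precision_at_k(["a", "b", "c", "d"], {"a", "c", "x"}, k=3)
h_rate = hallucination_rate([False, False, True, False])
```

Here `p_at_3` is two thirds and `h_rate` is 0.25; an SLO would then state, for example, the minimum precision or maximum hallucination rate tolerated over the measurement window.
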
GLOSSARY

How to Define a Critical User Journey for AI Services

A precise definition of the Critical User Journey (CUJ), a foundational concept for establishing user-centric reliability targets in AI-powered systems.

A Critical User Journey (CUJ) is a specific, high-value sequence of user interactions with a service that is essential to the user's success and forms the basis for defining user-centric Service Level Objectives (SLOs). Unlike generic system metrics, a CUJ maps a complete end-to-end workflow—such as a customer using a Retrieval-Augmented Generation (RAG) chatbot to find a precise answer in a knowledge base—allowing engineers to measure what matters most to the business outcome.

Defining a CUJ requires identifying the key Service Level Indicators (SLIs)—like Time To First Token (TTFT), answer faithfulness, and retrieval precision—that directly impact the user's ability to complete that journey successfully. This user-first approach ensures SLOs and error budgets protect tangible experience, not just backend uptime, and is fundamental to Evaluation-Driven Development for AI services.
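
One way to write such a definition down is as a small declarative record that names the journey, its steps, and the SLI targets that decide success. Everything in this sketch is hypothetical: the class names, the step names, and especially the threshold values, which are placeholders rather than recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliTarget:
    name: str
    threshold: float
    higher_is_better: bool

    def is_met(self, observed):
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold


@dataclass(frozen=True)
class CujDefinition:
    name: str
    description: str
    steps: tuple
    sli_targets: tuple

    def unmet_targets(self, observed):
        """Names of targets that are missed or have no observed value."""
        return [
            target.name
            for target in self.sli_targets
            if target.name not in observed
            or not target.is_met(observed[target.name])
        ]


# Placeholder thresholds for the RAG chatbot journey described above.
RAG_ANSWER_CUJ = CujDefinition(
    name="rag_knowledge_base_answer",
    description="Customer asks the chatbot and gets a precise, grounded answer",
    steps=("query_submission", "context_retrieval", "model_inference",
           "response_streaming"),
    sli_targets=(
        SliTarget("p95_ttft_s", 1.5, higher_is_better=False),
        SliTarget("answer_faithfulness", 0.9, higher_is_better=True),
        SliTarget("retrieval_precision_at_5", 0.8, higher_is_better=True),
    ),
)
```

Treating a missing measurement as an unmet target is a deliberate choice in this sketch: a journey that is not instrumented cannot be shown to meet its objective.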

COMPARISON

CUJ-Based SLIs vs. Traditional Infrastructure SLIs

This table contrasts the fundamental characteristics of Service Level Indicators (SLIs) defined around Critical User Journeys (CUJs) with those derived from traditional infrastructure monitoring.

| Characteristic | CUJ-Based SLIs | Traditional Infrastructure SLIs |
| --- | --- | --- |
| Definitional Focus | End-to-end user experience and business outcome for a specific, high-value interaction sequence. | Health and performance of individual technical components or resources (e.g., servers, databases, APIs). |
| Primary Audience | Product managers, business stakeholders, SREs focused on user-centric reliability. | Infrastructure engineers, DevOps, SREs focused on system stability. |
| Example Metrics | Task success rate, end-to-end latency for a CUJ, answer faithfulness (for AI), checkout conversion rate. | CPU utilization, disk I/O, network packet loss, API endpoint error rate, container restart count. |
| Alignment with Business Value | | |
| Directly Measures User Impact | | |
| Reveals Composite System Issues | | |
| Granular Root Cause Isolation | | |
| Primary Use Case | Defining user-centric SLOs, prioritizing engineering work based on user pain, measuring product reliability. | Infrastructure capacity planning, component-level debugging, ensuring resource availability. |
| Alerting Fidelity for User Pain | High: Alerts correlate directly with degraded user experience. | Low: Alerts may not correspond to any user-visible issue (e.g., high CPU on a non-critical service). |
| Complexity of Implementation | Higher: Requires instrumenting multi-step workflows and synthesizing data from multiple systems. | Lower: Often available out-of-the-box from infrastructure monitoring tools. |

CRITICAL USER JOURNEY (CUJ)

Frequently Asked Questions

A Critical User Journey (CUJ) is a specific, high-value sequence of user interactions that is essential to user success and forms the basis for user-centric Service Level Objectives (SLOs). These FAQs address its definition, implementation, and relationship to AI service reliability.

A Critical User Journey (CUJ) is a specific, end-to-end sequence of user interactions with a service that is essential to achieving a high-value outcome for the user. It is defined by identifying the most important workflows from the user's perspective, such as 'a customer successfully completing a purchase' or 'a developer getting an accurate API response within 200ms.' Defining a CUJ involves mapping the precise steps, system dependencies, and success criteria for that journey. For AI services, this often includes steps like query submission, context retrieval, model inference, and response streaming. The CUJ becomes the foundational unit for measuring user-centric reliability, as opposed to monitoring isolated system components.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on turning complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.