Glossary

Resiliency Score

A Resiliency Score is a composite metric, derived from SLIs like Self-Correction and Fallback Success Rates, that quantifies an autonomous agent's ability to maintain functionality in the face of errors.

AGENTIC SLI/SLO DEFINITION

What is a Resiliency Score?

A Resiliency Score is a composite metric that quantifies an autonomous agent's ability to maintain functionality in the face of errors or external system failures.

As a composite Service Level Indicator (SLI), it combines underlying metrics such as Self-Correction Success Rate and Fallback Success Rate into a single, normalized value representing an agent's overall robustness. This gives engineering leaders an at-a-glance measure of system stability for operational dashboards and executive reporting, abstracting individual failure modes into one health indicator.

The score matters for Agentic SLO (Service Level Objective) definition because a target set on it (for example, Resiliency Score > 0.95) determines the error budget that reliability engineering works within. Tracking the Resiliency Score over time lets teams quantify degradation, trigger alerting rules for proactive intervention, and measure the impact of each deployment on system stability, which helps confirm that autonomous agents meet enterprise reliability requirements in production.
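As a rough illustration of how a score target can translate into an error budget, the sketch below treats the budget as the allowed shortfall from a perfect score (1 minus the target). The function names, the 0.8 alert threshold, and this way of defining budget consumption are assumptions made for the example, not a standard formula.

```python
def error_budget_consumed(observed_score: float, slo_target: float) -> float:
    """Fraction of the error budget used; 1.0 or more means the budget is exhausted.

    Assumption: both values are on a 0-1 scale and the budget is (1 - slo_target).
    """
    if not 0.0 <= observed_score <= 1.0:
        raise ValueError("observed_score must be in [0, 1]")
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be in [0, 1)")
    budget = 1.0 - slo_target          # allowed shortfall from a perfect score
    shortfall = 1.0 - observed_score   # actual shortfall in this window
    return shortfall / budget


def should_alert(observed_score: float, slo_target: float,
                 burn_threshold: float = 0.8) -> bool:
    """Alert once a hypothetical share of the budget (default 80%) is used."""
    return error_budget_consumed(observed_score, slo_target) >= burn_threshold
```

With a target of 0.95, an observed score of 0.97 uses roughly 60% of the budget and stays quiet, while 0.94 exceeds the budget and alerts.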

COMPOSITE METRIC

Key Components of a Resiliency Score

A Resiliency Score is not a single measurement but a composite metric derived from multiple, interdependent Service Level Indicators (SLIs). This score quantifies an autonomous agent's ability to withstand and recover from failures.

Self-Correction Success Rate

This core SLI measures the percentage of times an agent successfully identifies and remediates its own errors through recursive loops without human intervention. A high rate indicates robust internal error handling.

Mechanism: Tracks outcomes of reflection and replanning cycles.
Impact on Score: Directly increases the Resiliency Score, as it reflects autonomous fault tolerance.
Example: An agent failing a tool call, diagnosing the error (e.g., malformed parameters), and successfully retrying with a corrected payload.
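A minimal sketch of how this rate could be computed from logged reflection and replanning cycles follows. The `CorrectionAttempt` record and its fields are hypothetical; a real agent framework will log these outcomes in its own shape.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CorrectionAttempt:
    """Hypothetical log record for one self-correction cycle."""
    task_id: str
    recovered: bool          # the agent ended up with a successful result
    human_intervened: bool   # a person had to step in along the way


def self_correction_success_rate(
    attempts: Sequence[CorrectionAttempt],
) -> Optional[float]:
    """Share of correction cycles that recovered without human help.

    Returns None when no correction was attempted, so an idle window
    is not mistaken for a perfect or a failing one.
    """
    if not attempts:
        return None
    successes = sum(
        1 for a in attempts if a.recovered and not a.human_intervened
    )
    return successes / len(attempts)
```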

Fallback Success Rate

This SLI measures the effectiveness of contingency logic, calculating the percentage of times an agent successfully executes an alternative path when its primary method fails. It validates the robustness of failover designs.

Mechanism: Monitors switches to predefined backup tools, models, or workflows.
Impact on Score: A high rate significantly bolsters the score, demonstrating graceful degradation.
Example: An LLM-based agent switching from a primary high-latency model to a faster, less capable model when SLOs for response time are at risk.
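One way to measure this is to wrap the primary and backup paths in a small tracker that counts how often a fallback was needed and how often it worked. The `FallbackTracker` class below is an illustrative assumption, not part of any particular agent framework.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class FallbackTracker:
    """Runs a primary callable, falls back on failure, and records outcomes."""

    def __init__(self) -> None:
        self.activations = 0  # times the primary path failed
        self.successes = 0    # times some fallback then succeeded

    def run(self, primary: Callable[[], T],
            fallbacks: Sequence[Callable[[], T]]) -> T:
        try:
            return primary()
        except Exception:
            self.activations += 1
        # Try each backup in order; the first one that works wins.
        for fallback in fallbacks:
            try:
                result = fallback()
            except Exception:
                continue
            self.successes += 1
            return result
        raise RuntimeError("primary path and all fallbacks failed")

    def success_rate(self) -> Optional[float]:
        """Fallback Success Rate, or None if no fallback was ever needed."""
        if self.activations == 0:
            return None
        return self.successes / self.activations
```

Counting one activation per failed primary call (rather than one per backup tried) keeps the rate aligned with the definition above: how often the alternative path as a whole rescued the task.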

Retry Success Rate

This SLI quantifies the effectiveness of automatic retry logic for transient failures, calculated as the percentage of retried operations that ultimately succeed. A high rate suggests the underlying faults were transient; a low rate points to persistent faults that retries alone cannot fix.

Mechanism: Evaluates success after a configured number of retries with optional backoff or parameter adjustment.
Impact on Score: A moderate positive impact; high success indicates resilience to flaky external dependencies.
Example: An agent retrying a failed API call to a payment gateway, succeeding on the second attempt after a network timeout.
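The retry mechanism and its metric can be sketched as follows, assuming exponential backoff and a hypothetical `TransientError` type that marks failures worth retrying. The helper names and the `(succeeded, attempts)` outcome format are choices made for this example.

```python
import time
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. a network timeout)."""


def call_with_retry(operation: Callable[[], T], max_attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[T, int]:
    """Run operation, retrying on TransientError; return (result, attempts used)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except TransientError:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise AssertionError("unreachable")


def retry_success_rate(outcomes: Sequence[Tuple[bool, int]]) -> Optional[float]:
    """Share of retried operations (attempts > 1) that ultimately succeeded."""
    retried = [ok for ok, attempts in outcomes if attempts > 1]
    if not retried:
        return None
    return sum(retried) / len(retried)
```

Operations that succeed on the first attempt are excluded from the rate, since no retry took place.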

Guardrail Compliance Rate

This safety-focused SLI measures the percentage of an agent's actions and outputs that adhere to predefined operational, safety, and ethical policy constraints. Resiliency includes maintaining safe operation under stress.

Mechanism: Checks outputs and planned actions against a rule engine or safety classifier.
Impact on Score: Non-compliance can severely degrade or nullify the score, as unsafe operation is a critical failure mode.
Example: An agent in a financial context rejecting a user request that would violate compliance rules, even if technically executable.
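One possible reading of "nullify" is a hard gate: if compliance falls below a floor, the score is forced to zero regardless of the other SLIs. The sketch below illustrates that idea; the function names and the 0.99 floor are assumptions, and a softer penalty would be an equally valid design.

```python
def guardrail_compliance_rate(checked_actions: int, violations: int) -> float:
    """Share of checked actions that passed every policy check."""
    if checked_actions <= 0:
        raise ValueError("checked_actions must be positive")
    if not 0 <= violations <= checked_actions:
        raise ValueError("violations must be between 0 and checked_actions")
    return (checked_actions - violations) / checked_actions


def gated_score(base_score: float, compliance_rate: float,
                compliance_floor: float = 0.99) -> float:
    """Return base_score, or 0.0 when compliance is below the floor.

    The floor is a hypothetical policy value; each organization would set its own.
    """
    if compliance_rate < compliance_floor:
        return 0.0
    return base_score
```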

Health Check Success Rate

This availability SLI measures the percentage of periodic diagnostic probes (liveness and readiness checks) that pass, indicating the agent's operational stability and preparedness to accept tasks.

Mechanism: Runs synthetic transactions or internal state checks at regular intervals.
Impact on Score: A foundational component; consistently failing health checks indicates systemic instability, lowering the overall score.
Example: A readiness check verifying that an agent's memory vector store is connected and its core planning model is responsive.
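A simple way to picture this is a probe cycle that passes only when every individual check passes, with the rate computed over recent cycles. The probe callables here are placeholders; in practice they would query the vector store, the planning model, and similar dependencies.

```python
from typing import Callable, Optional, Sequence


def run_probe_cycle(probes: Sequence[Callable[[], bool]]) -> bool:
    """Return True only if every probe reports healthy.

    A probe that raises is treated as a failed check rather than
    crashing the health checker itself.
    """
    for probe in probes:
        try:
            if not probe():
                return False
        except Exception:
            return False
    return True


def health_check_success_rate(cycle_results: Sequence[bool]) -> Optional[float]:
    """Share of probe cycles that passed, or None if none have run yet."""
    if not cycle_results:
        return None
    return sum(cycle_results) / len(cycle_results)
```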

Weighting and Normalization

The final Resiliency Score is calculated by normalizing each constituent SLI, applying a defined weight to it, and combining the results (for example, as a weighted average). The choice of weights reflects organizational priorities.

Normalization: Individual SLI values (often percentages) are scaled to a common range (e.g., 0-1).
Weighting: Critical SLIs like Guardrail Compliance Rate carry more weight than others like Retry Success Rate.
Output: A single numerical score (e.g., 0-100 or 0-1) providing an at-a-glance assessment of agent robustness.
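The normalization and weighting steps above can be sketched as a weighted average. The SLI names and weight values used in the usage note are illustrative assumptions; the source does not prescribe specific weights.

```python
from typing import Mapping


def resiliency_score(sli_percentages: Mapping[str, float],
                     weights: Mapping[str, float]) -> float:
    """Weighted average of SLIs given as percentages; result is on a 0-1 scale."""
    if set(sli_percentages) != set(weights):
        raise ValueError("every SLI needs exactly one weight")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")

    weighted_sum = 0.0
    for name, pct in sli_percentages.items():
        if not 0.0 <= pct <= 100.0:
            raise ValueError(f"{name} must be a percentage in [0, 100]")
        weighted_sum += weights[name] * (pct / 100.0)  # normalize to 0-1
    # Dividing by the total lets callers use weights that do not sum to 1.
    return weighted_sum / total_weight
```

For example, with hypothetical SLIs of 90 (self-correction), 80 (fallback), 70 (retry), 100 (guardrail), and 95 (health check), and weights of 0.2, 0.2, 0.1, 0.3, and 0.2 respectively, the score comes to about 0.90. Multiply by 100 if a 0-100 scale is preferred.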

AGENTIC SLO/SLI COMPARISON

Resiliency Score vs. Other Agent Metrics

This table compares the Resiliency Score, a composite metric for agent robustness, against other key Agentic Service Level Indicators (SLIs) and operational metrics.

Metric / Feature	Resiliency Score	Core Agentic SLIs (e.g., Planning Success Rate)	Operational & Business Metrics (e.g., Cost, Throughput)
Primary Purpose	Quantifies overall agent robustness and fault tolerance	Measures a specific, atomic aspect of agent performance	Tracks system efficiency, cost, or business impact
Calculation Method	Composite formula (e.g., weighted average of SLIs like Self-Correction & Fallback Success Rates)	Direct measurement of a single event type (e.g., successful plans / total plans)	Direct measurement of resource use or output volume (e.g., total cost / tasks)
Granularity	High-level, summary score	Fine-grained, component-level	System or business-level
Predictive Value for Failures	High: Designed to forecast stability and need for intervention	Variable: May indicate specific failure modes	Low: Reflects outcomes, not root causes
Use in SLO Definition	Often the SLO target itself (e.g., Resiliency Score > 0.95)	Used as raw SLIs to build composite scores or SLOs	Used for budgeting and capacity planning, not typically as SLOs
Example Components	Self-Correction Success Rate, Fallback Success Rate, Retry Success Rate	Planning Success Rate, Action Success Ratio, Task Completion Rate	Cost Per Successful Task, Throughput (tasks/sec), Token Usage
Alerting Priority	High: A drop indicates systemic resilience issues	Medium: Triggers investigation into specific agent modules	Variable: Cost spikes may trigger alerts; throughput is monitored
Trend Analysis Value	Critical: Trends show improving or degrading system resilience over time	Important: Identifies regressions in specific capabilities	Essential: For capacity and financial forecasting

Frequently Asked Questions

A Resiliency Score is a composite metric central to Agentic Observability, quantifying an autonomous system's ability to withstand and recover from failures. These questions address its definition, calculation, and role in production assurance.

A Resiliency Score is a composite Service Level Indicator (SLI) that quantifies an autonomous agent's ability to maintain intended functionality and successfully complete tasks in the face of errors, external system failures, or unexpected conditions.

It is not a single raw measurement but a calculated value, often on a scale of 0-100 or 0-1, derived from combining multiple underlying Agentic SLIs that reflect recovery and robustness mechanisms. Key inputs typically include Self-Correction Success Rate, Fallback Success Rate, and Retry Success Rate. A high score indicates a system that can autonomously navigate failures, while a low score signals fragility and a high likelihood of requiring human intervention.

Related Terms

A Resiliency Score is a composite metric derived from multiple underlying Service Level Indicators (SLIs). Understanding these component SLIs and related operational concepts is essential for defining and monitoring robust autonomous systems.