A completeness score is a metric that evaluates whether an explanation accounts for all features or factors that contributed significantly to a model's prediction. It is a core component of post-hoc explanation validation, measuring the sufficiency of an explanation's selected features. The score quantifies how much of the model's output variation is captured by the subset of features highlighted as important, ensuring the explanation is not missing critical, influential factors. High completeness indicates a faithful summary of the model's reasoning for that specific instance.
Completeness Score

What is a Completeness Score?
A quantitative metric for evaluating the sufficiency of post-hoc model explanations.
The score is calculated by comparing the model's original prediction to its output when only the features identified by the explanation are retained. Common techniques involve perturbation analysis, where non-important features are masked or replaced with baseline values. A perfect completeness score of 1.0 signifies the explanation's feature set alone is sufficient for the model to replicate its original prediction. It is often evaluated alongside the faithfulness score and stability score to provide a holistic assessment of explanation quality in Explainability Score Validation frameworks.
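The retention-based calculation described above can be sketched in a few lines. This is a minimal illustration, not a standard API: the function name `retention_completeness`, the `toy_model`, and the normalization by the gap between the original and all-baseline predictions are all assumptions made for the example.

```python
from typing import Callable, Sequence, Set

Model = Callable[[Sequence[float]], float]


def retention_completeness(model: Model, x: Sequence[float],
                           baseline: Sequence[float], kept: Set[int]) -> float:
    """Score how well the kept features alone reproduce model(x).

    Features outside `kept` are replaced by their baseline value.
    The normalization (an assumption of this sketch) divides the residual
    by the gap between the original and the all-baseline prediction.
    """
    masked = [x[i] if i in kept else baseline[i] for i in range(len(x))]
    full_gap = abs(model(x) - model(baseline))
    if full_gap == 0:
        raise ValueError("model(x) equals model(baseline); score is undefined")
    residual = abs(model(x) - model(masked))
    # Clip at 0 so a masked prediction further away than the baseline is not negative.
    return max(0.0, 1.0 - residual / full_gap)


def toy_model(v: Sequence[float]) -> float:
    # Hypothetical model: the third feature has no influence.
    return 3.0 * v[0] + 1.0 * v[1] + 0.0 * v[2]


x = [1.0, 2.0, 5.0]
baseline = [0.0, 0.0, 0.0]
print(retention_completeness(toy_model, x, baseline, {0, 1}))  # both drivers kept
print(retention_completeness(toy_model, x, baseline, {0}))     # one driver missing
```

Keeping features 0 and 1 reproduces the prediction exactly, so the score is 1.0; keeping only feature 0 leaves 2 of the 5 units of output unexplained and the score drops to 0.6.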
Key Characteristics of the Completeness Score
The completeness score is a core metric in explainable AI (XAI) that quantifies whether a post-hoc explanation for a model's prediction accounts for all significant contributing factors. It is a critical component of rigorous explanation validation.
Definition and Core Purpose
A completeness score is a quantitative metric that evaluates the degree to which a feature attribution or explanation accounts for the total prediction made by a model. Its core purpose is to answer: 'Does this explanation capture all the important reasons for this prediction?' It is calculated by comparing the sum of the importance scores assigned by the explanation to the actual model output deviation from a baseline. A low score indicates the explanation is missing key predictive factors.
Mathematical Foundation
The score is often grounded in the efficiency axiom from cooperative game theory, as applied by methods like SHAP. Formally, for a model f and explanation φ, completeness requires: f(x) - f(x') = Σ φ_i, where x is the input instance, x' is a baseline input (e.g., all features masked), and φ_i is the attribution for feature i. The completeness score C can be expressed as C = 1 - |(f(x) - f(x')) - Σ φ_i| / |f(x) - f(x')|. A perfect score of 1.0 indicates the attributions sum exactly to the model's output difference.
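As a minimal sketch, the formula for C translates directly into code. The function name `completeness_score` and the example numbers are illustrative assumptions; the attributions are taken as already computed by whatever explanation method is in use. Note that the expression is undefined when f(x) equals f(x'), and it can go negative when the attributions overshoot the output difference by more than its magnitude.

```python
from typing import Sequence


def completeness_score(fx: float, f_baseline: float,
                       attributions: Sequence[float]) -> float:
    """C = 1 - |(f(x) - f(x')) - sum(phi_i)| / |f(x) - f(x')|."""
    delta = fx - f_baseline
    if delta == 0:
        raise ValueError("f(x) equals f(x'); the score is undefined")
    gap = abs(delta - sum(attributions))
    return 1.0 - gap / abs(delta)


# Hypothetical numbers: the model output moves from 4.0 (baseline) to 10.0.
print(completeness_score(10.0, 4.0, [3.0, 2.0, 1.0]))  # attributions sum to 6.0
print(completeness_score(10.0, 4.0, [3.0, 2.0]))       # 1.0 unit unaccounted for
```

The first call returns 1.0 because the attributions sum exactly to the output difference; the second returns about 0.83 because one sixth of the difference is not attributed to any feature.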
Relationship to Faithfulness and Sufficiency
Completeness is one of three related pillars of explanation quality, alongside faithfulness and sufficiency.
- Faithfulness: Measures if the explanation's ranked importance correlates with the actual impact of features on the model's output. An explanation can be faithful but incomplete.
- Sufficiency: Measures if the top-K features identified by an explanation are, by themselves, sufficient for the model to make the same prediction. Sufficiency is a related but different property.
- Completeness: Ensures the magnitude of all attributions accounts for the total model output. A complete explanation is necessary for full accountability.
Validation via Perturbation Analysis
A primary method for empirically validating completeness is systematic perturbation analysis. This involves:
- Occlusion Tests: Iteratively removing or masking features deemed important by the explanation and observing the prediction change.
- Expected Change Calculation: The sum of prediction changes from occluding each feature should approximate the total deviation of the prediction from its baseline.
- Gap Measurement: The discrepancy between the expected total change (from the explanation) and the actual total model output change is the infidelity, which is inversely related to completeness. A high infidelity score indicates low completeness.
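The three steps above can be sketched as follows. This is an illustrative example under stated assumptions: `occlusion_deltas`, `completeness_gap`, and `linear_model` are hypothetical names, and occlusion is implemented by swapping a single feature for its baseline value.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]


def occlusion_deltas(model: Model, x: Sequence[float],
                     baseline: Sequence[float]) -> List[float]:
    """Prediction change caused by occluding each feature on its own."""
    fx = model(x)
    deltas = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline[i]  # mask one feature at a time
        deltas.append(fx - model(occluded))
    return deltas


def completeness_gap(model: Model, x: Sequence[float],
                     baseline: Sequence[float],
                     attributions: Sequence[float]) -> float:
    """Absolute gap between the attributed total and the actual output change."""
    actual_total = model(x) - model(baseline)
    return abs(sum(attributions) - actual_total)


def linear_model(v: Sequence[float]) -> float:
    # Hypothetical additive model, so occlusion effects add up exactly.
    return 2.0 * v[0] - 1.0 * v[1] + 0.5 * v[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
deltas = occlusion_deltas(linear_model, x, baseline)
print(deltas, sum(deltas), linear_model(x) - linear_model(baseline))
print(completeness_gap(linear_model, x, baseline, deltas))
```

For this additive toy model the per-feature occlusion changes sum to the total deviation, so the gap is zero. Dropping any attribution from the list makes the gap equal to the size of the omitted effect.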
Challenges and Practical Limitations
Achieving a perfect completeness score in practice is challenging due to:
- Non-Linear Interactions: In complex models, feature effects are not additive. The impact of occluding two features together may not equal the sum of occluding them individually, violating the linear assumption of some attribution methods.
- Baseline Sensitivity: The score is sensitive to the choice of baseline input (e.g., all zeros, average values). Different baselines yield different output deviations (f(x) - f(x')), changing the completeness calculation.
- Explanation Method Constraints: Some popular explanation methods (e.g., LIME) are not designed to be inherently complete. Their local surrogate models may not perfectly recover the complex model's output, leading to an inherent completeness gap.
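A small example makes the first two challenges concrete. It is only a sketch: `product_model` is a hypothetical model with a pure interaction term, and `occlusion_sum` is an illustrative helper that adds up single-feature occlusion effects.

```python
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], float]


def occlusion_sum(model: Model, x: Sequence[float],
                  baseline: Sequence[float]) -> float:
    """Sum of prediction changes from occluding each feature individually."""
    fx = model(x)
    total = 0.0
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline[i]
        total += fx - model(occluded)
    return total


def product_model(v: Sequence[float]) -> float:
    # Hypothetical non-additive model: the two features only act together.
    return v[0] * v[1]


x = [2.0, 3.0]
for baseline in ([0.0, 0.0], [1.0, 1.0]):
    deviation = product_model(x) - product_model(baseline)
    print(baseline, deviation, occlusion_sum(product_model, x, baseline))
```

With the all-zeros baseline the output deviation is 6 while the individual occlusion effects sum to 12; with the all-ones baseline the two numbers are 5 and 7. The interaction breaks additivity, and the size of the mismatch changes with the baseline.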
Role in Enterprise AI Governance
For regulated industries and high-stakes applications, completeness is a non-negotiable component of the explanation audit trail. It provides:
- Quantitative Justification: Offers a numerical score to prove an explanation is comprehensive, moving beyond qualitative assessment.
- Regulatory Compliance: Supports adherence to principles like 'right to explanation' in regulations such as the EU AI Act, by demonstrating that explanations are thorough and account for the full decision.
- Risk Mitigation: Incomplete explanations can hide model reliance on spurious correlations or protected attributes. A high completeness score increases confidence that all decision drivers are visible for human review and bias auditing.
How Is the Completeness Score Calculated?
The completeness score is a quantitative metric used to validate post-hoc explanations in machine learning by measuring the degree to which they account for the model's prediction.
A completeness score is calculated by comparing the sum of feature attribution values for a selected subset of important features against the model's total output deviation from a baseline. Formally, it is often defined as the ratio of the sum of Shapley values for the top-K features to the sum of all Shapley values. Because exact Shapley values satisfy the efficiency axiom, the denominator equals the output deviation f(x) - f(x'), so the ratio is the share of that deviation attributed to the selected features. A score of 1.0 indicates the explanation accounts for 100% of the model's prediction relative to the baseline, while lower scores suggest missing explanatory factors. This metric directly assesses the sufficiency of an explanation.
In practice, calculation involves generating attributions using a method like SHAP or Integrated Gradients, selecting the most important features, and computing the ratio. It is a core component of post-hoc explanation validation, often used alongside faithfulness and stability scores. A low completeness score signals that the explanation is incomplete, potentially omitting critical features, which undermines trust and algorithmic explainability for stakeholders like data scientists and regulatory teams.
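The ratio step can be sketched as below, taking the attribution values as given (for instance, produced beforehand by SHAP or Integrated Gradients). Two choices here are assumptions of the example rather than part of the definition above: features are ranked by absolute attribution, and the ratio is taken over absolute values so that it stays between 0 and 1 when attributions have mixed signs. The name `top_k_attribution_ratio` is hypothetical.

```python
from typing import Sequence


def top_k_attribution_ratio(attributions: Sequence[float], k: int) -> float:
    """Share of total attribution magnitude carried by the k largest features."""
    if not 0 < k <= len(attributions):
        raise ValueError("k must be between 1 and the number of features")
    magnitudes = sorted((abs(a) for a in attributions), reverse=True)
    total = sum(magnitudes)
    if total == 0:
        raise ValueError("all attributions are zero; the ratio is undefined")
    return sum(magnitudes[:k]) / total


# Hypothetical attribution vector for a four-feature instance.
phi = [0.5, -0.2, 0.1, 0.2]
for k in range(1, len(phi) + 1):
    print(k, round(top_k_attribution_ratio(phi, k), 3))
```

A signed variant that follows the text literally would divide the signed sum of the top-K values by the signed sum of all values; it can exceed 1.0 when negative attributions are left out of the selected subset.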
Completeness Score vs. Other Explanation Metrics
A comparison of quantitative metrics used to assess the quality and faithfulness of post-hoc explanations for machine learning model predictions.
| Metric / Property | Completeness Score | Faithfulness Score | Stability Score |
|---|---|---|---|
| Core Definition | Measures if an explanation accounts for all significant contributing features. | Measures how accurately an explanation reflects the model's true reasoning. | Measures the consistency of explanations for similar or perturbed inputs. |
| Primary Goal | Ensure no critical factor is omitted from the explanation. | Ensure the explanation's importance scores match the model's internal causality. | Ensure the explanation method is robust and not overly sensitive to noise. |
| Validation Method | Perturbation analysis: removing top-k features should cause a large prediction change. | Perturbation analysis: correlating feature importance with prediction delta upon removal. | Generate explanations for inputs within a local neighborhood and measure variance. |
| Output | Scalar score (typically 0 to 1). Higher is better. | Scalar score (typically 0 to 1). Higher is better. | Scalar score (e.g., Jaccard similarity, rank correlation). Higher is better. |
| Relation to Sufficiency | Directly related; a complete explanation is a sufficient one. | Distinct; a faithful explanation may not be sufficient if it omits minor factors. | Orthogonal; an explanation can be stable but incomplete or unfaithful. |
| Relation to Infidelity | Inverse relationship; high completeness implies low infidelity for significant features. | Direct inverse; infidelity quantifies the failure of faithfulness. | Largely independent; infidelity measures per-instance error, not cross-instance consistency. |
| Model-Agnostic | | | |
| Common Use Case | Auditing for regulatory compliance; ensuring explanations are not misleading by omission. | Debugging model logic; validating that explanation tools are probing the correct mechanisms. | Assessing explanation reliability for production deployment and user trust. |
Frequently Asked Questions
These questions address the Completeness Score, a core metric for evaluating whether an explanation for a model's prediction accounts for all significant contributing factors.
A Completeness Score is a quantitative metric that evaluates whether a post-hoc explanation for a model's prediction accounts for all features or factors that contributed significantly to that specific output. It measures the sufficiency of an explanation by assessing if the subset of features identified as important is, by itself, enough for the model to replicate its original prediction with high confidence. A high score indicates the explanation is comprehensive and captures the model's true reasoning, while a low score suggests critical drivers were omitted, making the explanation potentially misleading.
This score is distinct from related metrics like Faithfulness Score (which measures if the explanation reflects the model's actual process) and Stability Score (which measures consistency across similar inputs). Completeness is a cornerstone of Post-hoc Explanation Validation, ensuring explanations are not just plausible but exhaustively account for the prediction.
Related Terms
Completeness is one of several quantitative metrics used to validate the quality of post-hoc explanations for AI model predictions. These related terms define the broader framework for assessing explanation faithfulness, robustness, and utility.
Faithfulness Score
A faithfulness score is a quantitative metric that measures how accurately an explanation reflects the true reasoning process or causal factors of the underlying model for a given prediction. It is the core property that completeness, sufficiency, and infidelity all aim to quantify.
- Direct Measurement: Often evaluated via perturbation analysis, where features deemed important by the explanation are modified to see if the prediction changes as expected.
- Contrast with Completeness: While completeness asks 'did the explanation include all important features?', faithfulness asks a broader question: 'is the explanation's entire importance ranking correct?'
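One common way to put a number on faithfulness is to correlate the attributions with the prediction change observed when each feature is removed on its own. The sketch below assumes that approach; `faithfulness_correlation`, `pearson`, and `linear_model` are hypothetical names, and removal is implemented as replacement with a baseline value.

```python
import math
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], float]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)


def faithfulness_correlation(model: Model, x: Sequence[float],
                             baseline: Sequence[float],
                             attributions: Sequence[float]) -> float:
    """Correlate attributions with single-feature removal effects."""
    fx = model(x)
    deltas = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline[i]
        deltas.append(fx - model(occluded))
    return pearson(attributions, deltas)


def linear_model(v: Sequence[float]) -> float:
    # Hypothetical model used only for the demonstration.
    return 2.0 * v[0] - 1.0 * v[1] + 0.5 * v[2]


x, baseline = [1.0, 3.0, 4.0], [0.0, 0.0, 0.0]
print(faithfulness_correlation(linear_model, x, baseline, [2.0, -3.0, 2.0]))
print(faithfulness_correlation(linear_model, x, baseline, [-2.0, 3.0, -2.0]))
```

Attributions that match the removal effects give a correlation close to 1; attributions with the signs flipped give a correlation close to -1, even though both vectors contain the same set of features.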
Sufficiency
Sufficiency is an explanation metric that measures whether the subset of features identified as most important by an explanation is, by itself, sufficient for the model to make its original prediction. It tests for predictive power of the highlighted features.
- Evaluation Method: The top-k most important features (according to the explanation) are fed to the model, often with other features masked or set to a baseline. A high sufficiency score means this subset alone yields a prediction nearly identical to the full-input prediction.
- Relation to Completeness: Completeness and sufficiency are complementary. A complete explanation that includes all important features will naturally be sufficient, but a sufficient explanation may not be complete if it identifies a powerful but incomplete subset.
Infidelity
Infidelity is an explanation metric that quantifies the degree to which an explanation fails to accurately reflect the model's output when the input is perturbed according to the explanation's importance scores. It is a measure of explanation error.
- Calculation: Given an importance score vector, significant perturbations (e.g., removing top features) are applied to the input. Infidelity measures the expected difference between (1) the dot product of the importance vector and the perturbation vector, and (2) the actual change in the model's output.
- Inverse Relationship: A low infidelity score indicates high faithfulness. It directly penalizes explanations where the attributed importance does not correlate with the actual impact on the model's prediction under perturbation.
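The calculation described above can be sketched with a fixed list of perturbation vectors standing in for the expectation. The squared-error form used here is the usual formulation of infidelity and is assumed for this example; the names `infidelity` and `linear_model` and the perturbation set are illustrative.

```python
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], float]


def infidelity(model: Model, x: Sequence[float],
               attributions: Sequence[float],
               perturbations: Sequence[Sequence[float]]) -> float:
    """Mean squared gap between predicted and actual output change."""
    fx = model(x)
    errors = []
    for p in perturbations:
        perturbed = [xi - pi for xi, pi in zip(x, p)]
        # (1) change predicted by the explanation: importance . perturbation
        predicted_change = sum(a * pi for a, pi in zip(attributions, p))
        # (2) change actually observed in the model output
        actual_change = fx - model(perturbed)
        errors.append((predicted_change - actual_change) ** 2)
    return sum(errors) / len(errors)


def linear_model(v: Sequence[float]) -> float:
    # Hypothetical model whose exact per-unit importances are [2, -1].
    return 2.0 * v[0] - 1.0 * v[1]


x = [3.0, 1.0]
perturbations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(infidelity(linear_model, x, [2.0, -1.0], perturbations))  # exact importances
print(infidelity(linear_model, x, [2.0, 0.0], perturbations))   # second feature ignored
```

The exact importance vector yields an infidelity of 0. Zeroing out the second feature's importance yields 2/3, because two of the three perturbations touch that feature and each is mispredicted by one unit.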
Perturbation Analysis
Perturbation analysis is a foundational explanation validation technique that systematically modifies or removes input features to observe the resulting changes in the model's output. It is the empirical basis for calculating completeness, faithfulness, and sufficiency.
- Methods: Includes occlusion sensitivity (for images), feature ablation, or adding noise.
- Ground Truth Proxy: The change in prediction due to a perturbation is treated as a proxy for that feature's 'true' importance, against which explanation methods are benchmarked.
- Challenges: Requires careful design of meaningful perturbations and baselines to avoid misleading results.
Explanation Robustness
Explanation robustness refers to the property of an explanation method to produce consistent and stable attributions for a given prediction when the input or model is subjected to minor, semantically-preserving perturbations. It is concerned with the reliability of the explanation method itself.
- Stability Score: A related metric that quantifies this consistency.
- Contrast with Faithfulness: An explanation can be robust (consistent under small noise) but not faithful (consistently wrong). Conversely, a faithful explanation should be robust to semantically-invariant changes.
- Importance: Lack of robustness undermines user trust, as similar inputs yield wildly different explanations.
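A stability score of the kind mentioned above can be sketched as the average Jaccard similarity between the top-k feature set for an input and the top-k sets for nearby inputs. Everything named here is an assumption for illustration: `stability_score`, the toy explainer `gradient_times_input`, and the hand-picked neighbors.

```python
from typing import Callable, List, Sequence, Set

Explainer = Callable[[Sequence[float]], List[float]]


def top_k(attributions: Sequence[float], k: int) -> Set[int]:
    """Indices of the k features with the largest absolute attribution."""
    order = sorted(range(len(attributions)),
                   key=lambda i: abs(attributions[i]), reverse=True)
    return set(order[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stability_score(explain: Explainer, x: Sequence[float],
                    neighbors: Sequence[Sequence[float]], k: int) -> float:
    """Mean Jaccard similarity of top-k sets between x and its neighbors."""
    reference = top_k(explain(x), k)
    sims = [jaccard(reference, top_k(explain(n), k)) for n in neighbors]
    return sum(sims) / len(sims)


WEIGHTS = [2.0, -1.0, 0.5]


def gradient_times_input(v: Sequence[float]) -> List[float]:
    # Hypothetical explainer for a linear model with the weights above.
    return [w * vi for w, vi in zip(WEIGHTS, v)]


x = [3.0, 1.0, 1.0]
neighbors = [[3.1, 0.9, 1.0], [3.0, 0.2, 1.0]]
print(stability_score(gradient_times_input, x, neighbors, k=2))
```

The first neighbor keeps the same top-2 set (similarity 1), while the second swaps one feature (similarity 1/3), so the score is 2/3. Rank correlation over the full attribution vector is an alternative when the ordering matters more than set membership.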
Local Fidelity
Local fidelity is a property of a post-hoc explanation that measures how well the explanation approximates the behavior of the complex model in the immediate vicinity of a specific input instance. It is the foundational promise of local explanation methods like LIME.
- Definition: A high-fidelity explanation acts as a locally accurate surrogate model.
- Evaluation: Measured by how well the explanation's simplified model (e.g., a linear model) predicts the outputs of the black-box model for perturbed samples near the instance of interest.
- Foundation for Validation: Metrics like completeness are only meaningful if the explanation method has high local fidelity to begin with; otherwise, it is explaining a surrogate that doesn't match the original model.
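The evaluation step above can be sketched by sampling points near the instance and computing R² between the black-box outputs and the surrogate's predictions. This is a simplified illustration and not LIME's own procedure, which uses its own sampling and proximity weighting; `local_fidelity_r2`, `black_box`, and `tangent_surrogate` are hypothetical names, and the uniform sampling radius is an assumption.

```python
import random
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], float]


def local_fidelity_r2(model: Model, surrogate: Model, x: Sequence[float],
                      radius: float = 0.1, n_samples: int = 200,
                      seed: int = 0) -> float:
    """R^2 of the surrogate against the model on samples near x."""
    rng = random.Random(seed)
    samples = [[xi + rng.uniform(-radius, radius) for xi in x]
               for _ in range(n_samples)]
    targets = [model(s) for s in samples]
    preds = [surrogate(s) for s in samples]
    mean_t = sum(targets) / len(targets)
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    if ss_tot == 0:
        raise ValueError("model output is constant in this neighborhood")
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, preds))
    return 1.0 - ss_res / ss_tot


def black_box(v: Sequence[float]) -> float:
    # Hypothetical non-linear model.
    return v[0] ** 2 + v[1]


def tangent_surrogate(v: Sequence[float]) -> float:
    # Linearization of black_box around the point [1.0, 1.0].
    return 2.0 * v[0] + v[1] - 1.0


x = [1.0, 1.0]
print(local_fidelity_r2(black_box, tangent_surrogate, x, radius=0.1))
print(local_fidelity_r2(black_box, tangent_surrogate, x, radius=2.0))
```

The linear surrogate tracks the model closely within a small radius and degrades as the neighborhood widens, which is why a fidelity figure is only meaningful together with the neighborhood it was measured on.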

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.