Inferensys

Glossary

Completeness Score

A completeness score is a quantitative metric in machine learning explainability that measures whether a generated explanation accounts for all input features that contributed significantly to a model's prediction.
EXPLAINABILITY SCORE VALIDATION

What is Completeness Score?

A quantitative metric for evaluating the sufficiency of post-hoc model explanations.

A completeness score is a metric that evaluates whether an explanation accounts for all features or factors that contributed significantly to a model's prediction. It is a core component of post-hoc explanation validation, measuring the sufficiency of an explanation's selected features. The score quantifies how much of the model's output variation is captured by the subset of features highlighted as important, ensuring the explanation is not missing critical, influential factors. High completeness indicates that the highlighted features are a sufficient summary of what drove the model's output for that specific instance.

The score is calculated by comparing the model's original prediction to its output when only the features identified by the explanation are retained. Common techniques involve perturbation analysis, where non-important features are masked or replaced with baseline values. A perfect completeness score of 1.0 signifies the explanation's feature set alone is sufficient for the model to replicate its original prediction. It is often evaluated alongside the faithfulness score and stability score to provide a holistic assessment of explanation quality in Explainability Score Validation frameworks.
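
The retain-and-compare procedure described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `retention_completeness`, the toy linear model, the all-zeros baseline, and the normalization by the full output shift are all assumptions made for the example.

```python
from typing import Callable, Sequence, Set


def retention_completeness(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
    kept: Set[int],
) -> float:
    """Score how closely the kept features alone reproduce the prediction.

    Hypothetical helper: features outside `kept` are replaced with their
    baseline values, and the result is normalized by the full output shift.
    """
    masked = [x[i] if i in kept else baseline[i] for i in range(len(x))]
    full_shift = model(x) - model(baseline)
    kept_shift = model(masked) - model(baseline)
    if full_shift == 0:
        raise ValueError("prediction equals baseline output; score is undefined")
    return 1.0 - abs(full_shift - kept_shift) / abs(full_shift)


def toy_model(v: Sequence[float]) -> float:
    # Assumed toy model: a fixed linear function of three features
    return 3.0 * v[0] + 2.0 * v[1] + 0.5 * v[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

# Explanation that highlights only the two largest-weight features
score_top2 = retention_completeness(toy_model, x, baseline, kept={0, 1})
# Explanation that highlights every feature
score_all = retention_completeness(toy_model, x, baseline, kept={0, 1, 2})
print(score_top2, score_all)
```

In this toy case, retaining the two highlighted features recovers about 91% of the output shift, and retaining all three recovers all of it.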

Key Characteristics of Completeness Score

The completeness score is a core metric in explainable AI (XAI) that quantifies whether a post-hoc explanation for a model's prediction accounts for all significant contributing factors. It is a critical component of rigorous explanation validation.

01

Definition and Core Purpose

A completeness score is a quantitative metric that evaluates the degree to which a feature attribution or explanation accounts for the total prediction made by a model. Its core purpose is to answer: 'Does this explanation capture all the important reasons for this prediction?' It is calculated by comparing the sum of the importance scores assigned by the explanation to the actual model output deviation from a baseline. A low score indicates the explanation is missing key predictive factors.

02

Mathematical Foundation

The score is often grounded in the efficiency axiom from cooperative game theory, as applied by methods like SHAP (where it is called local accuracy) and Integrated Gradients (where it is called completeness). Formally, for a model f and explanation φ, completeness requires: f(x) - f(x') = Σ_i φ_i, where x is the input instance, x' is a baseline input (e.g., all features masked), and φ_i is the attribution for feature i. The completeness score C can be expressed as C = 1 - |(f(x) - f(x')) - Σ_i φ_i| / |f(x) - f(x')|. A perfect score of 1.0 indicates the attributions sum exactly to the model's output difference. The score is undefined when f(x) = f(x'), and it drops below 0 when the gap is larger than the output difference itself.
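
The formula for C can be computed directly once the attributions are available. The sketch below is illustrative only: `completeness_score` is a hypothetical helper name, and the attributions come from an assumed linear model with an all-zeros baseline, for which each attribution is the weight times the feature's offset from the baseline.

```python
from typing import Sequence


def completeness_score(
    f_x: float, f_baseline: float, attributions: Sequence[float]
) -> float:
    """C = 1 - |(f(x) - f(x')) - sum(phi)| / |f(x) - f(x')|."""
    delta = f_x - f_baseline
    if delta == 0:
        raise ValueError("f(x) equals f(x'); the score is undefined")
    gap = abs(delta - sum(attributions))
    return 1.0 - gap / abs(delta)


# Assumed linear model and inputs, chosen so the arithmetic is easy to follow
weights = [2.0, -1.0, 0.5]
x = [1.0, 3.0, 4.0]
baseline = [0.0, 0.0, 0.0]

f_x = sum(w * v for w, v in zip(weights, x))                # 2 - 3 + 2 = 1.0
f_baseline = sum(w * v for w, v in zip(weights, baseline))  # 0.0

# For a linear model, w_i * (x_i - x'_i) sums exactly to f(x) - f(x')
full_attr = [w * (v - b) for w, v, b in zip(weights, x, baseline)]
c_full = completeness_score(f_x, f_baseline, full_attr)

# Dropping one attribution leaves a gap larger than the output difference
c_truncated = completeness_score(f_x, f_baseline, full_attr[:2])
print(c_full, c_truncated)
```

The truncated case yields a negative score, which shows that C as written is bounded above by 1.0 but not below by 0.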

03

Relationship to Faithfulness and Sufficiency

Completeness is one of three pillars of explanation quality, distinct from but related to faithfulness and sufficiency.

  • Faithfulness: Measures if the explanation's ranked importance correlates with the actual impact of features on the model's output. An explanation can be faithful but incomplete.
  • Sufficiency: Measures if the top-K features identified by an explanation are, by themselves, sufficient for the model to make the same prediction. It tests a selected subset of features, whereas completeness concerns all attributions taken together.
  • Completeness: Ensures the magnitude of all attributions accounts for the total model output. A complete explanation is necessary for full accountability.
04

Validation via Perturbation Analysis

A primary method for empirically validating completeness is systematic perturbation analysis. This involves:

  • Occlusion Tests: Iteratively removing or masking features deemed important by the explanation and observing the prediction change.
  • Expected Change Calculation: The sum of prediction changes from occluding each feature should approximate the total deviation of the prediction from its baseline.
  • Gap Measurement: The discrepancy between the expected total change (from the explanation) and the actual total model output change is the infidelity, which is inversely related to completeness. A high infidelity score indicates low completeness.
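
The three steps above can be sketched as a single routine. This is a simplified illustration under stated assumptions: `occlusion_gap` is a hypothetical name, occlusion is done by swapping one feature at a time for its baseline value, and the two toy models are invented for the example.

```python
from typing import Callable, List, Sequence, Tuple


def occlusion_gap(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> Tuple[List[float], float, float]:
    """Return (per-feature occlusion changes, total change, absolute gap)."""
    f_x = model(x)
    total_change = f_x - model(baseline)
    changes = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline[i]  # occlusion test on feature i
        changes.append(f_x - model(occluded))
    # Gap between the summed single-feature changes and the total deviation
    gap = abs(total_change - sum(changes))
    return changes, total_change, gap


def additive_model(v: Sequence[float]) -> float:
    return 3.0 * v[0] + 2.0 * v[1]


def interaction_model(v: Sequence[float]) -> float:
    return v[0] * v[1]


_, _, additive_gap = occlusion_gap(additive_model, [1.0, 1.0], [0.0, 0.0])
_, _, interaction_gap = occlusion_gap(interaction_model, [2.0, 3.0], [0.0, 0.0])
print(additive_gap, interaction_gap)
```

For the additive model the single-feature changes sum to the total deviation, so the gap is zero. For the multiplicative model each occlusion removes the whole output, so the summed changes double-count and the gap is nonzero.
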
05

Challenges and Practical Limitations

Achieving a perfect completeness score in practice is challenging due to:

  • Non-Linear Interactions: In complex models, feature effects are not additive. The impact of occluding two features together may not equal the sum of occluding them individually, violating the linear assumption of some attribution methods.
  • Baseline Sensitivity: The score is sensitive to the choice of baseline input (e.g., all zeros, average values). Different baselines yield different output deviations (f(x)-f(x')), changing the completeness calculation.
  • Explanation Method Constraints: Some popular explanation methods (e.g., LIME) are not designed to be inherently complete. Their local surrogate models may not perfectly recover the complex model's output, leading to an inherent completeness gap.
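
Baseline sensitivity is easy to see with a small numeric case. The sketch below is an assumption-laden toy: the model, the fixed attribution vector (taken to have been computed against an all-zeros baseline), and the "mean" baseline values are all invented for illustration.

```python
from typing import Sequence


def model(v: Sequence[float]) -> float:
    # Assumed toy model with one non-linear term
    return v[0] ** 2 + v[1]


def completeness(f_x: float, f_base: float, attributions: Sequence[float]) -> float:
    delta = f_x - f_base
    if delta == 0:
        raise ValueError("f(x) equals f(x'); the score is undefined")
    return 1.0 - abs(delta - sum(attributions)) / abs(delta)


x = [2.0, 1.0]
attributions = [4.0, 1.0]    # assumed to come from a zero-baseline explainer
zero_baseline = [0.0, 0.0]
mean_baseline = [1.0, 1.0]   # hypothetical dataset mean

# Same explanation, two baselines, two different output deviations
c_zero = completeness(model(x), model(zero_baseline), attributions)
c_mean = completeness(model(x), model(mean_baseline), attributions)
print(c_zero, c_mean)
```

The same attribution vector scores 1.0 against the baseline it was computed for and roughly 0.33 against the other, so the baseline used for scoring should match the one used by the explanation method.
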
06

Role in Enterprise AI Governance

For regulated industries and high-stakes applications, completeness is a non-negotiable component of the explanation audit trail. It provides:

  • Quantitative Justification: Offers a numerical score to support the claim that an explanation is comprehensive, moving beyond qualitative assessment.
  • Regulatory Compliance: Supports adherence to principles like 'right to explanation' in regulations such as the EU AI Act, by demonstrating that explanations are thorough and account for the full decision.
  • Risk Mitigation: Incomplete explanations can hide model reliance on spurious correlations or protected attributes. A high completeness score increases confidence that all decision drivers are visible for human review and bias auditing.

How is Completeness Score Calculated?

The completeness score is a quantitative metric used to validate post-hoc explanations in machine learning by measuring the degree to which they account for the model's prediction.

One way to calculate a completeness score is to compare the sum of feature attribution values for a selected subset of important features against the model's total output deviation from a baseline. In a Shapley-based formulation, this is the ratio of the sum of Shapley values for the top-K features to the sum of all Shapley values, which by the efficiency property equals f(x) - f(x'). A score of 1.0 indicates the selected features account for 100% of the model's prediction relative to the baseline, while lower scores suggest missing explanatory factors. Because attributions can be negative, the ratio can fall outside the 0 to 1 range unless it is computed on absolute values. This metric directly assesses the sufficiency of an explanation.

In practice, calculation involves generating attributions using a method like SHAP or Integrated Gradients, selecting the most important features, and computing the ratio. It is a core component of post-hoc explanation validation, often used alongside faithfulness and stability scores. A low completeness score signals that the explanation is incomplete, potentially omitting critical features, which undermines trust and algorithmic explainability for stakeholders like data scientists and regulatory teams.
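
The select-and-ratio step can be sketched as follows. This is only an illustration: `top_k_ratio` is a hypothetical helper, the attribution values are made up rather than produced by SHAP or Integrated Gradients, and ranking and summing by absolute value is an assumption made here to keep the ratio between 0 and 1.

```python
from typing import Sequence


def top_k_ratio(attributions: Sequence[float], k: int) -> float:
    """Share of total absolute attribution carried by the k largest features."""
    total = sum(abs(a) for a in attributions)
    if total == 0:
        raise ValueError("all attributions are zero; the ratio is undefined")
    # Rank features by attribution magnitude, largest first
    ranked = sorted(attributions, key=abs, reverse=True)
    return sum(abs(a) for a in ranked[:k]) / total


# Made-up attribution vector for a single prediction
attributions = [0.5, -0.25, 0.125, 0.125]

ratio_top2 = top_k_ratio(attributions, k=2)
ratio_all = top_k_ratio(attributions, k=4)
print(ratio_top2, ratio_all)
```

With these values the two largest attributions carry 75% of the total magnitude, so an explanation that reports only those two features would score 0.75 under this variant.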

EXPLANATION VALIDATION

Completeness Score vs. Other Explanation Metrics

A comparison of quantitative metrics used to assess the quality and faithfulness of post-hoc explanations for machine learning model predictions.

| Metric / Property | Completeness Score | Faithfulness Score | Stability Score |
| --- | --- | --- | --- |
| Core Definition | Measures if an explanation accounts for all significant contributing features. | Measures how accurately an explanation reflects the model's true reasoning. | Measures the consistency of explanations for similar or perturbed inputs. |
| Primary Goal | Ensure no critical factor is omitted from the explanation. | Ensure the explanation's importance scores match the model's internal causality. | Ensure the explanation method is robust and not overly sensitive to noise. |
| Validation Method | Perturbation analysis: removing top-k features should cause a large prediction change. | Perturbation analysis: correlating feature importance with prediction delta upon removal. | Generate explanations for inputs within a local neighborhood and measure variance. |
| Output | Scalar score (typically 0 to 1). Higher is better. | Scalar score (typically 0 to 1). Higher is better. | Scalar score (e.g., Jaccard similarity, rank correlation). Higher is better. |
| Relation to Sufficiency | Directly related; a complete explanation is a sufficient one. | Distinct; a faithful explanation may not be sufficient if it omits minor factors. | Orthogonal; an explanation can be stable but incomplete or unfaithful. |
| Relation to Infidelity | Inverse relationship; high completeness implies low infidelity for significant features. | Direct inverse; infidelity quantifies the failure of faithfulness. | Largely independent; infidelity measures per-instance error, not cross-instance consistency. |
| Model-Agnostic | | | |
| Common Use Case | Auditing for regulatory compliance; ensuring explanations are not misleading by omission. | Debugging model logic; validating that explanation tools are probing the correct mechanisms. | Assessing explanation reliability for production deployment and user trust. |
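
The Stability Score's validation method in the comparison above mentions Jaccard similarity. A minimal sketch of that idea follows, comparing the top-k feature sets of two explanations; the helper names and both attribution vectors are hypothetical, and a real check would average this over many neighboring inputs.

```python
from typing import Sequence, Set


def top_k_features(attributions: Sequence[float], k: int) -> Set[int]:
    """Indices of the k features with the largest absolute attribution."""
    order = sorted(
        range(len(attributions)),
        key=lambda i: abs(attributions[i]),
        reverse=True,
    )
    return set(order[:k])


def jaccard_stability(
    attr_a: Sequence[float], attr_b: Sequence[float], k: int
) -> float:
    """Jaccard similarity of the top-k feature sets of two explanations."""
    a = top_k_features(attr_a, k)
    b = top_k_features(attr_b, k)
    return len(a & b) / len(a | b)


# Made-up explanations for an input and a slightly perturbed neighbor
attr_original = [0.6, 0.3, 0.1, 0.0]
attr_neighbor = [0.5, 0.1, 0.3, 0.0]

stability = jaccard_stability(attr_original, attr_neighbor, k=2)
print(stability)
```

Here the two explanations share one of three distinct top-2 features, so the similarity is one third.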

Frequently Asked Questions

These questions address the Completeness Score, a core metric for evaluating whether an explanation for a model's prediction accounts for all significant contributing factors.

A Completeness Score is a quantitative metric that evaluates whether a post-hoc explanation for a model's prediction accounts for all features or factors that contributed significantly to that specific output. It measures the sufficiency of an explanation by assessing if the subset of features identified as important is, by itself, enough for the model to replicate its original prediction with high confidence. A high score indicates the explanation is comprehensive and captures the model's true reasoning, while a low score suggests critical drivers were omitted, making the explanation potentially misleading.

This score is distinct from related metrics like Faithfulness Score (which measures if the explanation reflects the model's actual process) and Stability Score (which measures consistency across similar inputs). Completeness is a cornerstone of Post-hoc Explanation Validation, ensuring explanations are not just plausible but exhaustively account for the prediction.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. With 5+ years of experience, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.