Inferensys

Glossary

Integrated Gradients

Integrated Gradients is a feature attribution method for explaining neural network predictions by integrating the model's gradients along a straight-line path from a baseline input to the actual input.
EXPLAINABILITY SCORE VALIDATION

What is Integrated Gradients?

Integrated Gradients is a foundational feature attribution method in machine learning explainability that quantifies the contribution of each input feature to a model's prediction.

Integrated Gradients is a feature attribution method that assigns an importance score to each input feature by computing the path integral of the model's gradients along a straight-line path from a baseline input (e.g., a black image or zero vector) to the actual input. The technique was designed to satisfy two key axioms, Sensitivity and Implementation Invariance, and it additionally satisfies Completeness. Because it needs only the gradients of the output with respect to the input, it applies to any differentiable model without modifying the architecture, which makes it a core tool for post-hoc explanation validation.

The method's output is a vector of attribution scores, one per input feature, often visualized as a saliency map for image data. A first faithfulness check is completeness: the attributions should sum to the difference between the model's prediction for the input and its prediction for the baseline. Because the method requires gradient access, it is not model-agnostic in the black-box sense, but it is directly applicable to any differentiable model, including differentiable components within Retrieval-Augmented Generation or vision-language-action models, where it supports algorithmic explainability and interpretability audits.
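The definition above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a reference implementation: `toy_model` and `toy_grad` are hypothetical stand-ins for a real network and its backpropagated gradient, and the integral is approximated with a midpoint Riemann sum.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], Sequence[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    # Scale the path-averaged gradient by the input-baseline difference.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Hypothetical toy model F(x) = x0 * x1 + 3 * x2 and its analytic gradient.
def toy_model(x):
    return x[0] * x[1] + 3.0 * x[2]


def toy_grad(x):
    return [x[1], x[0], 3.0]


x = [2.0, 4.0, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)
delta = toy_model(x) - toy_model(baseline)
print("attributions:", attributions)
print("sum of attributions:", sum(attributions), "prediction delta:", delta)
```

The interaction term `x0 * x1` is split evenly between the two features that produce it, and the scores add up to the prediction delta.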

INTEGRATED GRADIENTS

Core Axioms and Theoretical Properties

Integrated Gradients is a feature attribution method that assigns importance scores by integrating the model's gradients along a straight-line path from a baseline input to the actual input. The following cards detail its foundational axioms and validation properties.

01

Completeness Axiom

The Completeness Axiom (or Summation to Delta) is the fundamental property that ensures the attribution scores for all input features sum to the difference between the model's output for the input and its output for the baseline. Formally, for input \(x\) and baseline \(x'\), the attributions \(a_i\) satisfy \(\sum_i a_i = F(x) - F(x')\). This guarantees the explanation accounts for the entire prediction delta, providing a natural scale for importance scores.
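A small worked case may make the property concrete. Assume a hypothetical two-feature model \(F(x) = x_1 x_2\) with the zero baseline \(x' = (0, 0)\). Along the straight-line path the point is \(\alpha x\), so \(\partial F / \partial x_1 = \alpha x_2\) and \(\partial F / \partial x_2 = \alpha x_1\):

\[
a_1 = x_1 \int_0^1 \alpha x_2 \, d\alpha = \tfrac{1}{2} x_1 x_2,
\qquad
a_2 = x_2 \int_0^1 \alpha x_1 \, d\alpha = \tfrac{1}{2} x_1 x_2,
\qquad
a_1 + a_2 = x_1 x_2 = F(x) - F(x').
\]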

02

Implementation Invariance

Implementation Invariance ensures that two functionally equivalent models (i.e., models that produce identical outputs for all inputs, regardless of their internal architecture or implementation details) will receive identical feature attributions. This axiom is critical because it means Integrated Gradients explains the function the model computes, not the idiosyncrasies of its implementation. It distinguishes the method from approaches that are sensitive to internal parameterization.
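The sketch below illustrates the idea with two hypothetical implementations of the same function. It uses central finite differences as a stand-in for backpropagated gradients, an assumption made only to keep the example dependency-free; both depend on the function being computed rather than on how it is written.

```python
def numeric_grad(f, x, eps=1e-5):
    """Central-difference gradient; uses only function values, never internals."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


def integrated_gradients(f, x, baseline, steps=64):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    avg = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(numeric_grad(f, point)):
            avg[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg)]


# Two hypothetical "implementations" of the same function (x0 + x1)^2.
def model_a(x):
    return (x[0] + x[1]) ** 2


def model_b(x):
    return x[0] ** 2 + 2 * x[0] * x[1] + x[1] ** 2


x, baseline = [1.0, 2.0], [0.0, 0.0]
attr_a = integrated_gradients(model_a, x, baseline)
attr_b = integrated_gradients(model_b, x, baseline)
print("model_a attributions:", attr_a)
print("model_b attributions:", attr_b)
```

Up to floating-point noise, the two attribution vectors agree even though the expressions are written differently.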

03

Sensitivity

The Sensitivity axiom has two parts. First, if the input and the baseline differ in exactly one feature and the model produces different predictions for them, that feature must receive a non-zero attribution. Second, if the model's prediction function does not depend mathematically on a feature, the attribution for that feature must always be zero. Plain gradients can violate the first part when the function saturates, because the gradient at the input may be zero even though the feature changed the prediction. Together, the two parts keep the method from overlooking features the function actually relies on and from assigning importance to features it ignores.
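A standard one-variable illustration is the saturating function \(F(x) = 1 - \mathrm{ReLU}(1 - x)\) with baseline 0 and input 2. The sketch below (function names are illustrative) compares the plain gradient at the input with the Integrated Gradients attribution.

```python
def model(x: float) -> float:
    """F(x) = 1 - ReLU(1 - x): rises linearly until x = 1, then saturates."""
    return 1.0 - max(0.0, 1.0 - x)


def grad(x: float) -> float:
    """Derivative of F: 1 before saturation, 0 afterwards."""
    return 1.0 if x < 1.0 else 0.0


def integrated_gradient_1d(x: float, baseline: float, steps: int = 100) -> float:
    """Midpoint Riemann-sum Integrated Gradients for a scalar input."""
    total = 0.0
    for k in range(steps):
        alpha = (k + 0.5) / steps
        total += grad(baseline + alpha * (x - baseline))
    return (x - baseline) * total / steps


x, baseline = 2.0, 0.0
plain_gradient = grad(x)
ig = integrated_gradient_1d(x, baseline)
print("prediction delta:", model(x) - model(baseline))
print("plain gradient at input:", plain_gradient)
print("integrated gradients:", ig)
```

The gradient at the input is zero because the function has flattened out, while the path integral still credits the feature with the full change in output.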

04

Linearity

The Linearity axiom posits that for a linear combination of two models, the attributions for the combined model are a weighted sum of the individual attributions. Formally, if \(F = aF_1 + bF_2\), then the attributions for \(F\) are \(a\) times the attributions for \(F_1\) plus \(b\) times the attributions for \(F_2\). This property ensures the explanation method behaves predictably and consistently across model ensembles or linearly composed functions.
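For Integrated Gradients the property follows from the linearity of differentiation and integration. Writing \(\mathrm{IG}_i^{F}\) for the attribution of feature \(i\) under model \(F\) (notation introduced here for illustration), with baseline \(x'\):

\[
\mathrm{IG}_i^{aF_1 + bF_2}(x)
= (x_i - x'_i) \int_0^1 \frac{\partial \left(aF_1 + bF_2\right)\!\left(x' + \alpha(x - x')\right)}{\partial x_i} \, d\alpha
= a \, \mathrm{IG}_i^{F_1}(x) + b \, \mathrm{IG}_i^{F_2}(x).
\]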

05

Baseline Selection & Sensitivity

The choice of baseline is a critical, non-axiomatic parameter that significantly influences the resulting attributions. The baseline represents an input with 'no information' (e.g., a black image, a zero vector, or an average embedding).

  • Impact: Attributions explain the prediction relative to the baseline. A poorly chosen baseline can yield uninterpretable results.
  • Common Practices: Use a neutral reference (like zero), a distributional average, or a counterfactual input representing an 'absence' of the predicted class.
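The effect of the baseline can be seen on a toy example. In the sketch below the model, its gradient, and the "dataset mean" baseline are all hypothetical values chosen for illustration.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    avg = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg)]


# Hypothetical model F(x) = x0^2 + x1 and its analytic gradient.
def model(x):
    return x[0] ** 2 + x[1]


def grad(x):
    return [2.0 * x[0], 1.0]


x = [3.0, 5.0]
zero_baseline = [0.0, 0.0]
mean_baseline = [3.0, 1.0]  # assumed dataset mean, for illustration only

attr_zero = integrated_gradients(grad, x, zero_baseline)
attr_mean = integrated_gradients(grad, x, mean_baseline)
print("zero baseline:", attr_zero)
print("mean baseline:", attr_mean)
```

With the zero baseline the first feature dominates. With the mean baseline the first feature equals its baseline value, so it receives no attribution at all. Both results satisfy completeness relative to their own baseline; they answer different questions.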
06

Path Methods & The Straight-Line Path

Integrated Gradients is part of the path methods family, which integrate gradients along a path from baseline to input. The straight-line path \(\gamma(\alpha) = x' + \alpha(x - x')\) for \(\alpha\) from 0 to 1 is the simplest and most common.

  • Advantages: It is symmetry-preserving (features that play interchangeable roles in the function and have identical values in both the input and the baseline receive identical attributions) and, like every path method, it satisfies the Completeness Axiom.
  • Alternatives: Other monotonic paths are possible and still satisfy completeness, but in general they do not preserve symmetry. Among path methods, the straight-line path is the unique symmetry-preserving choice.
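For reference, the attribution of a general path method along a smooth path \(\gamma\) with \(\gamma(0) = x'\) and \(\gamma(1) = x\) can be written as follows (the name \(\mathrm{PathIG}\) is used here only as a label):

\[
\mathrm{PathIG}_i^{\gamma}(x) = \int_0^1 \frac{\partial F(\gamma(\alpha))}{\partial \gamma_i(\alpha)} \, \frac{\partial \gamma_i(\alpha)}{\partial \alpha} \, d\alpha .
\]

Substituting the straight-line path gives \(\partial \gamma_i(\alpha) / \partial \alpha = x_i - x'_i\), which is constant and moves outside the integral, recovering the Integrated Gradients formula.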
FEATURE COMPARISON

Integrated Gradients vs. Other Attribution Methods

A technical comparison of feature attribution methods based on their theoretical properties, computational requirements, and suitability for explainability score validation.

| Methodological Property / Metric | Integrated Gradients | Gradient-Based (e.g., Saliency, Grad-CAM) | Perturbation-Based (e.g., LIME, SHAP) | Occlusion Sensitivity |
|---|---|---|---|---|
| Theoretical Foundation | Axiomatic (Completeness, Sensitivity, Implementation Invariance) | Local linear approximation via gradients | Local surrogate modeling / Cooperative game theory | Brute-force input perturbation |
| Path Requirement | Requires integration path from baseline to input | Requires only the input point | Requires local sampling around the input | Requires systematic region masking |
| Baseline Sensitivity | High (scores depend on chosen baseline) | None | Low to Moderate (sampling distribution matters) | Low to Moderate (scores depend on the value used to fill occluded regions) |
| Implementation Invariance | Yes (guarantees identical scores for functionally equivalent models) | Yes for plain input gradients; No for variants that depend on internal layers (e.g., Grad-CAM) | Yes for model-agnostic variants (e.g., KernelSHAP); No for model-specific variants | Yes (model-agnostic) |
| Computational Cost | Moderate-High (requires multiple gradient calculations along path) | Low (single forward/backward pass) | High (requires many model evaluations for sampling) | Very High (requires model evaluation per occluded region) |
| Explanation Sparsity | Low (typically assigns non-zero scores to many features) | Low (gradients are often dense) | Configurable (can be tuned for sparsity) | Configurable (depends on occlusion mask granularity) |
| Faithfulness Guarantees | High (directly integrates the model's true gradient function) | Moderate (approximates local decision boundary) | Varies (depends on fidelity of local surrogate model) | High (directly measures output change from real perturbations) |
| Suitability for Deep Networks | Yes | Yes | Yes | Yes |
| Native Support for Images | Yes | Yes (e.g., Grad-CAM) | Yes (requires image-specific segmentation) | Yes |
| Native Support for Text | Yes (with embedding baselines) | Yes (with gradient w.r.t. embeddings) | Yes | Yes (by masking tokens) |
| Standardized Quantitative Evaluation | Infidelity, Sensitivity-n | Not commonly standardized | Faithfulness, Stability | Faithfulness (by definition) |

EXPLAINABILITY SCORE VALIDATION

Practical Implementation and Considerations

While Integrated Gradients provides a theoretically sound attribution method, its practical utility depends on careful implementation and rigorous validation against established metrics.

01

Choosing the Baseline

The baseline is a critical hyperparameter representing an 'informationless' input. Common choices include:

  • Zero vector: A simple all-zero input.
  • Mean/Median feature values: Represents an average input.
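  • Random noise: An input drawn from a noise distribution (e.g., uniform or Gaussian), often with attributions averaged over several draws.
  • Counterfactual baseline: An input representing the absence of the predicted class, or an example of a contrasting class (e.g., a blank image for an object classifier).

The choice significantly impacts attributions. Because each attribution is scaled by the difference between the input and the baseline, a zero (black) baseline for an image model assigns zero attribution to pixels that are already black, while a blurred baseline emphasizes fine detail such as edges and texture. The baseline should be justified by domain knowledge.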

02

Approximating the Integral

The integral along the straight-line path is approximated numerically. The key parameter is the number of steps (m).

  • Trapezoidal rule: Averages the left and right Riemann sums, so the gradients at the two endpoints are weighted by one half.
  • Left Riemann sum: Uses gradients at the baseline and intermediate points.
  • Right Riemann sum: Uses gradients at intermediate points and the actual input.

The original Integrated Gradients paper reports that 20 to 300 steps are usually enough to approximate the integral to within 5%. Too few steps cause approximation error; too many increase compute cost with diminishing returns. Convergence should be checked by observing attribution stability, and the completeness gap, as m increases.
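The three rules can be compared on a one-dimensional toy function where the exact answer is known. This is a sketch under stated assumptions: `ig_1d` and the cubic model are illustrative, not taken from any library.

```python
def ig_1d(grad, x, baseline, steps, rule="trapezoid"):
    """Integrated Gradients for a scalar input using a chosen quadrature rule."""
    diff = x - baseline
    # Gradients at alpha = 0, 1/m, ..., 1 (m + 1 evaluation points).
    g = [grad(baseline + (k / steps) * diff) for k in range(steps + 1)]
    if rule == "left":
        avg = sum(g[:-1]) / steps
    elif rule == "right":
        avg = sum(g[1:]) / steps
    elif rule == "trapezoid":
        avg = (0.5 * g[0] + sum(g[1:-1]) + 0.5 * g[-1]) / steps
    else:
        raise ValueError(f"unknown rule: {rule}")
    return diff * avg


# Hypothetical model F(x) = x^3, so F(2) - F(0) = 8 is the exact attribution.
def cubic_grad(x):
    return 3.0 * x * x


exact = 8.0
results = {}
for steps in (10, 50, 300):
    results[steps] = {
        rule: ig_1d(cubic_grad, 2.0, 0.0, steps, rule)
        for rule in ("left", "right", "trapezoid")
    }
    errors = {rule: abs(v - exact) for rule, v in results[steps].items()}
    print(steps, errors)
```

Because the gradient of this function increases along the path, the left sum underestimates and the right sum overestimates, while the trapezoidal rule lands between them and converges faster as the step count grows.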

03

Validating with Completeness (Axiom)

The Completeness Axiom (or Summation to Delta) is the core theoretical guarantee: the sum of Integrated Gradients attributions equals the difference between the model's output for the input and the baseline. This serves as a primary implementation sanity check.

  • Calculation: sum(attributions_i) ≈ F(input) - F(baseline).
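  • Deviation: A small gap is expected from numerical integration and should shrink as the number of steps increases. A gap that persists at high step counts indicates a bug in the gradient computation or the interpolation.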
  • Use: This axiom is used to verify the correctness of the implementation before any analysis, ensuring the explanation accounts for the entire prediction delta.
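One way to turn this check into code is to keep doubling the step count until the completeness gap falls below a tolerance. The sketch below is an assumption-laden illustration: the helper names, the 1% relative tolerance, and the toy model are all hypothetical choices.

```python
def right_riemann_ig(grad_fn, x, baseline, steps):
    """Right Riemann-sum approximation of Integrated Gradients."""
    avg = [0.0] * len(x)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg)]


def ig_until_complete(model, grad_fn, x, baseline,
                      steps=20, rel_tol=0.01, max_steps=5000):
    """Double the step count until the completeness gap is within rel_tol."""
    target = model(x) - model(baseline)
    while True:
        attrs = right_riemann_ig(grad_fn, x, baseline, steps)
        gap = abs(sum(attrs) - target)
        if gap <= rel_tol * abs(target) or steps >= max_steps:
            return attrs, gap, steps
        steps *= 2


# Hypothetical model F(x) = x0^3 + x0 * x1 and its analytic gradient.
def model(x):
    return x[0] ** 3 + x[0] * x[1]


def grad(x):
    return [3.0 * x[0] ** 2 + x[1], x[0]]


x, baseline = [2.0, 1.0], [0.0, 0.0]
attrs, gap, steps_used = ig_until_complete(model, grad, x, baseline)
print("attributions:", attrs)
print("completeness gap:", gap, "steps used:", steps_used)
```

If the loop reaches `max_steps` with a large gap still remaining, that points to a gradient or interpolation bug rather than to integration error.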
04

Assessing Sensitivity & Infidelity

A robust explanation should be stable under small input perturbations. Key validation metrics include:

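  • Sensitivity (max-sensitivity): Measures the maximum change in the attribution vector when the input is perturbed within a small neighborhood. Lower scores indicate greater robustness. This is distinct from Sensitivity-n, which tests whether the attributions of any subset of n features sum to the change in model output when those features are removed.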
  • Infidelity: Quantifies the error between the explanation's importance scores and the actual change in model output when the input is perturbed according to the explanation. High infidelity suggests the explanation does not faithfully reflect model behavior.
  • Implementation: These metrics require generating many perturbed samples (e.g., by adding Gaussian noise) and recomputing predictions and attributions, making them computationally intensive but essential for reliability assessment.
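A Monte Carlo estimate of the robustness metric can be sketched as below. The assumptions are stated plainly: perturbations are sampled uniformly from a box of the given radius rather than as Gaussian noise, the explanation change is measured in the Euclidean norm, and the toy model and function names are hypothetical.

```python
import math
import random


def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    avg = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg)]


def max_sensitivity(explain, x, radius=0.1, n_samples=50, seed=0):
    """Largest observed change in the explanation under small perturbations."""
    rng = random.Random(seed)
    reference = explain(x)
    worst = 0.0
    for _ in range(n_samples):
        perturbed = [xi + rng.uniform(-radius, radius) for xi in x]
        diff = [a - b for a, b in zip(explain(perturbed), reference)]
        worst = max(worst, math.sqrt(sum(d * d for d in diff)))
    return worst


# Hypothetical model F(x) = x0 * x1 + 3 * x2; only its gradient is needed here.
def toy_grad(x):
    return [x[1], x[0], 3.0]


def explain(x):
    return integrated_gradients(toy_grad, x, [0.0, 0.0, 0.0])


x = [2.0, 4.0, 1.0]
score_wide = max_sensitivity(explain, x, radius=0.1)
score_narrow = max_sensitivity(explain, x, radius=0.01)
print("max-sensitivity (radius 0.1):", score_wide)
print("max-sensitivity (radius 0.01):", score_narrow)
```

Each sample requires a full attribution pass, which is why these metrics are expensive for large models.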
05

Comparison to SHAP & LIME

Integrated Gradients is one of several popular attribution methods. Key distinctions:

  • vs. SHAP: SHAP is grounded in cooperative game theory (Shapley values) and can be computationally expensive. KernelSHAP is model-agnostic but approximate; DeepSHAP is a faster, model-specific approximation. IG is typically more efficient for differentiable models.
  • vs. LIME: LIME fits a local surrogate model (e.g., linear) to explain a prediction. While intuitive, LIME explanations can be unstable and may not faithfully represent the complex model's true decision boundary. IG computes attributions from the original model's own gradients rather than from a surrogate, so the only approximation involved is the numerical integration.
  • Selection Guide: Use IG for differentiable models (neural networks) where implementation efficiency and theoretical guarantees are priorities. Use SHAP or LIME for non-differentiable models (e.g., tree ensembles).
06

Visualization & Human Evaluation

For image and text models, attributions must be presented intuitively for human analysts.

  • Images: Overlay attribution scores as a heatmap (saliency map) on the original image. Use divergent color scales (e.g., red-blue) to show positive/negative contributions.
  • Text: Highlight tokens in the input text with color intensity proportional to attribution score.
  • Human-AI Agreement: The ultimate test is whether the highlighted features align with domain expert intuition for a set of canonical examples. Low agreement may indicate issues with the baseline choice, model logic, or the need for concept-based methods like TCAV.
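For the text case, a minimal rendering helper might look like the sketch below. The function name, the red/blue color convention, and the example tokens and scores are all hypothetical; the scores are made-up values for illustration, not model output.

```python
from html import escape


def highlight_tokens(tokens, attributions):
    """Render tokens as HTML spans shaded by normalized attribution score."""
    if len(tokens) != len(attributions):
        raise ValueError("tokens and attributions must have the same length")
    # Normalize by the largest magnitude; fall back to 1.0 if all scores are zero.
    scale = max((abs(a) for a in attributions), default=0.0) or 1.0
    spans = []
    for token, score in zip(tokens, attributions):
        intensity = abs(score) / scale
        rgb = "255, 0, 0" if score >= 0 else "0, 0, 255"  # red positive, blue negative
        spans.append(
            f'<span style="background-color: rgba({rgb}, {intensity:.2f})">'
            f"{escape(token)}</span>"
        )
    return " ".join(spans)


html_output = highlight_tokens(["not", "good"], [-0.8, 0.4])
print(html_output)
```

Normalizing per example makes individual explanations easy to read but hides differences in overall magnitude between examples, so a shared scale may be preferable when comparing several inputs side by side.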
INTEGRATED GRADIENTS

Frequently Asked Questions

Integrated Gradients is a foundational technique in explainable AI (XAI) for attributing a model's prediction to its input features. This FAQ addresses its core mechanics, implementation, and role in validation.

Integrated Gradients is a feature attribution method that assigns an importance score to each input feature by integrating the model's gradients along a straight-line path from a baseline input (a neutral reference point, like a black image or zero vector) to the actual input. A core axiom it satisfies is completeness, meaning the sum of the attribution scores for all features equals the difference between the model's output for the input and its output for the baseline. This provides a principled way to explain predictions from any differentiable model, such as a deep neural network, without modifying the model.

Key Mathematical Formulation: For an input x and baseline x', the attribution for the i-th feature is:

\[
\mathrm{IntegratedGrads}_i(x) = (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F\left(x' + \alpha (x - x')\right)}{\partial x_i} \, d\alpha
\]

Where F is the model function. The integral is approximated numerically (e.g., using the Trapezoidal rule).
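As one concrete form of that numerical approximation, the integral can be replaced by a Riemann sum over \(m\) equally spaced points on the path, where \(m\) is the number of steps:

\[
\mathrm{IntegratedGrads}_i^{\mathrm{approx}}(x) = (x_i - x'_i) \times \frac{1}{m} \sum_{k=1}^{m} \frac{\partial F\left(x' + \tfrac{k}{m}(x - x')\right)}{\partial x_i}
\]

Each term of the sum costs one gradient evaluation, so the cost grows linearly with \(m\).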

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.