Isotonic regression is a post-hoc calibration method that fits a piecewise constant, monotonically non-decreasing function to map a classifier's raw output scores to calibrated probabilities. As a non-parametric technique, it makes minimal assumptions about the underlying score distribution, allowing it to model complex, non-linear miscalibration patterns. It is applied using a held-out calibration set distinct from the training and test data. The method is often contrasted with parametric approaches like Platt scaling, which assumes a sigmoidal relationship.
Isotonic Regression

What is Isotonic Regression?
Isotonic regression is a non-parametric, post-hoc method for calibrating the confidence scores of machine learning classifiers.
The algorithm works by finding the best-fit non-decreasing function that minimizes the mean squared error against the true binary labels on the calibration data, typically implemented via the pool adjacent violators (PAV) algorithm. This results in a step function that bins and recalibrates the scores. While highly flexible, it requires sufficient calibration data to avoid overfitting and is primarily used for binary classification calibration, with extensions for multi-class settings. Its performance is commonly evaluated using metrics like the Expected Calibration Error (ECE) and visualized on a reliability diagram.
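The fitting step can be sketched in a few lines of plain Python. This is a minimal illustration of the pooling idea, not a reference implementation: the function name `pav_fit` and the block layout are assumptions made for this example, and the toy scores and labels are invented.

```python
from itertools import groupby


def pav_fit(scores, labels):
    """Fit a non-decreasing step function by pooling adjacent violators.

    Returns a list of (min_score, max_score, value) blocks ordered by score.
    (Hypothetical helper for illustration only.)
    """
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [label_sum, count, min_score, max_score]
    for score, group in groupby(pairs, key=lambda p: p[0]):
        ys = [y for _, y in group]  # tied scores start out in one block
        blocks.append([float(sum(ys)), len(ys), score, score])
        # Merge backwards while the previous block's mean exceeds the new one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, _, hi = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][3] = hi
    return [(lo, hi, total / count) for total, count, lo, hi in blocks]


# Toy calibration data: the labels at 0.2, 0.3, 0.4 violate monotonicity
# and end up pooled into a single step.
blocks = pav_fit([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 0, 1, 1])
for lo, hi, value in blocks:
    print(f"scores in [{lo}, {hi}] -> {value:.3f}")
```

Each returned block is one step of the calibration map: its value is the mean label of the pooled points, which is the least-squares solution under the monotonicity constraint for that group.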
Key Characteristics of Isotonic Regression
Isotonic regression is a non-parametric post-hoc calibration method that fits a piecewise constant, non-decreasing function to map a classifier's raw outputs to calibrated probabilities, making minimal assumptions about the underlying distribution.
Non-Parametric & Assumption-Free
Unlike parametric methods like Platt scaling, isotonic regression makes no assumption about the functional form (e.g., logistic) of the relationship between raw scores and true probabilities. It learns a flexible, piecewise constant mapping directly from the calibration data, making it powerful for complex, non-linear miscalibration patterns.
Monotonicity Constraint
The core constraint is that the fitted function is monotonically non-decreasing: if one input score is higher than another, its calibrated probability will be at least as high. The mapping therefore never inverts the ranking of the model's original predictions while correcting the confidence levels. Scores that fall on the same step become tied, however, so ranking metrics like AUC-ROC are largely, but not always exactly, preserved.
Piecewise Constant (Step Function) Output
The learned calibration map is a step function. Conceptually, the algorithm:
- Groups adjacent input scores into bins, merging neighboring bins whenever their averages would violate monotonicity.
- Averages the true outcomes within each bin.
- Assigns that average as the calibrated probability for all scores in that bin.
This creates a non-smooth mapping that directly reflects empirical accuracy.
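Once fitted, applying the step function is a sorted lookup. The sketch below assumes a hypothetical, already-fitted map: the `thresholds` and `values` lists are made-up numbers for illustration, not the output of any particular model.

```python
from bisect import bisect_right

# Hypothetical fitted step function (illustrative values only).
# thresholds[i] is the left edge of step i; values[i] is its calibrated probability.
thresholds = [0.0, 0.3, 0.6, 0.8]
values = [0.05, 0.20, 0.55, 0.90]


def calibrate(score):
    """Return the calibrated probability of the step that contains `score`."""
    i = bisect_right(thresholds, score) - 1
    i = max(i, 0)  # scores below the first edge fall back to the first step
    return values[i]


for s in (0.10, 0.61, 0.79, 0.95):
    print(s, "->", calibrate(s))
```

Here 0.61 and 0.79 land on the same step and receive the same calibrated probability, which is what "piecewise constant" means in practice.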
Data Efficiency & Overfitting Risk
Isotonic regression requires a sufficiently large calibration set (typically hundreds to thousands of samples) to reliably estimate the empirical accuracy within each bin. With small data, it can overfit and produce an overly complex step function that fails to generalize, making it less suitable than parametric methods in low-data regimes.
Application to Multi-Class Problems
For multi-class calibration, the standard approach is one-vs-all (OvA). A separate isotonic regression model is trained for each class using the model's score for that class versus all others. Because each class is calibrated independently, the resulting probabilities generally do not sum to one and must then be renormalized (e.g., by dividing each by their sum).
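The renormalization step can be illustrated with a small sketch. It assumes the per-class calibrated probabilities have already been produced by separate one-vs-all calibrators, and it uses division by the sum as one simple renormalization choice; the helper name `renormalize` and the numbers are invented for this example.

```python
def renormalize(per_class_probs):
    """Rescale independently calibrated class probabilities so they sum to one."""
    total = sum(per_class_probs)
    if total == 0:
        # Assumed fallback for the degenerate case: a uniform distribution.
        return [1.0 / len(per_class_probs)] * len(per_class_probs)
    return [p / total for p in per_class_probs]


# Outputs of three hypothetical one-vs-all calibrators for a single input.
calibrated = [0.75, 0.5, 0.25]  # sums to 1.5, so not yet a distribution
normalized = renormalize(calibrated)
print(normalized)
```

Dividing by the sum keeps the relative ordering of the classes unchanged, so the predicted class stays the same after renormalization.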
Comparison to Temperature Scaling
Isotonic Regression is flexible and can correct any monotonic miscalibration but requires more data and is prone to overfitting. Temperature Scaling uses a single parameter, is highly data-efficient and robust, but can only correct a specific, simple form of miscalibration (over/under-confidence). Isotonic regression is often preferred when ample calibration data is available and miscalibration is severe and complex.
Isotonic Regression vs. Other Calibration Methods
A technical comparison of post-hoc calibration techniques based on their underlying assumptions, flexibility, computational characteristics, and suitability for different data regimes.
| Feature / Characteristic | Isotonic Regression | Platt Scaling (Sigmoid Calibration) | Temperature Scaling |
|---|---|---|---|
| Mathematical Formulation | Piecewise constant, non-decreasing function (non-parametric) | Logistic regression (parametric) | Single scaling parameter applied to logits (parametric) |
| Core Assumption | Monotonic relationship between scores and true probabilities | Mapping from scores to true probabilities is sigmoidal | Optimal adjustment is a uniform scaling of logits |
| Flexibility / Complexity | High (can model any monotonic shape) | Medium (constrained to sigmoid shape) | Low (single degree of freedom) |
| Data Efficiency | Low (requires ~1000+ samples for stable fit) | Medium (requires ~100s of samples) | High (can be fit with ~10s of samples) |
| Risk of Overfitting | High (with small calibration sets) | Medium | Very Low |
| Primary Use Case | Binary classification with large, representative calibration sets | Binary classification with moderate calibration sets | Multi-class neural network calibration with limited data |
| Output Guarantee | Produces calibrated probabilities in [0,1] | Produces calibrated probabilities in [0,1] | Produces valid probability distributions (sum to 1) |
| Computational Cost (Fit) | O(n log n) (sorting, then a linear PAV pass) | O(n) per iteration of logistic regression optimization | O(n) per iteration of scalar optimization |
| Computational Cost (Inference) | O(log m) for m bins (piecewise lookup) | O(1) (sigmoid evaluation) | O(1) (scalar multiplication & softmax) |
| Differentiable | | | |
| Extends to Multi-Class | Via 1-vs-all or multi-class PAVA (complex) | Via 1-vs-all (common) | |
| Handles Non-Monotonic Scores | | | |
| Commonly Used For | Non-neural models (e.g., SVMs, boosted trees), well-defined binary tasks | SVMs, boosted trees, binary neural classifiers | Deep neural networks (especially image classifiers) |
| Library Implementation | Scikit-learn's | Scikit-learn's | Custom implementation or libraries like |
Practical Applications of Isotonic Regression
Isotonic regression is a powerful non-parametric tool for aligning a model's predicted probabilities with reality. Its key applications extend beyond simple calibration to areas requiring monotonic relationships and reliable confidence estimates.
Post-Hoc Classifier Calibration
This is the canonical use case. Isotonic regression is applied to the raw scores (logits) of a pre-trained classifier to produce calibrated probabilities. It fits a piecewise constant, non-decreasing function that maps the classifier's outputs to probabilities that accurately reflect the true likelihood of an event.
- Process: A held-out calibration set is used to fit the isotonic model. For each input, the classifier's score and the true binary label form the training data for the regression.
- Advantage over parametric methods: Unlike Platt scaling (logistic regression), it makes no assumption about the sigmoidal shape of the miscalibration, allowing it to correct complex, non-linear miscalibration patterns.
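As a rough sketch of this process, the example below fits scikit-learn's `IsotonicRegression` on a synthetic calibration set. The data generation is an assumption made purely for illustration (scores drawn uniformly, with the true probability set to the square of the score so that the raw scores are overconfident); in practice `raw_scores` and `labels` would come from a held-out calibration set.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Synthetic stand-in for a held-out calibration set (assumption for this sketch).
rng = np.random.default_rng(0)
raw_scores = rng.uniform(0.0, 1.0, size=2000)
true_prob = raw_scores ** 2            # raw scores overstate the true probability
labels = rng.binomial(1, true_prob)    # observed binary outcomes

# Fit a non-decreasing map from raw score to empirical frequency.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_scores, labels)

# Apply the fitted map to new scores.
new_scores = np.array([0.2, 0.5, 0.9])
calibrated = calibrator.predict(new_scores)
print(dict(zip(new_scores.tolist(), calibrated.round(3).tolist())))
```

Setting `out_of_bounds="clip"` makes scores outside the range seen during fitting map to the nearest fitted value instead of producing NaN.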
Medical Diagnostic Scoring & Risk Assessment
Isotonic regression is used to calibrate risk prediction models where a monotonic relationship between a score and risk probability is clinically required. For example, converting a composite health score into a well-calibrated probability of disease onset or mortality.
- Example: A model outputs a "severity score" from 1 to 100 for a patient. Isotonic regression maps these scores to calibrated probabilities of ICU admission, ensuring that a score of 80 always implies a higher risk than a score of 60.
- Benefit: Provides clinicians with reliable, interpretable probabilities for decision-making, adhering to the natural monotonic constraint that higher scores mean higher risk.
Credit Scoring & Probability of Default
In financial risk modeling, credit scores must map monotonically to the probability of default (PD). Isotonic regression is used to transform the outputs of complex machine learning models into PDs that satisfy this regulatory and business requirement.
- Regulatory Compliance: Regulations like Basel II/III require calibrated PD estimates. Isotonic regression ensures the monotonicity demanded by credit risk frameworks.
- Process: A model predicts a default propensity score. Isotonic regression on historical default data produces the final PD, guaranteeing that a customer with a higher propensity score receives a higher PD.
E-Commerce & Click-Through Rate (CTR) Calibration
Ranking models for ads or recommendations often output scores that need to be converted into accurate estimated CTRs for bid optimization and inventory pricing. Isotonic regression calibrates these scores using historical impression/click data.
- Use Case: A deep learning model scores an ad's relevance. The raw score is not a probability. Isotonic regression calibrates it to a true CTR (e.g., 0.05 means a 5% click probability).
- Business Impact: Enables accurate value estimation for real-time bidding systems, where bids are often calculated as bid = CTR * value_per_click.
Enhancing Conformal Prediction
Isotonic regression can be integrated with conformal prediction to create more efficient (tighter) prediction sets. While conformal prediction provides coverage guarantees, it does not by itself optimize the size of the prediction sets. Calibrating the underlying model's scores with isotonic regression often leads to scores that better reflect uncertainty, resulting in smaller, more precise prediction sets while maintaining the same coverage guarantee.
- Mechanism: Better-calibrated probabilities allow the conformal score function (e.g., 1 - p_true) to more accurately rank examples by uncertainty.
- Result: For a target 90% coverage, the prediction sets generated from a calibrated model will typically contain fewer irrelevant labels compared to an uncalibrated model.
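The mechanism can be sketched with a minimal split-conformal procedure built on the 1 - p_true score. The function names, the toy probabilities, and the 80% coverage target below are assumptions for illustration; the probabilities stand in for outputs that have already been calibrated.

```python
import math


def conformal_threshold(cal_probs_true, alpha):
    """Quantile of the scores 1 - p_true on a conformal calibration set."""
    scores = sorted(1.0 - p for p in cal_probs_true)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return 1.0  # too few points for this alpha: include every label
    return scores[rank - 1]


def prediction_set(class_probs, qhat):
    """All labels whose score 1 - p is within the threshold."""
    return [k for k, p in enumerate(class_probs) if 1.0 - p <= qhat]


# Probability assigned to the true label on nine held-out examples (toy values).
cal_probs_true = [0.9, 0.4, 0.7, 0.3, 0.6, 0.85, 0.2, 0.5, 0.65]
qhat = conformal_threshold(cal_probs_true, alpha=0.2)

confident = prediction_set([0.90, 0.06, 0.04], qhat)
uncertain = prediction_set([0.50, 0.35, 0.15], qhat)
print(qhat, confident, uncertain)
```

The confident input yields a single-label set while the uncertain one yields two labels, which is the behavior that better-calibrated probabilities are meant to sharpen.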
Calibrating Large Language Model (LLM) Outputs
Isotonic regression is applied to calibrate confidence scores for LLM generations, such as the probability assigned to a multiple-choice answer or the confidence in a factual statement. This is critical for trustworthy AI and enabling models to express meaningful uncertainty.
- Challenge: LLMs are often overconfident, and their softmax probabilities frequently do not reflect true likelihoods.
- Application: On a validation set of questions, the model's predicted probability for the chosen answer is recorded along with correctness (1/0). Isotonic regression fits a mapping from these flawed probabilities to calibrated ones.
- Outcome: A calibrated LLM can more reliably use its own confidence score to decide when to abstain or seek human help, a key component of selective calibration.
Frequently Asked Questions
Isotonic regression is a non-parametric post-processing method used to calibrate a classifier's confidence scores. These questions address its core mechanics, applications, and trade-offs for machine learning practitioners.
Isotonic regression is a non-parametric post-hoc calibration method that fits a piecewise constant, monotonically non-decreasing function to map a classifier's raw output scores (e.g., logits or unscaled probabilities) to calibrated probabilities that accurately reflect the true likelihood of correctness. Unlike parametric methods like Platt scaling, it makes minimal assumptions about the shape of the underlying score distribution, allowing it to model complex, non-linear miscalibration patterns. The algorithm works by finding a function that minimizes the mean squared error against the true binary labels on a calibration set, subject to the constraint that the function's output never decreases as the input score increases. This ensures the ordinal ranking of predictions is preserved while correcting confidence estimates.
Related Terms
Isotonic regression is a core technique within a broader ecosystem of methods for ensuring a model's confidence scores are trustworthy. The following terms define related calibration approaches, evaluation metrics, and operational concepts.
Platt Scaling
Platt scaling (or sigmoid calibration) is a parametric post-hoc calibration method for binary classifiers. It fits a logistic regression model to the classifier's raw output scores (logits) to map them to calibrated probabilities. Unlike isotonic regression, it assumes a specific sigmoidal shape for the calibration function.
- Parametric vs. Non-Parametric: Assumes a specific functional form (sigmoid), whereas isotonic regression is non-parametric.
- Application: Primarily used for binary classification. Extension to multi-class settings (e.g., OvR - One-vs-Rest) is common.
- Efficiency: Requires fitting only two parameters (slope and intercept), making it less flexible but more data-efficient than isotonic regression on very small calibration sets.
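For comparison, the two-parameter fit can be sketched with plain gradient descent on the log loss. This is a simplified illustration, not Platt's original procedure (which also smooths the target labels); the function names, learning rate, step count, and toy data are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy calibration data: raw scores and binary outcomes (not linearly separable).
scores = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_platt(scores, labels)
print(a, b, sigmoid(a * 1.5 + b))
```

Whatever the data looks like, the fitted map is always a single sigmoid, which is the source of both its data efficiency and its limited flexibility.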
Temperature Scaling
Temperature scaling is a single-parameter, post-hoc calibration technique that applies a scalar 'temperature' (T) to the logits of a neural network before the softmax function. A T > 1 softens the output distribution (reducing confidence), while T < 1 sharpens it.
- Simplicity: A lightweight, widely used method for modern neural networks, especially in multi-class settings.
- Limitation: Only adjusts the 'spread' of the distribution, maintaining the original ranking of classes. Because every logit is rescaled by the same factor, it cannot correct miscalibration that differs across classes or score ranges.
- Relation to Isotonic: Isotonic regression is more flexible and can learn any non-linear mapping that is monotonically non-decreasing, whereas temperature scaling applies a single global rescaling to the logits.
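The effect of the temperature can be seen in a short sketch. The logits below are invented, and T = 2.0 is an arbitrary value chosen for illustration; in practice T would be fitted on a calibration set.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a scalar temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.0]
original = softmax_with_temperature(logits, 1.0)
softened = softmax_with_temperature(logits, 2.0)  # T > 1 reduces confidence
print(original)
print(softened)
```

The top class is the same in both outputs; only the confidence assigned to it changes.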
Expected Calibration Error (ECE)
Expected Calibration Error (ECE) is a primary scalar metric for quantifying miscalibration. It bins predictions based on their confidence score and computes the weighted average of the absolute difference between average confidence and empirical accuracy within each bin.
- Calculation: ECE = Σ (n_b / N) * |acc(b) - conf(b)|, where n_b is the number of samples in bin b, N is total samples, acc(b) is the accuracy in bin b, and conf(b) is the average confidence in bin b.
- Diagnostic Use: A well-calibrated model has an ECE near zero. It is the standard benchmark for evaluating methods like isotonic regression.
- Limitation: Sensitive to the number and placement of bins. Adaptive binning schemes can mitigate this.
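The calculation above can be sketched as follows, assuming equal-width bins over [0, 1]; the function name, the default of 10 bins, and the toy inputs are choices made for this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (n_b / N) * |acc(b) - conf(b)| with equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        conf = sum(c for c, _ in members) / len(members)
        acc = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(acc - conf)
    return ece


# Toy predictions: confidence of the predicted class and whether it was correct.
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [1, 1, 0, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 4))
```

In this toy case both occupied bins have 50% accuracy, so the gaps are 0.45 (weight 4/6) and 0.15 (weight 2/6), giving an ECE of 0.35.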
Reliability Diagram
A reliability diagram is the primary visual tool for diagnosing calibration. It plots a model's average predicted confidence (x-axis) against its observed empirical accuracy (y-axis) across multiple confidence bins.
- Interpretation: Points on the diagonal (y=x) indicate perfect calibration. Deviations below the diagonal signify overconfidence; deviations above signify underconfidence.
- Visualizing Isotonic Fit: The output of isotonic regression is a piecewise constant function that can be overlaid on a reliability diagram, showing how it 'pushes' the original miscalibrated curve toward the ideal diagonal.
- Foundation: The graphical counterpart to the numerical ECE metric.
Post-Hoc Calibration
Post-hoc calibration is the overarching paradigm for applying calibration transformations after a model is trained, without modifying its internal parameters. Isotonic regression is a canonical non-parametric method within this family.
- Key Principle: Decouples the task of achieving high accuracy (discrimination) from the task of achieving calibrated probabilities (calibration).
- Workflow: Requires a held-out calibration set, distinct from training and test data, to fit the calibration mapping (e.g., the isotonic function).
- Other Methods: Includes Platt scaling, temperature scaling, and beta calibration. The choice depends on data size, model type, and the nature of miscalibration.
Calibration Set
A calibration set (or hold-out calibration set) is a dedicated dataset used exclusively to fit the parameters of a post-hoc calibration method. It is a critical component for techniques like isotonic regression and Platt scaling.
- Data Partitioning: Typically drawn from the same distribution as the training data but held out from both the training and test/validation sets.
- Size Requirements: Non-parametric methods like isotonic regression generally require more calibration data (hundreds to thousands of samples) than parametric methods like temperature scaling to avoid overfitting.
- Operational Role: In production, maintaining a representative calibration set is essential for periodic recalibration to combat calibration drift.
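The partitioning can be sketched as a simple three-way random split. The function name, the 20% fractions, and the fixed seed are assumptions for illustration; real pipelines may need stratified or time-based splits instead.

```python
import random


def three_way_split(items, cal_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then carve out disjoint calibration and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_cal = int(len(shuffled) * cal_fraction)
    n_test = int(len(shuffled) * test_fraction)
    calibration = shuffled[:n_cal]
    test = shuffled[n_cal:n_cal + n_test]
    train = shuffled[n_cal + n_test:]
    return train, calibration, test


# Example with 100 record IDs standing in for labeled examples.
train, calibration, test = three_way_split(range(100))
print(len(train), len(calibration), len(test))
```

The model is trained on `train`, the calibration map is fitted on `calibration`, and calibration quality is reported on `test`, so no subset is reused for two roles.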

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.