Calibration via Platt, formally known as Platt scaling, is a parametric post-hoc calibration method for binary classifiers. It fits a logistic regression model—specifically, a sigmoid function—to the raw, uncalibrated scores (logits) output by a model. This learned transformation maps these scores to well-calibrated probability estimates that accurately reflect the true likelihood of a positive outcome. The method requires a held-out calibration set, distinct from training and test data, to fit its two parameters (slope and intercept).
Calibration via Platt

What is Calibration via Platt?
Calibration via Platt is a common shorthand for applying Platt scaling, a logistic regression-based method, to transform a classifier's scores into calibrated probabilities.
The technique is named for its inventor, John Platt, who introduced it for calibrating Support Vector Machine outputs. Its primary advantage is simplicity and a low risk of overfitting, since the model has only two parameters. However, it assumes that the relationship between the raw scores and the true positive-class probability is sigmoidal, which may not hold for all classifiers. For multi-class problems, the method is usually applied per class within a One-vs-Rest (OvR) framework. It is a foundational technique often compared to temperature scaling (simpler, a single parameter) and isotonic regression (more flexible, non-parametric).
Key Characteristics of Platt Scaling
Platt scaling is a parametric, post-hoc calibration technique that applies logistic regression to a classifier's outputs to produce calibrated probability estimates.
Parametric Logistic Mapping
Platt scaling fits a logistic regression model with two parameters (a scaling weight and a bias term) to map the classifier's raw scores (logits) to calibrated probabilities. The transformation is defined as: P(y=1 | s) = 1 / (1 + exp(A * s + B)), where s is the classifier's score. In this form A is typically negative, so that higher scores map to higher probabilities; it is equivalent to the sigmoid σ(a * s + b) with a = -A and b = -B. The method assumes that the relationship between the scores and the true positive-class probability is sigmoidal, which is often a reasonable approximation for discriminative models such as SVMs and neural networks.
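As a minimal sketch of the mapping above, the following Python evaluates P(y=1 | s) = 1 / (1 + exp(A * s + B)) and checks that it matches the sigmoid form σ(a * s + b) with a = -A and b = -B. The function names and the parameter values are hypothetical, chosen only for illustration.

```python
import math

def platt_probability(score: float, A: float, B: float) -> float:
    """Platt's form: P(y=1 | s) = 1 / (1 + exp(A*s + B)), evaluated stably."""
    z = A * score + B
    if z >= 0:
        e = math.exp(-z)          # avoids overflow in exp(z) for large z
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def sigmoid(x: float) -> float:
    """Standard logistic function 1 / (1 + exp(-x)), evaluated stably."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Hypothetical parameters: A is negative so higher scores give higher probabilities.
A, B = -2.0, 0.5
for s in (-2.0, 0.0, 2.0):
    p = platt_probability(s, A, B)
    # Same value as the sigmoid form with a = -A, b = -B.
    assert abs(p - sigmoid(-A * s - B)) < 1e-12
    print(f"score={s:+.1f} -> P(y=1)={p:.4f}")
```

The two branches exist only for numerical stability; both compute the same expression.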
Requires a Held-Out Calibration Set
The logistic regression parameters (A, B) are not learned during the original model training. They are estimated using a separate calibration set—a held-out dataset not used for training or final testing. This set should be representative of the target distribution. The method minimizes the negative log-likelihood on this set, treating the true class labels as the target for the logistic regression. Using the training data for calibration would lead to overfitting and unreliable probability estimates.
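The fitting step can be sketched as follows: plain gradient descent on the mean negative log-likelihood over (score, label) pairs from a calibration set. This is an illustrative, dependency-free sketch rather than a reference implementation. The synthetic data, the learning rate, and the iteration count are assumptions, and a production fit would normally use a library optimizer.

```python
import math
import random

def platt_prob(s, A, B):
    """P(y=1 | s) = 1 / (1 + exp(A*s + B)), evaluated stably."""
    z = A * s + B
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def mean_nll(scores, labels, A, B, eps=1e-12):
    """Mean negative log-likelihood of the labels under the Platt mapping."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = min(max(platt_prob(s, A, B), eps), 1.0 - eps)  # guard log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)

def fit_platt(scores, labels, lr=0.5, n_iter=1000):
    """Gradient descent on the mean NLL; returns (A, B)."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_A = grad_B = 0.0
        for s, y in zip(scores, labels):
            residual = y - platt_prob(s, A, B)  # d(NLL)/d(A*s + B)
            grad_A += residual * s
            grad_B += residual
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B

# Synthetic calibration set (an assumption, for illustration only):
# labels are drawn from a known Platt mapping with A = -1.5, B = 0.5.
random.seed(0)
cal_scores = [random.gauss(0.0, 2.0) for _ in range(500)]
cal_labels = [1 if random.random() < platt_prob(s, -1.5, 0.5) else 0
              for s in cal_scores]

A_fit, B_fit = fit_platt(cal_scores, cal_labels)
print(f"A={A_fit:.3f}, B={B_fit:.3f}, "
      f"NLL={mean_nll(cal_scores, cal_labels, A_fit, B_fit):.4f}")
```

Because the objective is convex in (A, B), any reasonable optimizer should reach the same solution; gradient descent is used here only because it needs no external library.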
Primarily for Binary Classification
The standard Platt scaling formulation is designed for binary classification and calibrates the score for the positive class. For multi-class problems, the common extension is the One-vs-Rest (OvR) strategy: calibrate each class against all others independently, then normalize the resulting probabilities so they sum to one (e.g., by dividing each by their total). This requires fitting one calibrator per class, and the renormalized outputs are not guaranteed to be well calibrated jointly. For neural networks, a dedicated multi-class method such as temperature scaling is often preferred.
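A rough sketch of the One-vs-Rest extension, assuming each class already has its own fitted sigmoid parameters and that normalization is done by dividing by the sum (one simple choice among several). The function name and all parameter values are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic function, evaluated stably."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def ovr_platt_probabilities(class_scores, params):
    """class_scores[k] is the raw score for class k; params[k] = (a_k, b_k)."""
    # Step 1: calibrate each class independently with its own sigmoid.
    per_class = [sigmoid(a * s + b) for s, (a, b) in zip(class_scores, params)]
    # Step 2: renormalize so the probabilities sum to one.
    total = sum(per_class)
    if total == 0.0:
        return [1.0 / len(per_class)] * len(per_class)
    return [p / total for p in per_class]

# Hypothetical per-class parameters in the sigmoid(a*s + b) convention.
params = [(1.2, -0.3), (0.8, 0.1), (1.5, -0.5)]
probs = ovr_platt_probabilities([2.0, -1.0, 0.5], params)
print([round(p, 3) for p in probs])
```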
Post-Hoc and Model-Agnostic
Platt scaling is a post-hoc method, meaning it is applied after a model is fully trained, without modifying its internal parameters. It is also model-agnostic; it works on the scores output by any classifier, including Support Vector Machines (where it was originally developed), boosted trees, and neural networks. This decoupling allows calibration to be treated as a separate pipeline step, facilitating integration into existing MLOps workflows.
Risk of Overfitting on Small Calibration Sets
Although its two-parameter form keeps the risk low, Platt scaling can still overfit when the calibration set is very small (e.g., a few dozen instances), particularly if one class is rare. The logistic regression model may then fit noise rather than the true miscalibration pattern. In such cases, a more constrained method such as single-parameter temperature scaling may be more robust; isotonic regression, being more flexible and non-parametric, generally needs more calibration data, not less. The reliability of Platt scaling is highly dependent on the size and quality of the calibration data.
Comparison to Temperature Scaling
For neural networks, temperature scaling is a simpler, more constrained special case of Platt scaling. Temperature scaling uses a single parameter T (temperature) to scale all logits: logits_scaled = logits / T. Platt scaling, with its two parameters (A, B), is more flexible. However, this flexibility can be a disadvantage for neural nets, as the extra degree of freedom can lead to overfitting on the calibration set, whereas temperature scaling's single parameter often provides more reliable calibration with modern deep learning models.
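To make the single-parameter idea concrete, here is a small sketch that picks T by grid search on the calibration NLL. The grid, the toy logits, and the function names are assumptions for illustration; real implementations usually optimize T with a gradient-based or line-search routine.

```python
import math

def softmax(logits, T=1.0):
    """Softmax of logits / T, with max-subtraction for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    return -sum(math.log(softmax(row, T)[y])
                for row, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels):
    """Grid search over T in (0, 10]; returns the T with the lowest NLL."""
    candidates = [0.05 * k for k in range(1, 201)]
    return min(candidates, key=lambda T: mean_nll(logit_rows, labels, T))

# Toy calibration set: the model always predicts class 0 with ~99% confidence,
# but is right only 7 times out of 10 (an overconfident model).
logit_rows = [[5.0, 0.0, 0.0]] * 10
labels = [0] * 7 + [1] * 3

T = fit_temperature(logit_rows, labels)
before = softmax(logit_rows[0])[0]
after = softmax(logit_rows[0], T)[0]
print(f"T={T:.2f}, confidence before={before:.3f}, after={after:.3f}")
```

In this toy case the fitted temperature is greater than 1, which softens the logits so that the confidence for class 0 moves toward the observed 70% accuracy. Dividing by T does not change the argmax, so accuracy is unaffected.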
Platt Scaling vs. Other Calibration Methods
A feature comparison of common post-hoc calibration techniques for binary and multi-class classifiers.
| Feature / Metric | Platt Scaling (Sigmoid Calibration) | Temperature Scaling | Isotonic Regression |
|---|---|---|---|
| Method Type | Parametric | Parametric | Non-Parametric |
| Underlying Model | Logistic Regression | Single Scalar (Temperature) | Piecewise Constant, Non-Decreasing Function |
| Primary Use Case | Binary Classification | Multi-Class Classification | Binary & Multi-Class (1-vs-All) |
| Data Efficiency | Requires ~100+ samples | Requires ~100+ samples | Requires ~1000+ samples |
| Risk of Overfitting | Low (2 parameters) | Very Low (1 parameter) | Medium (flexible function) |
| Output Guarantee | Monotonic transformation | Monotonic transformation | Monotonic transformation |
| Computational Cost | Low (convex optimization) | Very Low (scalar optimization) | Medium (pool adjacent violators algorithm) |
| Handles Multi-Modal Distributions | | | |
| Common Evaluation Metric | Brier Score, ECE, NLL | Brier Score, ECE, NLL | Brier Score, ECE, NLL |
| Typical Performance (ECE Reduction) | 30-70% | 20-60% | 40-80% |
Frequently Asked Questions
Platt scaling, commonly referred to as calibration via Platt, is a foundational technique in machine learning for ensuring a classifier's confidence scores are trustworthy. This FAQ addresses its core mechanics, applications, and relationship to other calibration methods.
Platt scaling is a parametric, post-hoc calibration method that transforms a binary classifier's raw output scores (logits) into well-calibrated probability estimates by fitting a logistic regression model. The technique works by taking the original model's scores on a held-out calibration set and learning two parameters: a scaling factor and a bias term. These parameters adjust each score s via the logistic sigmoid function, σ(a * s + b), to produce probabilities that more accurately reflect the true likelihood of the positive class. It assumes that the relationship between the uncalibrated scores and the true probability is sigmoidal, which is often a reasonable approximation for outputs from models like Support Vector Machines (SVMs) and neural networks.
Related Terms
Platt scaling is a foundational technique within the broader discipline of model calibration. These related terms define the metrics, alternative methods, and operational concepts that surround its application.
Platt Scaling
Platt scaling is the specific parametric method for which 'Calibration via Platt' is a shorthand. It fits a logistic regression model with two parameters (a weight and a bias) to the logits (pre-softmax scores) of a binary classifier. This learned sigmoid function transforms the scores into well-calibrated probabilities that better reflect the true likelihood of correctness.
- Core Mechanism: Applies P_calibrated = σ(a * s + b), where s is the model's raw score, and a and b are learned on a held-out calibration set.
- Assumption: Assumes the relationship between the raw scores and the true probability is sigmoidal, which often holds for outputs from models like SVMs or neural networks.
Temperature Scaling
Temperature scaling is a simpler, single-parameter alternative to Platt scaling, primarily used for multi-class neural network classifiers. It applies a scalar 'temperature' (T) to soften or sharpen the logits before the softmax: softmax(logits / T).
- Key Difference: Uses one learned parameter vs. Platt's two, making it less flexible but more stable with limited calibration data.
- Primary Use Case: The de facto standard for calibrating modern neural networks with a softmax output layer, whereas Platt scaling is more common for binary outputs or scores from models like SVMs.
Isotonic Regression
Isotonic regression is a powerful non-parametric calibration method that fits a piecewise constant, non-decreasing function to map scores to probabilities. It makes minimal assumptions about the underlying score distribution.
- Advantage over Parametric Methods: More flexible than Platt or temperature scaling and can model complex, non-sigmoidal miscalibration patterns.
- Disadvantage: Requires more calibration data to avoid overfitting and can be less stable than parametric methods on small datasets. 'Calibration via Isotonic' is its common implementation shorthand.
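The piecewise constant, non-decreasing fit can be sketched with the pool adjacent violators algorithm. This is a simplified illustration that assumes distinct scores and predicts with a step function; library implementations typically handle ties and may interpolate between points. The function names and the toy data are hypothetical.

```python
from bisect import bisect_right

def isotonic_fit(scores, labels):
    """Pool adjacent violators: returns (sorted_scores, non-decreasing fitted values)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block is [sum of labels, count]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds the current one's.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)  # every point in a block gets the block mean
    return [scores[i] for i in order], fitted

def isotonic_predict(xs, fitted, score):
    """Step-function lookup: value of the nearest fitted point at or below score."""
    idx = bisect_right(xs, score) - 1
    return fitted[max(idx, 0)]

scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 0, 1, 1]
xs, fitted = isotonic_fit(scores, labels)
print([round(f, 3) for f in fitted])
print(round(isotonic_predict(xs, fitted, 0.35), 3))
```

Here the out-of-order labels at scores 0.2 to 0.4 are pooled into a single block with value 1/3, which is how the method enforces monotonicity without assuming any particular curve shape.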
Expected Calibration Error (ECE)
Expected Calibration Error (ECE) is the primary scalar metric for quantifying miscalibration. It works by:
- Binning predictions based on their confidence score (e.g., 0.0-0.1, 0.1-0.2).
- For each bin, calculating the difference between the average confidence (predicted probability) and the empirical accuracy (fraction correct).
- Taking a weighted average of these absolute differences.
A lower ECE indicates better calibration. It is the standard benchmark for evaluating methods like Platt scaling.
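The three steps above can be sketched as follows, assuming equal-width bins and that each prediction comes with a confidence (the probability of the predicted class) and a flag for whether it was correct. The function name and the toy data are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |average confidence - accuracy| over equal-width bins."""
    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for c, ok in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        conf_sum[b] += c
        hits[b] += int(ok)
        counts[b] += 1
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        if counts[b] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[b] / counts[b]
        accuracy = hits[b] / counts[b]
        ece += (counts[b] / n) * abs(avg_conf - accuracy)
    return ece

# Four predictions at 95% confidence (3 correct), two at 55% (1 correct).
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))  # approximately 0.15
```

In the toy data, the high-confidence bin has a gap of 0.20 with weight 4/6 and the other bin a gap of 0.05 with weight 2/6, giving about 0.15. The result depends on the number of bins, so comparisons between methods should use the same binning.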
Calibration Set
A calibration set (or hold-out validation set) is a critical data partition used exclusively for fitting post-hoc calibration methods. It must be distinct from the training data (to avoid overfitting) and the test data (to ensure unbiased evaluation).
- Purpose for Platt Scaling: This set provides the (score, true label) pairs used to learn the logistic regression parameters a and b.
- Size Considerations: Typically requires hundreds to thousands of samples. Too small a set leads to poorly estimated parameters; using the test set for calibration invalidates performance metrics.
Post-Hoc Calibration
Post-hoc calibration is the overarching category of techniques that adjust a trained model's outputs without modifying its internal weights. Platt scaling is a canonical example.
- Key Principle: Treats the base model as a fixed black box that produces (potentially miscalibrated) scores. A separate, lightweight calibration function is then learned on top.
- Workflow: 1. Train model. 2. Generate scores on a calibration set. 3. Learn calibration mapping (e.g., sigmoid for Platt). 4. Apply mapping to new predictions.
- Contrast with Calibration-Aware Training: Does not change the core training loop, making it simple to deploy on existing models.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.