Platt scaling, also known as sigmoid calibration, is a post-hoc calibration method that fits a logistic regression model to the raw output scores (such as SVM decision values or logits) of a pre-trained binary classifier to produce probability estimates that better reflect the true likelihood of the positive class. It transforms the classifier's scores by learning two parameters (a scaling factor and a bias term) on a held-out calibration set, mapping the scores through a sigmoid function to generate calibrated probabilities between 0 and 1.
Platt Scaling

What is Platt Scaling?
Platt scaling is a foundational parametric method for calibrating the confidence scores of machine learning classifiers.
The method assumes that the relationship between the classifier's uncalibrated scores and the true class probability is sigmoidal, i.e., that the log-odds of the positive class are linear in the score. This often holds approximately for outputs from support vector machines (SVMs) and many neural networks. While simple and effective for binary tasks, its parametric assumption becomes a limitation when the true mapping is not sigmoid-shaped. Platt scaling is a core technique within the broader field of post-hoc calibration, alongside non-parametric methods like isotonic regression and the single-parameter temperature scaling commonly used for neural networks.
Key Characteristics of Platt Scaling
Platt scaling is a parametric, post-hoc calibration method that transforms a classifier's raw scores into well-calibrated probability estimates using logistic regression.
Parametric Logistic Mapping
Platt scaling fits a logistic regression model with two parameters (A, B) to the classifier's raw outputs (for example, SVM decision values or logits). The mapping is defined as: P(y=1 | s) = 1 / (1 + exp(A * s + B)), where s is the classifier score. Under this sign convention, A is typically negative when higher scores indicate the positive class. The sigmoid ensures outputs lie strictly between 0 and 1. The method assumes that the true probability is a sigmoidal function of the score, which is often a reasonable approximation for discriminative models like SVMs.
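As a minimal sketch of the mapping above, the following function evaluates P(y=1 | s) for a given score. The function name `platt_probability` and the parameter values are hypothetical and chosen only for illustration; in practice A and B are learned from a calibration set.

```python
import math

def platt_probability(score: float, A: float, B: float) -> float:
    """Map a raw classifier score to P(y=1 | s) = 1 / (1 + exp(A*s + B))."""
    z = A * score + B
    # Evaluate 1 / (1 + exp(z)) without overflowing for large |z|.
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

# Hypothetical parameters (not learned here). A is negative so that
# higher scores map to higher probabilities.
A, B = -1.5, 0.2
probs = [platt_probability(s, A, B) for s in (-2.0, 0.0, 2.0)]
print(probs)
```

With a negative A the mapping is monotonically increasing in the score, so calibration changes the probability values but not the ranking of examples.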
Requires a Held-Out Calibration Set
The method is post-hoc, meaning it is applied after the base model is fully trained. It requires a separate calibration dataset, distinct from the training and test sets. This dataset is used solely to learn the parameters A and B via maximum likelihood estimation. Using the training data for calibration would lead to overfitting and unreliable probability estimates on new data. The size of this set directly impacts the stability of the parameter estimates.
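The maximum likelihood fit described above can be sketched with a small Newton solver. This is an illustrative implementation under stated assumptions, not a reference one: the function names are hypothetical, and the calibration scores and labels are synthetic stand-ins for held-out outputs of a trained model.

```python
import math
import random

def sigmoid_neg(z):
    """Stable evaluation of 1 / (1 + exp(z))."""
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def nll(A, B, scores, labels):
    """Negative log-likelihood under P(y=1|s) = 1 / (1 + exp(A*s + B))."""
    total = 0.0
    for s, y in zip(scores, labels):
        z = A * s + B
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))  # log(1 + e^z)
        total += softplus - (1 - y) * z
    return total

def fit_platt(scores, labels, iters=50, tol=1e-10):
    """Fit (A, B) by Newton's method with step halving."""
    A, B = 0.0, 0.0
    for _ in range(iters):
        gA = gB = hAA = hAB = hBB = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid_neg(A * s + B)
            w = p * (1.0 - p)
            gA += (y - p) * s      # gradient of the NLL w.r.t. A
            gB += (y - p)          # gradient of the NLL w.r.t. B
            hAA += w * s * s       # Hessian entries
            hAB += w * s
            hBB += w
        hAA += 1e-12
        hBB += 1e-12
        det = hAA * hBB - hAB * hAB
        dA = (hBB * gA - hAB * gB) / det
        dB = (hAA * gB - hAB * gA) / det
        step, current = 1.0, nll(A, B, scores, labels)
        # Halve the step until the loss does not increase.
        while step > 1e-8 and nll(A - step * dA, B - step * dB, scores, labels) > current:
            step *= 0.5
        A, B = A - step * dA, B - step * dB
        if abs(step * dA) + abs(step * dB) < tol:
            break
    return A, B

# Synthetic held-out calibration data (assumption: true A = -2.0, B = 0.5).
random.seed(0)
cal_scores = [random.gauss(0.0, 1.5) for _ in range(2000)]
cal_labels = [1 if random.random() < sigmoid_neg(-2.0 * s + 0.5) else 0 for s in cal_scores]
A_hat, B_hat = fit_platt(cal_scores, cal_labels)
print(A_hat, B_hat)
```

The fitted values should land near the parameters used to generate the synthetic labels; with a much smaller calibration set the estimates would scatter more widely around them.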
Primarily for Binary Classification
The standard formulation is designed for binary classification. It calibrates the scores for the positive class. For multi-class problems, the standard approach is the one-vs-rest (OvR) strategy: train a separate Platt scaling calibrator for each class against all others, then normalize the resulting probabilities across classes. This can be computationally intensive and may not guarantee perfect multi-class calibration.
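The one-vs-rest procedure above might look like the following sketch. The per-class parameters are hypothetical placeholders; in a real pipeline each (A, B) pair would be fitted on the calibration set for its class against all others.

```python
import math

def platt(score, A, B):
    """P(y=1 | s) = 1 / (1 + exp(A*s + B)), evaluated stably."""
    z = A * score + B
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def ovr_calibrate(class_scores, params):
    """Apply one calibrator per class, then renormalize to sum to 1."""
    raw = [platt(s, A, B) for s, (A, B) in zip(class_scores, params)]
    total = sum(raw)
    if total == 0.0:
        # Degenerate case: every calibrator underflowed to 0.
        return [1.0 / len(raw)] * len(raw)
    return [p / total for p in raw]

# Hypothetical (A, B) pairs for a 3-class problem.
params = [(-1.2, 0.1), (-0.8, -0.3), (-1.5, 0.4)]
probs = ovr_calibrate([2.0, -0.5, 0.3], params)
print(probs)
```

The renormalization step is what makes the outputs a valid distribution, and it is also why the result is not guaranteed to be calibrated jointly across classes.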
Computational Efficiency
The calibration process is computationally inexpensive. Fitting the logistic regression model is a convex optimization problem that converges quickly. At inference time, applying the calibration is a simple affine transformation of the logit followed by a sigmoid, adding negligible latency. This makes it highly suitable for production systems where throughput and low latency are critical.
Risk of Overfitting on Small Data
The method's main weakness is its susceptibility to overfitting when the calibration set is small. With limited data, the estimated parameters (A, B) can have high variance, leading to poorly calibrated probabilities on new data. This risk is mitigated by using a sufficiently large calibration set (typically hundreds to thousands of samples) or by applying regularization (e.g., L2 penalty) during the logistic regression fit.
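A related safeguard, distinct from the L2 penalty mentioned above, comes from Platt's original formulation: the hard 0/1 labels are replaced with slightly smoothed targets before fitting, which keeps the sigmoid from being driven to extreme values on small sets. The sketch below only computes those targets; the function name is hypothetical.

```python
def smoothed_targets(labels):
    """Replace hard 0/1 labels with Platt-style smoothed targets.

    Positives get (N+ + 1) / (N+ + 2) and negatives get 1 / (N- + 2),
    where N+ and N- are the class counts in the calibration set.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    return [t_pos if y == 1 else t_neg for y in labels]

targets = smoothed_targets([1, 1, 1, 0])
print(targets)
```

The smoothed targets are then used in place of the labels in the logistic regression likelihood. The effect is strongest on small calibration sets and vanishes as the class counts grow.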
Common Base Classifiers
Platt scaling was originally developed for Support Vector Machines (SVMs), which output uncalibrated decision values. It is also applied to other models that produce scores interpretable as confidence measures, including:
- Linear models (logistic regression, although often already calibrated)
- Boosted trees (e.g., XGBoost, LightGBM)
- Neural networks (using pre-softmax logits)
It is less commonly applied to models like naive Bayes, whose probability estimates may be distorted in ways that a sigmoid does not fit well.
Platt Scaling vs. Other Calibration Methods
A technical comparison of post-hoc calibration techniques for aligning a model's predicted confidence with its empirical accuracy.
| Feature / Characteristic | Platt Scaling (Sigmoid Calibration) | Temperature Scaling | Isotonic Regression |
|---|---|---|---|
| Method Type | Parametric | Parametric | Non-Parametric |
| Underlying Model | Logistic Regression | Single Scalar (Temperature) | Piecewise Constant, Non-Decreasing Function |
| Assumption on Score Distribution | Score-to-probability mapping is sigmoidal | Logits are rescalable by a single factor | Minimal; only monotonic relationship |
| Typical Data Requirement | ~1,000 calibration instances | ~100-1,000 calibration instances | ~1,000+ calibration instances (more sensitive to sample size) |
| Output Flexibility | Calibrates binary probabilities (multi-class via one-vs-rest) | Calibrates multi-class probabilities | Calibrates binary probabilities (multi-class via one-vs-rest) |
| Risk of Overfitting | Moderate (2 parameters) | Low (1 parameter) | High (can overfit with small data) |
| Computational Cost (Fit) | Low (convex optimization) | Very Low (one-dimensional search or convex optimization) | Moderate (pair-adjacent violators algorithm) |
| Computational Cost (Apply) | Very Low (sigmoid function) | Very Low (scalar multiplication & softmax) | Low (piecewise constant lookup) |
| Handles Non-Monotonic Miscalibration | No (sigmoid is monotonic) | No (rescaling preserves logit order) | No (fit is constrained to be non-decreasing) |
| Primary Use Case | Binary classifiers (e.g., SVM, boosted trees) | Neural networks with softmax output | Any classifier, especially with complex miscalibration patterns |
| Key Limitation | Assumed sigmoid shape may be incorrect | Only adjusts confidence spread, not shape | Prone to overfitting on small calibration sets |
Applications and Use Cases
Platt scaling is a foundational technique for aligning a model's predicted confidence with reality. Its primary applications are in domains where reliable probability estimates are critical for downstream decision-making and risk assessment.
Medical Diagnostics & Risk Scoring
In clinical settings, a model's predicted probability directly informs treatment decisions. A radiologist needs to know if a '90% confidence' in a tumor detection is truly a 90% likelihood. Platt scaling calibrates these scores, enabling:
- Informed patient triage based on reliable risk stratification.
- Cost-benefit analysis for invasive follow-up procedures.
- Integration into clinical decision support systems where overconfidence can lead to harmful false assurances.
Financial Fraud Detection
Transaction fraud models output a 'fraud score.' For operational efficiency, analysts set a probability threshold to flag transactions for review. After Platt scaling, a score of 0.95 should mean that roughly 95% of transactions receiving that score are fraudulent, allowing for:
- Precise resource allocation: High-confidence alerts are prioritized.
- Accurate false positive rate control, directly impacting customer experience and operational costs.
- Regulatory reporting that requires statistically sound estimates of risk exposure.
Calibrating Modern Neural Networks
Deep neural networks, particularly those trained with cross-entropy loss, are often overconfident: their softmax outputs sum to 1 but tend to overstate the empirical accuracy. Platt scaling is applied post-training to:
- Rectify miscalibration in models like ResNets, Vision Transformers, and large language models (LLMs) for classification tasks.
- Serve as a baseline method against which newer techniques like temperature scaling are compared.
- Provide a simple, effective fix without retraining the entire model, crucial for production efficiency.
Foundation for Conformal Prediction
Conformal prediction is a framework for generating prediction sets with guaranteed statistical coverage (e.g., 90% of sets will contain the true label). It relies on a non-conformity score that measures how unusual a candidate label is for a given input. Probabilities calibrated with Platt scaling can be used to build such scores (for example, one minus the probability assigned to a label), which can lead to:
- Tighter, more efficient prediction sets compared to using raw model scores.
- Coverage guarantees that follow from the exchangeability of calibration and test data rather than from the probabilities being calibrated; better-calibrated scores mainly affect how informative the resulting sets are.
- Applications in high-stakes areas like autonomous driving (predicting pedestrian intent) where understanding uncertainty is safety-critical.
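To make the connection concrete, here is a minimal sketch of split conformal prediction that uses one minus the calibrated probability as the non-conformity score. The function names and the small calibration sample are hypothetical, and the coverage guarantee assumes that calibration and test examples are exchangeable.

```python
import math

def conformal_threshold(cal_probs_true, alpha):
    """Threshold on the score 1 - p(true label) from a calibration set."""
    scores = sorted(1.0 - p for p in cal_probs_true)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return 1.0  # too few calibration points: include every label
    return scores[k - 1]

def prediction_set(class_probs, q_hat):
    """All labels whose non-conformity score is within the threshold."""
    return [k for k, p in enumerate(class_probs) if 1.0 - p <= q_hat]

# Hypothetical calibrated probabilities assigned to the TRUE label
# for seven held-out calibration examples.
cal_probs_true = [0.9, 0.8, 0.7, 0.95, 0.6, 0.85, 0.5]
q_hat = conformal_threshold(cal_probs_true, alpha=0.25)
labels = prediction_set([0.7, 0.25, 0.05], q_hat)
print(q_hat, labels)
```

Sharper, better-calibrated probabilities tend to produce smaller sets at the same target coverage, which is where a calibration step such as Platt scaling can help.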
Resource-Constrained & Edge AI
On edge devices, model retraining is often infeasible. Platt scaling offers a lightweight post-processing step. A simple logistic regression model can be fitted on a small calibration set and deployed alongside the main model to:
- Dramatically improve decision reliability with minimal compute overhead.
- Enable dynamic confidence-based filtering; for example, a wildlife camera trap can discard low-confidence images, saving bandwidth.
- Restore calibration as the primary model becomes stale by refitting only the two calibration parameters on fresh labeled data, extending the model's useful life.
Critical Limitations & Assumptions
Platt scaling is not a universal solution. Its effectiveness hinges on specific conditions:
- Binary Classification Focus: It is designed for binary tasks. Multi-class problems require the One-vs-Rest strategy, fitting a separate calibrator per class.
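- Parametric Assumption: It assumes the relationship between logits and true probability follows a sigmoid curve. If this assumption is violated (e.g., a monotonic but non-sigmoidal relationship), non-parametric methods like Isotonic Regression may be superior. Neither Platt scaling nor isotonic regression can correct a non-monotonic relationship, since both apply a monotonic mapping to the scores.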
- Calibration Set Quality: Performance degrades if the calibration data is not i.i.d. with the test/production data or is too small, leading to poorly estimated parameters.
Frequently Asked Questions
Platt scaling is a foundational technique for aligning a model's predicted confidence with its actual accuracy. These questions address its core mechanics, applications, and how it compares to other methods.
Platt scaling is a parametric post-hoc calibration method that fits a logistic regression model to the raw, unnormalized outputs (logits) of a binary classifier to produce better-calibrated probability estimates. It takes the classifier's scores, which may be poorly calibrated (e.g., predictions made with a confidence of 0.9 are correct far less than 90% of the time), and maps them to new probabilities via a sigmoid function with learned parameters.
The process requires a held-out calibration set, distinct from the training data. On this set, it learns two parameters: a scaling factor A (which plays a role similar to an inverse temperature) and a bias term B. The calibrated probability is calculated as P(calibrated) = 1 / (1 + exp(A * score + B)). This affine transformation followed by a sigmoid often brings the reliability diagram much closer to the diagonal, making the model's confidence scores more meaningful and trustworthy.
Related Terms
Platt scaling is one method within a broader ecosystem of techniques designed to ensure a model's confidence scores are trustworthy. These related concepts cover alternative calibration methods, evaluation metrics, and operational frameworks.
Temperature Scaling
A single-parameter post-hoc calibration method for neural networks. It divides the logits by a scalar 'temperature' (T) before the softmax: softmax(logits / T). A T > 1 softens the output distribution (reduces overconfidence), while T < 1 sharpens it. It can be viewed as a single-parameter extension of Platt scaling to multi-class settings (a shared scale and no bias term), and it is often the first baseline due to its simplicity and low risk of overfitting. Because dividing by a positive T preserves the ranking of the logits, it does not change the predicted class.
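A short sketch of the softmax(logits / T) operation follows. The logits and temperature values are made up for illustration; in practice T is fitted on a calibration set, typically by minimizing negative log-likelihood.

```python
import math

def softmax_with_temperature(logits, T):
    """Compute softmax(logits / T) with the usual max-subtraction for stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.0]                          # hypothetical pre-softmax logits
sharp = softmax_with_temperature(logits, T=1.0)   # unscaled softmax
soft = softmax_with_temperature(logits, T=2.0)    # T > 1 softens the distribution
print(max(sharp), max(soft))
```

The top-class probability drops when T is raised from 1 to 2, while the index of the top class stays the same.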
Isotonic Regression
A non-parametric post-hoc calibration method. It fits a piecewise constant, non-decreasing function to map raw classifier scores to calibrated probabilities. Unlike parametric methods (Platt, Temperature), it makes minimal assumptions about the shape of the calibration mapping. It is more flexible and can model complex miscalibration patterns but requires more calibration data and is prone to overfitting on small sets.
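For intuition, the non-decreasing fit can be sketched with the pool-adjacent-violators procedure. This is a simplified illustration (it assumes distinct scores and unweighted samples), and the function name `pava` is hypothetical rather than a library API.

```python
def pava(scores, labels):
    """Least-squares non-decreasing fit of labels ordered by score.

    Returns the scores in sorted order and the fitted value for each one.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block is [sum of labels, count]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return [scores[i] for i in order], fitted

xs, fitted = pava([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 1, 1, 1])
print(fitted)
```

The out-of-order pair at scores 0.2 and 0.3 is pooled into a single step of height 0.5. The number and position of such steps is learned from the data, which is the source of both the method's flexibility and its tendency to overfit small sets.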
Expected Calibration Error (ECE)
The primary scalar metric for quantifying miscalibration. It works by:
- Binning predictions based on their confidence score (e.g., 0-0.1, 0.1-0.2).
- For each bin, calculating the average confidence and the empirical accuracy.
- Computing a weighted average of the absolute difference between confidence and accuracy across all bins.
A lower ECE indicates better calibration. It is a standard benchmark for comparing calibration methods.
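The three steps above can be sketched as follows, using equal-width bins. The function name, the choice of 10 bins, and the toy inputs are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: predicted confidence in [0, 1] for each example
    correct:     1 if the prediction was right, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Toy example: four predictions at 0.95 (three correct), two at 0.55 (one correct).
ece = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95, 0.55, 0.55],
    [1, 1, 1, 0, 1, 0],
)
print(ece)
```

Here the high-confidence bin has a gap of 0.20 with weight 4/6 and the other bin a gap of 0.05 with weight 2/6, giving an ECE of 0.15. The value depends on the binning scheme, so comparisons between methods should use the same bins.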
Reliability Diagram
The visual counterpart to the ECE. It is a diagnostic plot where:
- The x-axis represents the average predicted confidence within a bin.
- The y-axis represents the observed empirical accuracy within that bin.
- A perfectly calibrated model's points lie on the diagonal (y=x) line.
Deviations from the diagonal reveal the nature of miscalibration: points below the line indicate overconfidence, while points above indicate underconfidence.
Brier Score
A proper scoring rule that evaluates the overall quality of probabilistic predictions. For binary classification, it is the mean squared error between the predicted probability and the true outcome (0 or 1). The score combines two aspects:
- Calibration: How well confidence matches accuracy.
- Refinement/Sharpness: How concentrated the predictions are near 0 or 1.
A lower Brier score is better. It is used both as a training loss and an evaluation metric.
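For the binary case described above, the score reduces to a one-line mean squared error. The function name and inputs in this sketch are illustrative.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

score = brier_score([0.9, 0.2, 0.6], [1, 0, 1])
print(score)
```

For these three predictions the squared errors are 0.01, 0.04, and 0.16, so the score is 0.07. A constant prediction of 0.5 always scores 0.25, which is a useful reference point.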
Calibration Set
A held-out dataset used exclusively for fitting post-hoc calibration parameters. Critical requirements:
- Must be distinct from the training and test sets.
- Should be representative of the expected production data distribution.
- Size matters: Parametric methods (Platt, Temperature) need less data (~1000 samples), while non-parametric methods (Isotonic) need more to avoid overfitting.
Using the test set for calibration is a methodological error that invalidates performance estimates.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.