A calibration curve is a diagnostic plot that visualizes the relationship between a machine learning model's predicted probability scores and the actual observed frequency of correctness for those predictions. It is a core tool for assessing confidence calibration, determining if a model's stated confidence (e.g., "80% sure") matches its empirical accuracy. A perfectly calibrated model's curve follows the diagonal line where predicted probability equals observed accuracy. Deviations from this line reveal overconfidence (curve below diagonal) or underconfidence (curve above diagonal).
Calibration Curve

What is a Calibration Curve?
A calibration curve is a diagnostic plot that visualizes the relationship between a model's predicted probabilities and the actual observed frequencies of correctness, used to assess and improve confidence calibration.
In agentic self-evaluation, calibration curves are critical for enabling autonomous systems to reliably assess the trustworthiness of their own outputs. A well-calibrated agent can use its internal confidence scores to trigger self-correction loops or abstention mechanisms when uncertainty is high. The curve is constructed by binning predictions by their confidence score and plotting each bin's mean predicted probability against its actual accuracy. Metrics like Expected Calibration Error (ECE) quantify miscalibration from this plot. Post-training techniques such as temperature scaling (which rescales a model's logits) or isotonic regression (which remaps its output probabilities) improve calibration, making the model's self-assessed confidence actionable for downstream decision-making.
Key Characteristics of Calibration Curves
A calibration curve is a fundamental diagnostic plot for assessing the reliability of a probabilistic classifier's confidence scores. It visualizes the alignment between predicted probabilities and empirical outcomes.
Definition and Purpose
A calibration curve (or reliability diagram) plots a model's predicted probabilities (binned on the x-axis) against the observed frequency of positive outcomes (y-axis) for instances in each bin. Its primary purpose is to diagnose miscalibration—where a model's confidence scores do not reflect true likelihoods. For example, if a model predicts a 70% probability of an event, a well-calibrated model should see that event occur roughly 70% of the time in reality. This is critical for decision-making under uncertainty, such as in medical diagnosis or autonomous systems, where confidence must be trustworthy.
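The binning described above can be sketched in a few lines. This is a minimal illustration that assumes equal-width bins and binary labels; the function name `reliability_bins` and the sample data are hypothetical, not taken from any library.

```python
def reliability_bins(probs, labels, n_bins=10):
    """Return (mean_predicted, observed_frequency, count) for each non-empty bin."""
    sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        # Equal-width bins over [0, 1]; p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        sums[idx] += p
        positives[idx] += y
        counts[idx] += 1
    return [
        (sums[i] / counts[i], positives[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


# Illustrative data: predicted probabilities and observed binary outcomes.
probs = [0.05, 0.15, 0.15, 0.72, 0.75, 0.78, 0.95, 0.95]
labels = [0, 0, 1, 1, 1, 0, 1, 1]

points = reliability_bins(probs, labels)
for mean_pred, freq, n in points:
    print(f"predicted={mean_pred:.2f} observed={freq:.2f} n={n}")
```

Each returned tuple is one point on the curve: the x-coordinate is the bin's mean predicted probability and the y-coordinate is the observed frequency. Bins with few samples give noisy points, so the count is returned alongside.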
The Perfect Calibration Line
The ideal calibration is represented by a diagonal line from (0,0) to (1,1), where predicted probability perfectly matches observed frequency. Deviations from this line indicate miscalibration:
- Overconfidence (predicted probabilities too high): The curve lies below the diagonal. The model predicts probabilities that exceed the observed frequency (e.g., predicts 90% confidence but is only correct 70% of the time). Common in modern deep neural networks.
- Underconfidence (predicted probabilities too low): The curve lies above the diagonal. The model is too conservative, assigning lower probabilities than the empirical accuracy warrants (e.g., predicts 50% confidence but is correct 80% of the time).
Quantifying Miscalibration: Expected Calibration Error (ECE)
The Expected Calibration Error (ECE) is the primary scalar metric derived from a calibration curve. It quantifies the average absolute difference between confidence and accuracy.
Calculation:
- Partition predictions into M bins (e.g., 10 bins of width 0.1).
- For each bin m, calculate:
  - Average Confidence: Mean predicted probability in the bin.
  - Average Accuracy: Fraction of correct predictions in the bin.
- Compute a weighted average over the M bins:
ECE = Σ_m (n_m / N) * |Accuracy_m - Confidence_m|
Where n_m is the number of samples in bin m and N is the total samples. A lower ECE indicates better calibration. ECE is sensitive to binning strategy, so the number of bins (M) must be chosen carefully.
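The calculation can be sketched as follows. This is a minimal version assuming equal-width bins, with `confidences` holding the probability assigned to each predicted answer and `correct` holding 1 or 0 for whether that answer was right; the function name and sample data are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    n = len(confidences)
    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    counts = [0] * n_bins
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the last bin
        conf_sum[idx] += c
        hit_sum[idx] += ok
        counts[idx] += 1

    ece = 0.0
    for i in range(n_bins):
        if counts[i] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[i] / counts[i]
        avg_acc = hit_sum[i] / counts[i]
        ece += (counts[i] / n) * abs(avg_acc - avg_conf)
    return ece


# Ten predictions made at 90% confidence, of which only seven are correct.
confidences = [0.9] * 10
correct = [1] * 7 + [0] * 3
ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))  # 0.2
```

Rerunning the same data with a different `n_bins` is a quick way to see how much a reported ECE depends on the binning choice.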
Relationship to Other Metrics
Calibration is distinct from, yet complementary to, standard discriminative performance metrics:
- vs. Accuracy: A model can have high accuracy but be poorly calibrated (e.g., a classifier that is right 95% of the time but reports a confidence of only 0.51 for every prediction).
- vs. AUC-ROC: The Area Under the ROC Curve measures ranking ability, not the absolute correctness of probability scores. A model can have perfect AUC but terrible calibration.
- Brier Score: This proper scoring rule decomposes into Calibration Loss + Refinement Loss. The calibration loss component directly measures miscalibration, making the Brier Score a unified metric for both discrimination and calibration.
Key Insight: For high-stakes applications, both high discriminative performance (AUC/Accuracy) and good calibration (low ECE) are required.
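To make the AUC point concrete, the sketch below compares two hypothetical score sets with the same ranking. The pairwise `auc_roc` helper and the data are illustrative assumptions; a production system would more likely use a library implementation.

```python
def auc_roc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 0, 1, 1, 1]
sharp = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
# Same ordering, but every score is squeezed into a narrow band just above 0.5.
squashed = [0.5 + 0.02 * p for p in sharp]

auc_sharp = auc_roc(sharp, labels)
auc_squashed = auc_roc(squashed, labels)
mean_positive_score = sum(s for s, y in zip(squashed, labels) if y == 1) / 3

print(auc_sharp, auc_squashed)        # identical: ranking is unchanged
print(round(mean_positive_score, 3))  # about 0.516, although every such case is positive
```

Because AUC depends only on the ordering of scores, any strictly increasing transformation leaves it unchanged while moving the calibration curve arbitrarily far from the diagonal.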
Calibration Methods (Post-Processing)
If a calibration curve reveals miscalibration, it can often be corrected via post-processing techniques applied to a trained model's outputs:
- Platt Scaling: Fits a logistic regression model to map the model's scores to calibrated probabilities. Best for maximum-margin classifiers like SVMs.
- Isotonic Regression: A non-parametric method that fits a piecewise constant, non-decreasing function. More powerful but requires more data and can overfit.
- Temperature Scaling (for neural networks): A single-parameter variant of Platt Scaling used to soften (temperature > 1) or sharpen (temperature < 1) the softmax distribution of a neural network. It is the most common method for modern LLMs and vision models.
These methods are typically fit on a held-out validation set, separate from the data used to train the model, to avoid overfitting the calibration map.
Use in Agentic Self-Evaluation
Within autonomous agent systems, calibration curves are a core tool for self-evaluation and recursive error correction.
- Confidence-Aware Abstention: An agent can use its calibration curve to set a confidence threshold. If its confidence for a planned action or answer falls below a calibrated level (e.g., where accuracy drops sharply), it can abstain and trigger a fallback (e.g., querying a human, using a different tool).
- Feedback for Iterative Refinement: Miscalibration detected via a curve can be a signal to an agent's self-critique mechanism that its internal confidence scoring is flawed, prompting it to adjust its reasoning or seek verification.
- Monitoring Drift: Tracking calibration curves over time in production can detect model drift or distribution shift, where the agent's confidence becomes misaligned with new data, triggering retraining or recalibration.
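As a rough sketch of confidence-aware abstention, the code below picks a threshold from validation data and applies it at decision time. The function names, the 0.9 accuracy target, and the validation values are all hypothetical; a real agent would define its own fallback behavior.

```python
def pick_abstention_threshold(confidences, correct, target_accuracy=0.9):
    """Lowest confidence cut-off at which the retained predictions meet the target accuracy."""
    best = None
    for t in sorted(set(confidences), reverse=True):
        kept = [ok for c, ok in zip(confidences, correct) if c >= t]
        if sum(kept) / len(kept) >= target_accuracy:
            best = t  # keep lowering the cut-off while the target still holds
    return best  # None means no cut-off reaches the target


def act_or_abstain(confidence, threshold):
    """Gate an action on the threshold; abstain when no safe threshold exists."""
    if threshold is None or confidence < threshold:
        return "abstain"
    return "act"


# Illustrative validation data: confidence scores and whether each output was correct.
val_conf = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.50]
val_ok = [1, 1, 1, 1, 0, 1, 0]

threshold = pick_abstention_threshold(val_conf, val_ok)
print(threshold)                         # 0.8
print(act_or_abstain(0.82, threshold))   # act
print(act_or_abstain(0.60, threshold))   # abstain
```

A threshold chosen this way is only as trustworthy as the validation set it came from, which is one reason to recompute it when the monitoring described above signals drift.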
Common Miscalibration Patterns
This table identifies and describes typical patterns of miscalibration observed in a model's predicted probabilities, as visualized on a calibration curve. Each pattern indicates a specific systemic bias in how the model expresses confidence.
| Pattern Name | Calibration Curve Shape | Model Confidence Bias | Common Causes | Corrective Actions |
|---|---|---|---|---|
| Overconfidence | S-shaped curve below the diagonal | Predicts probabilities that are too high for correct predictions and too low for incorrect ones. | Overfitting, lack of regularization, training on easy datasets, using poorly calibrated base models (e.g., deep neural networks). | Apply temperature scaling, use label smoothing during training, incorporate mixup or other data augmentation, employ model ensembles. |
| Underconfidence | Inverted S-shaped curve above the diagonal | Predicts probabilities that are too conservative, clustering near 0.5, failing to express high certainty. | Excessive regularization (high weight decay), underfitting, training on very noisy or ambiguous labels. | Reduce regularization strength, improve model capacity/architecture, clean training data, use Platt scaling. |
| Optimistic Bias | Curve consistently below the diagonal | Systematically overestimates the probability of the positive class across most confidence levels. | Class imbalance where the model is biased towards the majority class, miscalibrated decision thresholds. | Apply class re-weighting or balanced sampling, calibrate with Isotonic Regression, adjust decision threshold post-calibration. |
| Pessimistic Bias | Curve consistently above the diagonal | Systematically underestimates the probability of the positive class across most confidence levels. | Class imbalance with bias against the minority class, loss functions that heavily penalize false positives. | Apply class re-weighting, use Focal Loss to down-weight easy negatives, calibrate with Isotonic Regression. |
| Mid-Range Miscalibration | Curve is well-calibrated at extremes (0 and 1) but deviates in the middle (e.g., 0.3-0.7). | Poor confidence estimation for ambiguous cases, while being confident on clear-cut predictions. | Limited model capacity for edge cases, dataset lacks sufficient examples of moderate-difficulty samples. | Collect more data for ambiguous cases, use Bayesian neural networks to capture epistemic uncertainty, employ ensemble methods. |
| Extremization | Curve has a shallower slope than the diagonal, crossing it near 0.5. | Model converts accurate but moderate predictions into overconfident extreme probabilities (near 0 or 1). | Post-processing methods like Platt scaling with overly aggressive parameters, certain ensemble methods that sharpen distributions. | Re-calibrate with a gentler scaling method (e.g., temperature scaling with T > 1), use ensemble averaging instead of voting. |
| Bimodal Miscalibration | Curve has a non-monotonic, wavy pattern (e.g., zigzag). | Confidence is accurate for some subsets of data but inaccurate for others, often tied to hidden data subgroups. | Dataset contains distinct subpopulations with different feature distributions that the model has learned to treat differently. | Perform subgroup calibration, use multi-model approaches for different data types, apply more expressive calibration methods like Beta calibration. |
Methods for Model Calibration
Calibration curves are a primary diagnostic tool. These methods quantify and correct the gap between a model's predicted confidence and its empirical accuracy, a cornerstone of reliable agentic self-evaluation.
Platt Scaling
A parametric method that fits a logistic regression model to the outputs of a classifier to transform its scores into well-calibrated probabilities. It is particularly effective for support vector machines and is also widely used as a post-processing step for neural networks.
- Process: Learns two parameters (A, B) to map scores: P(y=1|s) = 1 / (1 + exp(A * s + B)).
- Use Case: Best for models where scores are not naturally probabilistic, like margin-based classifiers.
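The mapping above can be fit with plain gradient descent on the log loss. The sketch below is a simplified illustration: it omits the target smoothing used in Platt's original procedure, and the learning rate, step count, and sample scores are arbitrary assumptions.

```python
import math


def platt_probability(score, a, b):
    """P(y=1|s) = 1 / (1 + exp(A * s + B)), the form given above."""
    return 1.0 / (1.0 + math.exp(a * score + b))


def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit (A, B) by batch gradient descent on the mean log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = platt_probability(s, a, b)
            # For this parameterization, d(loss)/dA = (y - p) * s and d(loss)/dB = (y - p).
            grad_a += (y - p) * s
            grad_b += y - p
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Illustrative margin-style scores from a held-out set (not linearly separable).
scores = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_platt(scores, labels)
print(platt_probability(-2.0, a, b), platt_probability(2.0, a, b))
```

With this sign convention, A comes out negative whenever higher scores indicate the positive class, so larger scores map to larger probabilities.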
Isotonic Regression
A non-parametric, binning-free method that fits a piecewise constant, non-decreasing function to map uncalibrated scores to calibrated probabilities. It is more flexible than Platt Scaling but requires more data to avoid overfitting.
- Process: Learns a stepwise function that minimizes the squared error while preserving the order of predictions.
- Use Case: Ideal for complex, non-sigmoidal miscalibration patterns where no simple parametric form is assumed.
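One standard way to compute such a stepwise fit is the pool-adjacent-violators algorithm. The sketch below is a minimal version that assumes distinct scores (ties are not pooled) and uses hypothetical helper names and data.

```python
from bisect import bisect_right


def pav_fit(scores, labels):
    """Pool adjacent violators: returns (sorted_scores, non-decreasing fitted values)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block is [sum_of_labels, count]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds the new one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [scores[i] for i in order], fitted


def pav_predict(sorted_scores, fitted, score):
    """Step-function lookup: use the fitted value of the nearest score at or below `score`."""
    idx = bisect_right(sorted_scores, score) - 1
    return fitted[max(idx, 0)]


scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 0, 1, 1]

xs, fitted = pav_fit(scores, labels)
print(fitted)                          # three middle-low scores are pooled to 1/3
print(pav_predict(xs, fitted, 0.35))
```

The flat segments in the output are the "piecewise constant" steps; with little data they can be wide, which is where the overfitting risk noted above comes from.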
Temperature Scaling
A lightweight, single-parameter variant of Platt Scaling used specifically for modern neural networks. It softens or sharpens the pre-softmax logits using a learned temperature parameter (T).
- Formula: softmax(logits / T).
- Key Property: Preserves the predicted class ranking (accuracy) while adjusting the confidence distribution. A T > 1 smooths overconfident predictions; T < 1 increases confidence separation.
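A minimal sketch of fitting T follows. It assumes validation logits are available and selects T by grid search over the negative log-likelihood; gradient-based optimization is the more common choice in practice, and the grid range and sample logits here are arbitrary.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / T."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on the grid that minimizes validation NLL."""
    grid = grid or [0.25 + 0.05 * k for k in range(116)]  # 0.25 up to 6.0
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))


# Illustrative validation logits: very confident everywhere, but the third row is wrong.
val_logits = [[8.0, 0.0, 0.0], [0.0, 7.0, 0.0], [9.0, 0.0, 0.0], [0.0, 0.0, 8.0], [6.0, 0.0, 0.0]]
val_labels = [0, 1, 1, 2, 0]

T = fit_temperature(val_logits, val_labels)
print(T)
print(max(softmax(val_logits[2])), max(softmax(val_logits[2], T)))  # confidence before and after
```

Dividing by a positive T never changes which logit is largest, so the predicted class for every row is the same before and after scaling.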
Bayesian Binning into Quantiles (BBQ)
A Bayesian extension of histogram binning that accounts for uncertainty in the calibration mapping itself. Instead of a single calibration curve, it produces a distribution over possible curves.
- Mechanism: Uses Bayesian model averaging over different numbers and placements of bins.
- Advantage: Provides uncertainty estimates for the calibration itself, which is critical for high-stakes agentic decision-making where confidence in confidence is required.
Ensemble-based Calibration
Leverages the diversity of model ensembles—like bagging or deep ensembles—to improve both accuracy and calibration inherently. The variance in predictions across ensemble members provides a natural signal for uncertainty.
- Direct Method: Average the predicted probabilities from all ensemble members.
- Indirect Method: Use the ensemble's disagreement (e.g., variance of predictions) as a feature for a meta-calibrator like Platt Scaling.
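Both methods start from the same two statistics per input, sketched below. The function name and the member probabilities are hypothetical, and the variance is shown only as a candidate feature for a downstream calibrator, not as a calibrated quantity in itself.

```python
def ensemble_summary(member_probs):
    """Mean and variance of the positive-class probability across ensemble members."""
    k = len(member_probs)
    mean = sum(member_probs) / k
    variance = sum((p - mean) ** 2 for p in member_probs) / k
    return mean, variance


# Illustrative outputs from a three-member ensemble on two different inputs.
agreeing = [0.90, 0.88, 0.92]
disagreeing = [0.20, 0.90, 0.60]

mean_a, var_a = ensemble_summary(agreeing)
mean_d, var_d = ensemble_summary(disagreeing)
print(round(mean_a, 3), round(var_a, 5))  # direct method: averaged probability, low disagreement
print(round(mean_d, 3), round(var_d, 5))  # high disagreement signals uncertainty
```

For the indirect method, pairs such as `(mean, variance)` would be the inputs to the meta-calibrator, fit on held-out data.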
Expected Calibration Error (ECE)
The primary quantitative metric for evaluating calibration, not a correction method. It measures miscalibration by binning predictions by confidence and comparing the average confidence in each bin to the bin's accuracy.
- Calculation: ECE = Σ (|B_m| / n) * |acc(B_m) - conf(B_m)| across M bins.
- Interpretation: A perfect ECE of 0.0 means confidence matches accuracy perfectly. It is the benchmark against which all calibration methods are measured.
Frequently Asked Questions
Essential questions about calibration curves, a core diagnostic tool for assessing and improving the confidence calibration of AI models and autonomous agents.
A calibration curve is a diagnostic plot that visualizes the relationship between a machine learning model's predicted probabilities and the actual observed frequencies of correctness. It works by grouping predictions into bins based on their confidence scores (e.g., 0-0.1, 0.1-0.2) and plotting the average predicted probability in each bin against the actual fraction of positive outcomes. A perfectly calibrated model's curve follows the 45-degree diagonal line, meaning a prediction of 80% confidence is correct 80% of the time. Deviations from this line reveal miscalibration, such as overconfidence (curve below the diagonal) or underconfidence (curve above the diagonal).
Related Terms
A calibration curve is a core diagnostic for assessing a model's self-awareness. These related concepts detail the mechanisms and metrics used to build reliable, self-correcting autonomous systems.
Confidence Calibration
Confidence calibration is the process of ensuring a model's predicted probability scores accurately reflect the true likelihood of its outputs being correct. A well-calibrated model that predicts an 80% probability should be correct 80% of the time. This is foundational for trustworthy decision-making in autonomous agents, allowing them to know when they know.
- Goal: Align subjective confidence with objective accuracy.
- Challenge: Modern neural networks, especially large language models, are often poorly calibrated and overconfident.
- Method: Techniques like temperature scaling and Platt scaling are post-processing methods used to adjust a model's output probabilities.
Expected Calibration Error (ECE)
Expected Calibration Error (ECE) is a scalar metric that quantifies the miscalibration of a probabilistic model. It is the primary quantitative score derived from a calibration curve.
- Calculation: Predictions are binned by confidence (e.g., 0.0-0.1, 0.1-0.2). ECE computes a weighted average of the absolute difference between average confidence and average accuracy within each bin.
- Interpretation: An ECE of 0.05 means the model's confidence deviates from its accuracy by 5 percentage points, on average. Lower is better.
- Limitation: ECE depends on binning strategy. Adaptive Calibration Error (ACE) is a variant that uses equal-size bins by sample count.
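The equal-count binning idea can be sketched as below. This is a simplified top-label variant written for illustration: it keeps the ECE-style weighting and changes only how bins are formed, so it should not be read as the full published ACE definition. The function name and data are assumptions.

```python
def equal_mass_calibration_error(confidences, correct, n_bins=5):
    """ECE-style gap computed over bins that each hold roughly the same number of samples."""
    pairs = sorted(zip(confidences, correct))
    n = len(pairs)
    error = 0.0
    for b in range(n_bins):
        # Integer slicing splits the sorted samples into n_bins near-equal chunks.
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        avg_conf = sum(c for c, _ in chunk) / len(chunk)
        avg_acc = sum(ok for _, ok in chunk) / len(chunk)
        error += (len(chunk) / n) * abs(avg_acc - avg_conf)
    return error


confidences = [0.6, 0.6, 0.9, 0.9]
correct = [1, 0, 1, 1]
err = equal_mass_calibration_error(confidences, correct, n_bins=2)
print(round(err, 3))  # 0.1
```

Equal-count bins avoid the nearly empty bins that equal-width binning produces when most confidences cluster near 1.0.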
Selective Prediction
Selective prediction (or rejection option) is a technique where a model abstains from making a prediction when its confidence is below a predefined threshold. This directly leverages calibration to improve operational reliability.
- Mechanism: A calibrated confidence score is used as a gating function. If confidence < threshold, the system returns "I don't know" or routes the query to a fallback.
- Trade-off: Creates a coverage-accuracy curve. As the system abstains more (lower coverage), the accuracy on the remaining predictions typically increases, provided the confidence score is informative about correctness.
- Use Case: Critical for production AI where incorrect low-confidence outputs are costlier than no output.
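The trade-off can be tabulated directly from held-out predictions. The sketch below uses a hypothetical `coverage_accuracy` helper and made-up data to produce one (threshold, coverage, accuracy) row per candidate threshold.

```python
def coverage_accuracy(confidences, correct, thresholds):
    """For each threshold, report the share of inputs answered and accuracy on those answers."""
    n = len(confidences)
    rows = []
    for t in thresholds:
        kept = [ok for c, ok in zip(confidences, correct) if c >= t]
        coverage = len(kept) / n
        accuracy = sum(kept) / len(kept) if kept else None  # undefined when nothing is answered
        rows.append((t, coverage, accuracy))
    return rows


confidences = [0.99, 0.90, 0.80, 0.70, 0.60, 0.55]
correct = [1, 1, 1, 0, 1, 0]

rows = coverage_accuracy(confidences, correct, thresholds=[0.5, 0.75, 0.95])
for t, coverage, accuracy in rows:
    print(f"threshold={t:.2f} coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Plotting coverage against accuracy from these rows gives the curve described above; an operating point is then chosen according to how costly an abstention is relative to an error.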
Brier Score
The Brier Score is a proper scoring rule that measures the overall accuracy of probabilistic predictions. It evaluates both calibration and refinement (sharpness).
- Formula: For binary classification, Brier Score = Mean Squared Error between predicted probability and the actual outcome (0 or 1).
- Interpretation: Scores range from 0 to 1. A perfect model has a Brier Score of 0. It penalizes both overconfidence (e.g., predicting 0.9 for a wrong answer) and underconfidence (e.g., predicting 0.5 for a clear-cut case).
- Relation to Calibration: A model can have a good Brier Score but poor calibration when strong discrimination (a low refinement term) outweighs a sizable calibration term in the total. Calibration curves provide the diagnostic detail the Brier Score summarizes.
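The split into calibration and refinement terms can be checked numerically. The sketch below groups predictions by their exact forecast value, which is the setting where the two-term identity holds exactly; the function names and data are illustrative.

```python
from collections import defaultdict


def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def brier_decomposition(probs, outcomes):
    """Return (calibration, refinement), grouping samples by distinct forecast value."""
    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        groups[p].append(y)
    n = len(probs)
    calibration = refinement = 0.0
    for p, ys in groups.items():
        freq = sum(ys) / len(ys)  # observed frequency for this forecast value
        weight = len(ys) / n
        calibration += weight * (p - freq) ** 2
        refinement += weight * freq * (1 - freq)
    return calibration, refinement


probs = [0.8] * 5 + [0.2] * 5
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

score = brier_score(probs, outcomes)
calibration, refinement = brier_decomposition(probs, outcomes)
print(round(score, 3), round(calibration, 3), round(refinement, 3))  # 0.22 0.02 0.2
```

In this example the 0.8 forecasts come true only 60% of the time, which accounts for the whole calibration term, while the 0.2 forecasts are exactly calibrated and contribute only to refinement.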
Uncertainty Quantification
Uncertainty Quantification (UQ) is the broader field of measuring and interpreting the doubt a model has in its predictions. Calibration is one aspect of UQ.
- Aleatoric Uncertainty: Irreducible uncertainty inherent in the data (e.g., sensor noise).
- Epistemic Uncertainty: Reducible uncertainty due to model ignorance, often from lack of training data.
- Methods:
- Bayesian Neural Networks: Model weight distributions.
- Ensembles: Use prediction variance across multiple models.
- Monte Carlo Dropout: Perform multiple stochastic forward passes at inference.
- Goal: Provide a well-calibrated uncertainty estimate that an agent can use to gauge trust in its own output and trigger corrective actions.
Self-Critique Mechanism
A self-critique mechanism is a component within an autonomous agent that enables it to generate a critical analysis of its own reasoning or output. Calibration provides the confidence signal that can trigger this critique.
- Process: After generating an initial output, the agent assesses its own confidence. Low or miscalibrated confidence can activate a separate "critic" module.
- Critique Actions: The critic may identify logical flaws, check for hallucinations, verify against retrieved facts, or flag potential biases.
- Integration with Calibration: A calibrated confidence score acts as a meta-cognitive signal, helping the agent decide when to invest compute in self-critique. High-confidence, well-calibrated outputs may bypass costly verification.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.