A reliability diagram is a graphical tool for evaluating the calibration of a probabilistic classifier. It visually compares a model's predicted confidence scores to its actual empirical accuracy. Predictions are grouped into confidence bins (e.g., 0.0-0.1, 0.1-0.2). The diagram plots the average predicted confidence per bin against the observed fraction of correct predictions (empirical accuracy) within that same bin. A perfectly calibrated model's plot follows the diagonal line, where confidence equals accuracy.
Reliability Diagram

What is a Reliability Diagram?
A reliability diagram is a visual diagnostic plot used to assess a classifier's calibration, where predicted confidence scores are binned and plotted against the observed empirical accuracy within each bin.
Significant deviation from the diagonal indicates miscalibration. A plot above the diagonal signifies underconfidence (accuracy exceeds stated confidence), while a plot below signifies overconfidence (confidence exceeds accuracy). The diagram provides an intuitive visual complement to scalar metrics like Expected Calibration Error (ECE). It is a foundational diagnostic in uncertainty quantification, informing the need for post-hoc calibration techniques like Platt scaling or temperature scaling to produce trustworthy confidence scores for downstream decision-making.
Key Characteristics of a Reliability Diagram
The key characteristics of a reliability diagram reveal the nature and degree of miscalibration.
Binning Strategy
The x-axis is constructed by partitioning the confidence range [0, 1] into M bins (e.g., 0.0-0.1, 0.1-0.2 for M=10 equal-width bins). Each prediction is assigned to a bin based on its confidence. The number of bins (e.g., M=10) is a hyperparameter: too few bins oversmooth the calibration curve, while too many leave few samples per bin and make the per-bin estimates noisy. Common strategies are equal-width bins and equal-frequency bins, where each bin holds roughly the same number of samples.
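The two binning strategies can be sketched as follows. This is a minimal illustration using NumPy; the function names are hypothetical, and the half-open interval convention [lo, hi) is an assumption (some implementations use (lo, hi] instead).

```python
import numpy as np

def equal_width_edges(n_bins):
    """Edges of n_bins equally spaced bins covering [0, 1]."""
    return np.linspace(0.0, 1.0, n_bins + 1)

def equal_frequency_edges(confidences, n_bins):
    """Edges placed at quantiles so each bin holds roughly the same number of samples."""
    edges = np.quantile(confidences, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = 0.0, 1.0  # let the outer bins cover the full range
    return edges

def assign_bins(confidences, edges):
    """Return a bin index in [0, n_bins - 1] for every confidence score."""
    # Digitizing against the interior edges gives half-open bins [lo, hi);
    # a confidence of exactly 1.0 falls into the last bin.
    return np.digitize(confidences, edges[1:-1])

confidences = np.array([0.05, 0.15, 0.55, 0.62, 0.91, 0.93, 0.97, 1.0])
width_idx = assign_bins(confidences, equal_width_edges(10))
freq_idx = assign_bins(confidences, equal_frequency_edges(confidences, 4))
```

With equal-width bins, most of these samples pile into the top bin; with equal-frequency bins, the edges move so that every bin is populated, at the cost of bins with unequal widths.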
Empirical Accuracy vs. Confidence
For each bin, two key values are calculated:
- Average Confidence: The mean of the predicted confidence scores for all samples in the bin.
- Empirical Accuracy: The proportion of samples in the bin where the model's predicted class matches the true label.
A perfectly calibrated model will have points where Average Confidence = Empirical Accuracy for every bin, resulting in points lying directly on the diagonal y = x line of the plot.
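As a rough sketch of the per-bin computation, the snippet below derives both values for equal-width bins. The function name `bin_statistics` and the `correct` array (1 where the predicted class matched the true label, 0 otherwise) are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def bin_statistics(confidences, correct, n_bins=10):
    """Return (average confidence, empirical accuracy, count) per equal-width bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(confidences, edges[1:-1])  # bin index per sample

    counts = np.bincount(idx, minlength=n_bins)
    conf_sum = np.bincount(idx, weights=confidences, minlength=n_bins)
    hit_sum = np.bincount(idx, weights=correct, minlength=n_bins)

    # Empty bins have no defined statistics, so they are left as NaN.
    avg_conf = np.full(n_bins, np.nan)
    accuracy = np.full(n_bins, np.nan)
    nonempty = counts > 0
    avg_conf[nonempty] = conf_sum[nonempty] / counts[nonempty]
    accuracy[nonempty] = hit_sum[nonempty] / counts[nonempty]
    return avg_conf, accuracy, counts

confidences = [0.92, 0.95, 0.98, 0.91, 0.65, 0.61]
correct = [1, 1, 0, 1, 1, 0]
avg_conf, accuracy, counts = bin_statistics(confidences, correct)
```

In this toy data the 0.9-1.0 bin has an average confidence of 0.94 but an accuracy of 0.75, so its point sits below the diagonal.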
The Calibration Curve
The central element of the diagram is the calibration curve, formed by plotting the average confidence (x-coordinate) against the empirical accuracy (y-coordinate) for each bin. The shape of this curve relative to the diagonal reveals the type of miscalibration:
- Underconfidence: Curve lies above the diagonal (accuracy > confidence).
- Overconfidence: Curve lies below the diagonal (confidence > accuracy).
- Systematic Bias: A consistent offset from the diagonal across all confidence levels.
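One way to read these categories off the binned data is to look at the sign of accuracy minus confidence in each bin. The sketch below assumes a hypothetical tolerance `tol` under which a bin is treated as calibrated; the threshold value is arbitrary and chosen only for illustration.

```python
import numpy as np

def classify_bins(avg_conf, accuracy, tol=0.02):
    """Label each bin by the sign of (accuracy - confidence)."""
    gap = np.asarray(accuracy, dtype=float) - np.asarray(avg_conf, dtype=float)
    labels = np.where(
        gap > tol, "underconfident",
        np.where(gap < -tol, "overconfident", "calibrated"),
    )
    # A systematic bias shows up as every bin being off in the same direction.
    systematic = bool(np.all(gap > tol) or np.all(gap < -tol))
    return gap, labels, systematic

avg_conf = [0.55, 0.75, 0.95]
accuracy = [0.62, 0.75, 0.80]
gap, labels, systematic = classify_bins(avg_conf, accuracy)
```

Here the lowest bin is underconfident, the middle bin is calibrated, and the highest bin is overconfident, so the deviation is not a systematic offset.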
The Ideal Diagonal
The diagonal line y = x represents the ideal state of perfect calibration. It serves as the visual benchmark. The distance of the calibration curve from this line provides an immediate, intuitive assessment of miscalibration. A model can have high accuracy but still be poorly calibrated if its curve deviates significantly from the diagonal, indicating its confidence scores are not reliable probability estimates.
Histogram of Predictions
Often displayed as a bar chart beneath the main calibration curve, the histogram shows how the model's predictions are distributed across the confidence bins. This reveals where predictions are concentrated and how much weight each point on the calibration curve carries. A model that makes most predictions with very high confidence (e.g., in the 0.9-1.0 bin) but is miscalibrated there is particularly problematic, because the predictions it is most certain about are wrong more often than the stated confidence implies.
Link to Calibration Error Metrics
The diagram provides the visual foundation for scalar calibration error metrics. The Expected Calibration Error (ECE) is computed directly from the binned data shown in the diagram: ECE = Σ_{m=1}^{M} (|B_m| / n) * |acc(B_m) - conf(B_m)|, where B_m is the set of samples in bin m, n is the total number of samples, acc(B_m) and conf(B_m) are the bin's empirical accuracy and average confidence, and the sum runs over all M bins. The Maximum Calibration Error (MCE) is the largest per-bin gap, max_m |acc(B_m) - conf(B_m)|. The diagram makes these abstract metrics concrete by showing exactly where and how the miscalibration occurs.
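The two metrics can be computed in a few lines. This is a minimal sketch assuming equal-width bins and a 0/1 `correct` array; the function name is hypothetical.

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width bins over [0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(confidences, edges[1:-1])

    ece, mce = 0.0, 0.0
    for m in range(n_bins):
        in_bin = idx == m
        if not in_bin.any():
            continue  # empty bins contribute nothing to either metric
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.sum() / n * gap  # weight by the bin's share of samples
        mce = max(mce, gap)
    return float(ece), float(mce)

confidences = [0.92, 0.95, 0.98, 0.91, 0.65, 0.61]
correct = [1, 1, 0, 1, 1, 0]
ece, mce = calibration_errors(confidences, correct)
```

For this toy data the gaps are 0.19 in the 0.9-1.0 bin (four samples) and 0.13 in the 0.6-0.7 bin (two samples), giving ECE = (4 * 0.19 + 2 * 0.13) / 6 = 0.17 and MCE = 0.19.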
Common Miscalibration Patterns in Reliability Diagrams
This table categorizes and describes typical shapes observed in reliability diagrams, which indicate systematic miscalibration in a classifier's confidence scores.
| Pattern Name | Visual Signature | Interpretation | Common Causes | Potential Mitigations |
|---|---|---|---|---|
| Overconfident / Optimistic | Points consistently below the diagonal (y=x) line | Model's predicted confidence is higher than its actual empirical accuracy. | Overfitting, lack of regularization, training with cross-entropy loss without calibration. | Temperature scaling, label smoothing, Platt scaling, train with a proper scoring rule (e.g., Brier score). |
| Underconfident / Pessimistic | Points consistently above the diagonal (y=x) line | Model's predicted confidence is lower than its actual empirical accuracy. | Excessive regularization (e.g., high weight decay), underfitting, using label smoothing with a very high smoothing factor. | Reduce regularization, adjust label smoothing parameter, Platt scaling. |
| Sigmoidal / S-Shaped | Points form an 'S' shape around the diagonal | Model is overconfident at high and low confidence regions but underconfident in mid-range confidences. | Inherent biases in the model architecture or training algorithm that distort probability distributions. | Isotonic regression, Bayesian binning into quantiles (BBQ), more flexible parametric calibration methods. |
| Inverse Sigmoid / Reverse S-Shaped | Points form an inverted 'S' shape around the diagonal | Model is underconfident at high and low confidence regions but overconfident in mid-range confidences. | Less common, but can result from specific dataset artifacts or miscalibrated post-processing. | Isotonic regression, non-parametric calibration. |
| Bimodal / U-Shaped | Points are high at the extremes (near 0.0 and 1.0) and low in the middle, forming a 'U' | Model rarely predicts with mid-level confidence; it tends to be very certain or very uncertain, often incorrectly. | Training on datasets with label noise or ambiguity, models that collapse predictions to extremes. | Collect better-annotated data, use loss functions robust to label noise, adjust the temperature parameter. |
| Systematic Bias / Offset | All points are shifted vertically (consistently above or below) but maintain a roughly linear relationship. | A constant bias is added to all confidence estimates. | A miscalibrated baseline in the model's output layer (logit bias). | Platt scaling (which learns a bias term), recalibrate the output layer. |
| High Variance / Unreliable | Points show large, unsystematic scatter with no clear relationship to the diagonal. | The model's confidence scores are not reliable indicators of accuracy; the calibration is very poor. | Extreme overfitting on a small dataset, very high model capacity without sufficient data, evaluating on out-of-distribution data. | Collect more training data, use model ensembles (reduces variance), apply strong regularization, ensure test distribution matches training. |
Frequently Asked Questions
A reliability diagram is a core diagnostic tool for evaluating the calibration of a probabilistic classifier. These questions address its construction, interpretation, and role in building trustworthy machine learning systems.
A reliability diagram is a visual diagnostic plot used to assess the calibration of a probabilistic classifier by comparing its predicted confidence scores against the observed empirical accuracy.
It works by:
- Binning Predictions: The classifier's predictions are sorted by their predicted confidence (e.g., a score between 0 and 1) and partitioned into a fixed number of bins (e.g., 10 bins of width 0.1).
- Calculating Bin Statistics: For each bin, the average predicted confidence (the mean of the scores in that bin) is computed and plotted on the x-axis.
- Calculating Empirical Accuracy: For each bin, the actual observed accuracy (the fraction of samples where the predicted class was correct) is computed and plotted on the y-axis.
- Plotting and Analysis: The resulting points are plotted. A perfectly calibrated classifier, where confidence equals accuracy, will have all points lying on the diagonal line y = x. Deviations from this diagonal visually represent miscalibration: points above the diagonal indicate underconfidence, while points below indicate overconfidence.
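The steps above can be combined into a single plotting routine. The following is a sketch using NumPy and Matplotlib, not a reference implementation: the function name, the equal-width binning, the figure layout, and the synthetic overconfident data are all assumptions made for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

def plot_reliability_diagram(confidences, correct, n_bins=10):
    """Draw the calibration curve with a confidence histogram underneath."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Steps 1-3: bin the predictions and compute per-bin statistics.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(confidences, edges[1:-1])
    counts = np.bincount(idx, minlength=n_bins)
    keep = counts > 0  # only non-empty bins yield a point
    avg_conf = np.bincount(idx, weights=confidences, minlength=n_bins)[keep] / counts[keep]
    accuracy = np.bincount(idx, weights=correct, minlength=n_bins)[keep] / counts[keep]

    # Step 4: plot the points against the y = x diagonal.
    fig, (ax_curve, ax_hist) = plt.subplots(
        2, 1, sharex=True, figsize=(5, 6), gridspec_kw={"height_ratios": [3, 1]}
    )
    ax_curve.plot([0, 1], [0, 1], linestyle="--", color="gray", label="perfect calibration")
    ax_curve.plot(avg_conf, accuracy, marker="o", label="model")
    ax_curve.set_ylabel("empirical accuracy")
    ax_curve.legend()
    ax_hist.bar(edges[:-1], counts, width=1.0 / n_bins, align="edge", edgecolor="black")
    ax_hist.set_xlabel("predicted confidence")
    ax_hist.set_ylabel("count")
    return fig

# Synthetic overconfident model: true accuracy is 0.1 below stated confidence.
rng = np.random.default_rng(0)
confidences = rng.uniform(0.5, 1.0, size=2000)
correct = rng.uniform(size=2000) < (confidences - 0.1)
fig = plot_reliability_diagram(confidences, correct)
```

Calling `fig.savefig(...)` with a path of your choice writes the diagram to disk; with this synthetic data the model's points should fall below the diagonal.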
Related Terms
A reliability diagram is a core diagnostic tool within the broader field of model calibration and uncertainty quantification. These related concepts are essential for building trustworthy, self-evaluating AI systems.
Calibration Error
Calibration error is the quantitative discrepancy between a model's predicted confidence scores and its actual empirical accuracy. It measures how well a confidence score of X% corresponds to an X% chance of being correct.
- Expected Calibration Error (ECE) is the most common scalar summary, calculated by binning predictions and averaging the absolute difference between average confidence and accuracy per bin.
- Proper scoring rules, like the Brier Score, are loss functions that directly penalize miscalibration during training.
Uncertainty Quantification (UQ)
Uncertainty Quantification (UQ) is the field focused on measuring and interpreting the different types of uncertainty in a model's predictions. A reliability diagram does not separate these types; it checks whether the model's overall predictive confidence matches its observed accuracy.
- Aleatoric Uncertainty is inherent, irreducible noise in the data (e.g., sensor error).
- Epistemic Uncertainty stems from the model's lack of knowledge, reducible with more data.
- Methods like Monte Carlo Dropout and Deep Ensembles are used to estimate these uncertainties.
Post-Hoc Calibration
Post-hoc calibration refers to techniques applied to a trained model's outputs to improve their probabilistic reliability without retraining. These methods are validated using reliability diagrams.
- Platt Scaling fits a logistic regression model to map raw scores (logits) to calibrated probabilities.
- Temperature Scaling is a simpler, single-parameter variant of Platt scaling that adjusts the softmax distribution's sharpness.
- Isotonic Regression is a non-parametric method that learns a piecewise constant calibration map.
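As an illustration of the simplest of these methods, the sketch below fits a single temperature T by minimizing negative log-likelihood on held-out logits. The grid search, the function names, and the synthetic data are assumptions chosen to keep the example dependency-free; a gradient-based optimizer is a common alternative.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    probs = softmax(logits / temperature)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

def fit_temperature(val_logits, val_labels):
    """Pick the temperature that minimizes NLL on held-out data (grid search)."""
    grid = np.linspace(0.05, 10.0, 200)
    losses = [nll(val_logits, val_labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic data: `clean` logits are calibrated by construction, then
# multiplied by 3 to imitate an overconfident model.
rng = np.random.default_rng(0)
n, k = 5000, 3
labels = rng.integers(0, k, size=n)
clean = rng.normal(size=(n, k))
clean[np.arange(n), labels] += 1.0
val_logits = 3.0 * clean

T = fit_temperature(val_logits, labels)
calibrated_probs = softmax(val_logits / T)
```

Because the logits were inflated by a factor of 3, the fitted temperature is expected to land near 3. Dividing by T > 1 softens the softmax without changing the predicted class, which is why accuracy is unaffected.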
Selective Classification
Selective classification, or classification with a rejection option, allows a model to abstain from making a prediction when its confidence is below a threshold. Reliability diagrams inform the setting of this threshold.
- A Risk-Coverage Curve plots the error rate (risk) against the fraction of samples predicted (coverage), defining the accuracy-abstention trade-off.
- This is critical for high-stakes applications where an incorrect but confident prediction is dangerous.
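A risk-coverage curve can be computed by sorting predictions from most to least confident and tracking the running error rate. The sketch below assumes a 0/1 `correct` array; the function name is hypothetical.

```python
import numpy as np

def risk_coverage_curve(confidences, correct):
    """Return (coverage, risk) as the confidence threshold is lowered."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-confidences, kind="stable")  # most confident first
    errors = 1.0 - correct[order]
    kept = np.arange(1, len(order) + 1)   # number of predictions accepted
    coverage = kept / len(order)          # fraction of samples predicted
    risk = np.cumsum(errors) / kept       # error rate among accepted samples
    return coverage, risk

confidences = [0.99, 0.95, 0.90, 0.70, 0.60]
correct = [1, 1, 0, 1, 0]
coverage, risk = risk_coverage_curve(confidences, correct)
```

In this example the model is error-free when it answers only its two most confident samples (coverage 0.4), and the risk rises to 0.4 when it answers all five.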
Proper Scoring Rules
A proper scoring rule is a function that evaluates the quality of a probabilistic forecast and is optimized in expectation when the forecaster reports their true belief. Post-hoc calibration methods typically fit their parameters by minimizing such a score on held-out data.
- Negative Log-Likelihood (NLL) is the standard proper score used for training classification models.
- Brier Score is the mean squared error between the predicted probability vector and the one-hot encoded true label.
- Minimizing these scores during training generally improves, but does not guarantee, calibration.
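Both scores are straightforward to compute from a matrix of predicted probabilities. The sketch below assumes integer class labels and uses the multiclass Brier convention that sums squared errors over classes before averaging over samples; some sources additionally divide by the number of classes.

```python
import numpy as np

def negative_log_likelihood(probs, labels, eps=1e-12):
    """Mean negative log of the probability assigned to the true class."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    true_class_prob = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(true_class_prob + eps)))  # eps avoids log(0)

def brier_score(probs, labels):
    """Squared error between probability vectors and one-hot labels, averaged over samples."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    one_hot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - one_hot) ** 2, axis=1)))

probs = [[0.8, 0.2], [0.4, 0.6], [0.9, 0.1]]
labels = [0, 1, 1]
nll_value = negative_log_likelihood(probs, labels)
brier_value = brier_score(probs, labels)
```

The third sample, a confident wrong prediction, dominates both scores: it contributes 1.62 of the 2.02 total squared error and the largest log-loss term.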
Conformal Prediction
Conformal prediction is a model-agnostic framework that produces prediction sets (not single labels) with guaranteed statistical coverage. It provides a rigorous, distribution-free alternative to confidence scores.
- Assuming the calibration and test data are exchangeable, it guarantees that the true label will be contained in the prediction set at least, say, 95% of the time on average.
- Conformal Quantile Regression extends this to regression tasks.
- Unlike reliability diagrams, which only assess calibration, conformal prediction enforces its coverage guarantee by construction.
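A minimal sketch of split conformal prediction for classification is shown below. It assumes a held-out calibration set, uses 1 minus the probability of the true class as the nonconformity score (one common choice among several), and relies on synthetic data; all function names are hypothetical.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score threshold giving roughly (1 - alpha) marginal coverage."""
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels)
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    k = int(np.ceil((n + 1) * (1 - alpha)))  # finite-sample corrected rank
    if k > n:
        return 1.0  # too few calibration points: include every class
    return float(scores[k - 1])

def prediction_sets(test_probs, threshold):
    """All classes whose nonconformity score is within the threshold."""
    return [np.flatnonzero(1.0 - p <= threshold).tolist() for p in np.asarray(test_probs)]

def make_data(rng, n, k=3):
    """Synthetic softmax outputs from noisy logits (illustration only)."""
    labels = rng.integers(0, k, size=n)
    logits = rng.normal(size=(n, k))
    logits[np.arange(n), labels] += 2.0
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True), labels

rng = np.random.default_rng(0)
cal_probs, cal_labels = make_data(rng, 1000)
test_probs, test_labels = make_data(rng, 2000)

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
sets = prediction_sets(test_probs, threshold)
coverage = np.mean([y in s for y, s in zip(test_labels, sets)])
```

The empirical `coverage` on the test split should come out close to the 90% target, with uncertain inputs receiving larger prediction sets rather than a single overconfident label.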

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.