Inferensys

Reliability Diagram

A reliability diagram is a visual diagnostic plot used to assess a classifier's calibration, where predicted confidence scores are binned and plotted against the observed empirical accuracy within each bin.
CALIBRATION DIAGNOSTIC

What is a Reliability Diagram?

A reliability diagram is a graphical tool for evaluating the calibration of a probabilistic classifier. It visually compares a model's predicted confidence scores to its actual empirical accuracy. Predictions are grouped into confidence bins (e.g., 0.0-0.1, 0.1-0.2). The diagram plots the average predicted confidence per bin against the observed fraction of correct predictions (empirical accuracy) within that same bin. A perfectly calibrated model's plot follows the diagonal line, where confidence equals accuracy.

Significant deviation from the diagonal indicates miscalibration. A plot above the diagonal signifies underconfidence (accuracy exceeds stated confidence), while a plot below signifies overconfidence (confidence exceeds accuracy). The diagram provides an intuitive visual complement to scalar metrics like Expected Calibration Error (ECE). It is a foundational diagnostic in uncertainty quantification, informing the need for post-hoc calibration techniques like Platt scaling or temperature scaling to produce trustworthy confidence scores for downstream decision-making.

VISUAL DIAGNOSTIC

Key Characteristics of a Reliability Diagram

The following characteristics define how a reliability diagram is built and how it reveals the nature and degree of miscalibration.

01

Binning Strategy

The x-axis is constructed by partitioning the range of predicted confidence scores into M bins, and each prediction is assigned to a bin based on its confidence. Two strategies are common. Equal-width binning splits the interval [0, 1] into M equally spaced bins (e.g., 0.0-0.1, 0.1-0.2 for M=10). Equal-frequency binning chooses the bin edges so that each bin holds roughly the same number of samples. The number of bins is a hyperparameter: too few bins oversmooth the calibration curve, while too many leave few samples per bin and make the per-bin accuracy estimates noisy.
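The two binning strategies can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names `equal_width_edges`, `equal_frequency_edges`, and `assign_bin` are hypothetical, the sample scores are made up, and bins are assumed to be half-open ranges with the last bin also accepting a confidence of exactly 1.0.

```python
def equal_width_edges(num_bins):
    """Return num_bins + 1 evenly spaced edges covering [0, 1]."""
    return [i / num_bins for i in range(num_bins + 1)]


def equal_frequency_edges(confidences, num_bins):
    """Return edges so each bin holds roughly the same number of samples.

    Heavily tied scores can still leave bins unbalanced.
    """
    ordered = sorted(confidences)
    n = len(ordered)
    # Interior edges sit at sample-count quantiles of the sorted scores.
    interior = [ordered[(i * n) // num_bins] for i in range(1, num_bins)]
    return [0.0] + interior + [1.0]


def assign_bin(confidence, edges):
    """Index of the half-open bin [lo, hi) that contains the confidence."""
    for i in range(len(edges) - 2):
        if confidence < edges[i + 1]:
            return i
    return len(edges) - 2  # the last bin also takes confidence == 1.0


# Made-up confidence scores, skewed toward the high end.
scores = [0.55, 0.60, 0.70, 0.80, 0.90, 0.95, 0.97, 0.99]
print(equal_width_edges(5))
print(equal_frequency_edges(scores, 4))
```

With skewed scores like these, equal-width bins leave the low-confidence bins empty, while equal-frequency edges crowd toward the high end so that every bin receives samples.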

02

Empirical Accuracy vs. Confidence

For each bin, two key values are calculated:

  • Average Confidence: The mean of the predicted confidence scores for all samples in the bin.
  • Empirical Accuracy: The proportion of samples in the bin where the model's predicted class matches the true label.

A perfectly calibrated model has Average Confidence = Empirical Accuracy in every bin, so every point lies directly on the diagonal line y = x.

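The two per-bin values can be computed with a short helper. The sketch below assumes equal-width bins and takes the confidence of the predicted class plus a flag for whether that prediction was correct; the name `bin_statistics` and the sample data are illustrative only.

```python
def bin_statistics(confidences, correct, num_bins=10):
    """Per-bin (avg_confidence, accuracy, count) using equal-width bins.

    confidences: predicted-class probabilities in [0, 1]
    correct: booleans, True where the predicted class matched the label
    Empty bins are reported as None.
    """
    sums = [0.0] * num_bins
    hits = [0] * num_bins
    counts = [0] * num_bins
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        sums[idx] += conf
        hits[idx] += int(ok)
        counts[idx] += 1
    stats = []
    for m in range(num_bins):
        if counts[m] == 0:
            stats.append(None)
        else:
            stats.append((sums[m] / counts[m], hits[m] / counts[m], counts[m]))
    return stats


# Made-up predictions: four in the 0.9-1.0 bin, two in the 0.6-0.7 bin.
confidences = [0.95, 0.92, 0.97, 0.91, 0.65, 0.62]
correct = [True, True, False, True, True, False]
for m, entry in enumerate(bin_statistics(confidences, correct)):
    if entry is not None:
        print(m, entry)
```

In this toy data the top bin has an average confidence near 0.94 but an accuracy of 0.75, so its point would sit below the diagonal.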
03

The Calibration Curve

The central element of the diagram is the calibration curve, formed by plotting the average confidence (x-coordinate) against the empirical accuracy (y-coordinate) for each bin. The shape of this curve relative to the diagonal reveals the type of miscalibration:

  • Underconfidence: Curve lies above the diagonal (accuracy > confidence).
  • Overconfidence: Curve lies below the diagonal (confidence > accuracy).
  • Systematic Bias: A consistent offset from the diagonal across all confidence levels.
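Reading the curve amounts to checking the sign of accuracy minus confidence at each point. The sketch below labels each bin accordingly; the `tolerance` threshold of 0.05 and the function names are assumptions made for illustration, not standard values.

```python
def classify_bins(points, tolerance=0.05):
    """Label each (avg_confidence, accuracy) point relative to the diagonal."""
    labels = []
    for avg_conf, accuracy in points:
        gap = accuracy - avg_conf
        if gap > tolerance:
            labels.append("underconfident")  # point lies above the diagonal
        elif gap < -tolerance:
            labels.append("overconfident")   # point lies below the diagonal
        else:
            labels.append("roughly calibrated")
    return labels


def has_consistent_offset(points, tolerance=0.05):
    """True when every point is off the diagonal on the same side."""
    labels = set(classify_bins(points, tolerance))
    return labels == {"underconfident"} or labels == {"overconfident"}


# Hypothetical curve points: (average confidence, empirical accuracy).
curve = [(0.55, 0.70), (0.75, 0.76), (0.95, 0.80)]
print(classify_bins(curve))
print(has_consistent_offset(curve))
```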
04

The Ideal Diagonal

The diagonal line y = x represents the ideal state of perfect calibration. It serves as the visual benchmark. The distance of the calibration curve from this line provides an immediate, intuitive assessment of miscalibration. A model can have high accuracy but still be poorly calibrated if its curve deviates significantly from the diagonal, indicating its confidence scores are not reliable probability estimates.

05

Histogram of Predictions

Often displayed as a bar chart beneath the main calibration curve, the histogram shows how the model's predictions are distributed across the confidence bins, revealing where they are concentrated. A model that makes most of its predictions with very high confidence (e.g., in the 0.9-1.0 bin) but is miscalibrated there is particularly problematic, because the bulk of its predictions fall exactly where its confidence is least trustworthy. The histogram also shows which bins contain too few samples for their accuracy estimates to be reliable.
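The counts behind such a histogram are simple to produce. The following sketch tallies predictions per equal-width bin and renders them as text bars; `confidence_histogram` and `render_histogram` are hypothetical helper names, and the sample scores are made up.

```python
def confidence_histogram(confidences, num_bins=10):
    """Count how many predictions fall into each equal-width bin."""
    counts = [0] * num_bins
    for conf in confidences:
        counts[min(int(conf * num_bins), num_bins - 1)] += 1
    return counts


def render_histogram(counts, width=40):
    """Render the counts as text bars scaled to the fullest bin."""
    peak = max(counts) or 1  # avoid dividing by zero when all bins are empty
    num_bins = len(counts)
    lines = []
    for m, count in enumerate(counts):
        lo, hi = m / num_bins, (m + 1) / num_bins
        bar = "#" * round(width * count / peak)
        lines.append(f"{lo:.1f}-{hi:.1f} | {bar} {count}")
    return "\n".join(lines)


# Made-up scores concentrated in the top confidence bin.
scores = [0.95, 0.99, 0.91, 0.55, 0.12]
print(render_histogram(confidence_histogram(scores), width=6))
```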

06

Link to Calibration Error Metrics

The diagram provides the visual foundation for scalar calibration error metrics. The Expected Calibration Error (ECE) is computed directly from the binned data shown in the diagram: ECE = Σ_{m=1}^{M} (|B_m| / n) · |acc(B_m) - conf(B_m)|, where B_m is the set of samples in bin m, |B_m| is its size, n is the total number of samples, and the sum runs over all M bins. The Maximum Calibration Error (MCE) is the largest per-bin gap, max_m |acc(B_m) - conf(B_m)|. The diagram makes these abstract metrics concrete by showing exactly where and how the miscalibration occurs.
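As a worked example of the two metrics, the sketch below evaluates ECE and MCE for three hypothetical non-empty bins. The bin values are invented for illustration, and `calibration_errors` is an assumed helper name.

```python
def calibration_errors(bins):
    """Return (ECE, MCE) from non-empty bins of (avg_confidence, accuracy, count)."""
    n = sum(count for _, _, count in bins)
    gaps = [abs(acc - conf) for conf, acc, _ in bins]
    # ECE weights each bin's gap by its share of the samples.
    ece = sum(count / n * gap for (_, _, count), gap in zip(bins, gaps))
    # MCE is the single worst gap, regardless of bin size.
    mce = max(gaps)
    return ece, mce


# Hypothetical bins: (average confidence, empirical accuracy, sample count).
bins = [(0.65, 0.60, 20), (0.85, 0.80, 30), (0.95, 0.75, 50)]
ece, mce = calibration_errors(bins)
print(f"ECE = {ece:.3f}, MCE = {mce:.3f}")  # ECE = 0.125, MCE = 0.200
```

Here the weighted sum is 0.2 · 0.05 + 0.3 · 0.05 + 0.5 · 0.20 = 0.125, and the largest gap (0.20) comes from the high-confidence bin, which also holds half of the samples.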

DIAGNOSTIC PATTERNS

Common Miscalibration Patterns in Reliability Diagrams

This table categorizes and describes typical shapes observed in reliability diagrams, which indicate systematic miscalibration in a classifier's confidence scores.

| Pattern Name | Visual Signature | Interpretation | Common Causes | Potential Mitigations |
| --- | --- | --- | --- | --- |
| Overconfident / Optimistic | Points consistently below the diagonal (y=x) line | Model's predicted confidence is higher than its actual empirical accuracy. | Overfitting, lack of regularization, training with cross-entropy loss without calibration. | Temperature scaling, label smoothing, Platt scaling, train with a proper scoring rule (e.g., Brier score). |
| Underconfident / Pessimistic | Points consistently above the diagonal (y=x) line | Model's predicted confidence is lower than its actual empirical accuracy. | Excessive regularization (e.g., high weight decay), underfitting, using label smoothing with a very high smoothing factor. | Reduce regularization, adjust label smoothing parameter, Platt scaling. |
| Sigmoidal / S-Shaped | Points form an 'S' shape around the diagonal | Model is overconfident at high and low confidence regions but underconfident in mid-range confidences. | Inherent biases in the model architecture or training algorithm that distort probability distributions. | Isotonic regression, Bayesian binning into quantiles (BBQ), more flexible parametric calibration methods. |
| Inverse Sigmoid / Reverse S-Shaped | Points form an inverted 'S' shape around the diagonal | Model is underconfident at high and low confidence regions but overconfident in mid-range confidences. | Less common, but can result from specific dataset artifacts or miscalibrated post-processing. | Isotonic regression, non-parametric calibration. |
| Bimodal / U-Shaped | Points are high at the extremes (near 0.0 and 1.0) and low in the middle, forming a 'U' | Model rarely predicts with mid-level confidence; it tends to be very certain or very uncertain, often incorrectly. | Training on datasets with label noise or ambiguity, models that collapse predictions to extremes. | Collect better-annotated data, use loss functions robust to label noise, adjust the temperature parameter. |
| Systematic Bias / Offset | All points are shifted vertically (consistently above or below) but maintain a roughly linear relationship. | A constant bias is added to all confidence estimates. | A miscalibrated baseline in the model's output layer (logit bias). | Platt scaling (which learns a bias term), recalibrate the output layer. |
| High Variance / Unreliable | Points show large, unsystematic scatter with no clear relationship to the diagonal. | The model's confidence scores are not reliable indicators of accuracy; the calibration is very poor. | Extreme overfitting on a small dataset, very high model capacity without sufficient data, evaluating on out-of-distribution data. | Collect more training data, use model ensembles (reduces variance), apply strong regularization, ensure test distribution matches training. |
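Temperature scaling, listed above as a mitigation for overconfidence, divides the logits by a single scalar T before the softmax and picks T to minimize negative log-likelihood on held-out data. The sketch below is one simple way to do this under stated assumptions: it uses a coarse grid search over candidate temperatures instead of the gradient-based optimizer that is more typical in practice, and the function names and toy logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(all_logits, labels, temperature):
    """Mean negative log-probability assigned to the true labels."""
    total = 0.0
    for logits, label in zip(all_logits, labels):
        total -= math.log(softmax(logits, temperature)[label])
    return total / len(labels)


def fit_temperature(all_logits, labels, candidates=None):
    """Pick the candidate temperature with the lowest held-out NLL."""
    if candidates is None:
        candidates = [0.5 + 0.05 * i for i in range(91)]  # 0.5 to 5.0
    return min(candidates,
               key=lambda t: negative_log_likelihood(all_logits, labels, t))


# Toy held-out set: every prediction is about 98% confident in class 0,
# but only three of the four labels are actually class 0.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
best_t = fit_temperature(val_logits, val_labels)
print(best_t, softmax([4.0, 0.0], best_t)[0])
```

A fitted temperature above 1 softens the probabilities toward the observed accuracy. Because dividing all logits by the same positive constant does not change their ordering, the predicted class and therefore the accuracy stay the same.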

RELIABILITY DIAGRAM

Frequently Asked Questions

A reliability diagram is a core diagnostic tool for evaluating the calibration of a probabilistic classifier. These questions address its construction, interpretation, and role in building trustworthy machine learning systems.

A reliability diagram is a visual diagnostic plot used to assess the calibration of a probabilistic classifier by comparing its predicted confidence scores against the observed empirical accuracy.

It works by:

  1. Binning Predictions: The classifier's predictions are sorted by their predicted confidence (e.g., a score between 0 and 1) and partitioned into a fixed number of bins (e.g., 10 bins of width 0.1).
  2. Calculating Bin Statistics: For each bin, the average predicted confidence (the mean of the scores in that bin) is computed and plotted on the x-axis.
  3. Calculating Empirical Accuracy: For each bin, the actual observed accuracy (the fraction of samples where the predicted class was correct) is computed and plotted on the y-axis.
  4. Plotting and Analysis: The resulting points are plotted. A perfectly calibrated classifier, where confidence equals accuracy, will have all points lying on the diagonal line y = x. Deviations from this diagonal visually represent miscalibration—points above the diagonal indicate underconfidence, while points below indicate overconfidence.
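The four steps above can be sketched end to end. In the example below, `diagram_points` covers steps 1 to 3 with equal-width bins and `plot_reliability` covers step 4. Both names are hypothetical, the sample predictions are made up, and the plotting function assumes matplotlib is installed (it is imported only when the function is called).

```python
def diagram_points(confidences, correct, num_bins=10):
    """Steps 1-3: bin the predictions, then return (avg_confidence, accuracy)
    for every non-empty bin, ordered from low to high confidence."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * num_bins), num_bins - 1)].append((conf, ok))
    points = []
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            points.append((avg_conf, accuracy))
    return points


def plot_reliability(points, path="reliability.png"):
    """Step 4: draw the points against the y = x diagonal and save the figure."""
    import matplotlib.pyplot as plt  # assumed to be installed

    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, 1], linestyle="--", label="Perfect calibration")
    ax.plot(xs, ys, marker="o", label="Model")
    ax.set_xlabel("Average confidence")
    ax.set_ylabel("Empirical accuracy")
    ax.legend()
    fig.savefig(path)


# Made-up predictions: confidence of the predicted class and correctness.
confidences = [0.91, 0.95, 0.62, 0.68]
correct = [True, False, True, True]
points = diagram_points(confidences, correct)
print(points)
```

Calling `plot_reliability(points)` would then write the diagram to `reliability.png`. In this toy data, the lower-confidence bin sits above the diagonal and the high-confidence bin sits below it.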
Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.