Inferensys

Glossary

Temperature Scaling

Temperature scaling is a simple, single-parameter post-hoc calibration technique that divides a model's logits by a learned scalar 'temperature' to adjust the sharpness of the output softmax distribution.
CONFIDENCE SCORING FOR OUTPUTS

What is Temperature Scaling?

Temperature scaling is a post-hoc calibration technique that adjusts a neural network's output probabilities to better reflect its true accuracy.

Temperature scaling is a single-parameter, post-processing method used to calibrate the confidence estimates of a trained neural network classifier. It works by dividing the model's raw output logits by a learned positive scalar T (the 'temperature') before applying the softmax function, which sharpens or softens the resulting probability distribution. A temperature T > 1 flattens the distribution, increasing entropy and reducing overconfidence, while T < 1 sharpens it; T = 1 leaves the probabilities unchanged. The optimal temperature is typically found by minimizing the negative log-likelihood (NLL) on a held-out validation set.
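
The effect of T on the softmax output can be illustrated with a minimal sketch in plain Python. The logits and the function names (`softmax_with_temperature`, `entropy`) are hypothetical and chosen only for illustration; they are not part of any particular library.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 1.0, 0.5]  # hypothetical logits for a 3-class problem
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: max prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

As T grows, the top probability shrinks and the entropy rises; as T falls below 1, the opposite happens.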

This technique directly addresses miscalibration, where a model's predicted confidence scores do not match its empirical accuracy, a common issue in modern deep networks. Compared with more flexible methods like Platt scaling or heavier approaches like training Bayesian neural networks, temperature scaling is remarkably simple, preserves the model's predicted class ranking (and therefore its accuracy), and requires minimal computational overhead. It is a foundational tool in uncertainty quantification (UQ) pipelines, enabling more reliable selective classification and downstream decision-making based on model confidence.

POST-HOC CALIBRATION

Key Characteristics of Temperature Scaling

Temperature scaling is a single-parameter technique applied after a model is trained to adjust the sharpness of its output probability distribution, improving the alignment between predicted confidence and empirical accuracy.

01

Mathematical Foundation

Temperature scaling operates by dividing the logits (the raw, unnormalized outputs from a model's final layer) by a learned scalar parameter T (temperature) before applying the softmax function.

  • Formula: softmax(z_i / T) where z_i are the logits.
  • Effect: When T > 1, the output probability distribution becomes 'softer' (more uniform), reducing overconfidence. When T < 1, the distribution becomes 'sharper' (more peaked), increasing confidence. T = 1 recovers the original softmax.
  • The optimal temperature T* is found by minimizing negative log-likelihood (NLL) on a separate validation set, distinct from the training set.
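
Written out, the formula and the objective above take the following form. The symbols K (number of classes), N (number of validation examples), z^{(n)} (logits of example n), and y_n (its true label) are notation assumed here for illustration.

```latex
\hat{p}_i(T) = \frac{\exp(z_i / T)}{\sum_{j=1}^{K} \exp(z_j / T)},
\qquad
T^{*} = \arg\min_{T > 0} \; -\frac{1}{N} \sum_{n=1}^{N} \log \hat{p}^{(n)}_{y_n}(T)
```

Here \hat{p}^{(n)}(T) denotes the temperature-scaled softmax computed from z^{(n)}, so the objective is the average NLL of the true labels on the validation set.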
02

Single-Parameter Simplicity

Its primary advantage is extreme simplicity. Unlike other calibration methods, it introduces only one global parameter to tune.

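  • Efficiency: Fitting T is a one-dimensional optimization on the validation set. The NLL is convex in the inverse temperature 1/T, so there are no spurious local minima, and it is typically solved quickly with gradient descent or a simple line search.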
  • Preservation of Accuracy: Because it applies a monotonic transformation to the logits, it does not change the model's predicted class ranking. The argmax of the probabilities remains unchanged, meaning classification accuracy is preserved while calibration is improved.
  • This makes it a highly efficient, low-risk first step in any calibration pipeline.
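
The accuracy-preservation property can be checked empirically. The sketch below draws random logits and random positive temperatures (both arbitrary choices made for illustration) and confirms that the argmax of the scaled probabilities matches the argmax of the raw logits. The helper names are hypothetical.

```python
import math
import random

def softmax(logits, T=1.0):
    """Numerically stable softmax(logits / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value (first index on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def ranking_preserved(trials=1000, n_classes=5, seed=0):
    """Return True if the predicted class never changed under scaling."""
    rng = random.Random(seed)
    for _ in range(trials):
        logits = [rng.gauss(0.0, 3.0) for _ in range(n_classes)]
        T = rng.uniform(0.1, 10.0)  # any positive temperature
        if argmax(softmax(logits, T)) != argmax(logits):
            return False
    return True

print(ranking_preserved())
```

Dividing by a positive T never reorders the logits, which is why the check is expected to pass for every trial.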
03

Impact on Confidence & Calibration

The core goal is to correct miscalibration, where a model's predicted confidence does not match its true likelihood of being correct (e.g., predicting class A with 90% confidence but being right only 70% of the time).

  • Corrects Overconfidence: Modern neural networks, especially large ones, are frequently overconfident. A fitted temperature T > 1 (the common case) systematically scales down high confidences.
  • Measured by ECE: Improvement is quantified by metrics like Expected Calibration Error (ECE), which bins predictions by confidence and measures the gap between average confidence and accuracy within each bin. Temperature scaling does not optimize ECE directly; it minimizes NLL, which typically reduces this gap as well.
  • It primarily addresses confidence sharpness, not underlying model uncertainty.
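
A minimal sketch of ECE is shown below, assuming equal-width confidence bins and top-class confidences as inputs. The function name and the choice of 10 bins are illustrative assumptions; binning schemes vary between implementations.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - average confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# The example from the text: 90% confidence, correct only 70% of the time.
confidences = [0.9] * 10
correct = [True] * 7 + [False] * 3
print(round(expected_calibration_error(confidences, correct), 3))  # 0.2
```

All ten predictions fall into one bin, so the ECE equals the single gap between 0.9 confidence and 0.7 accuracy.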
04

Limitations and Scope

While powerful for its simplicity, temperature scaling has defined boundaries.

  • Cannot Fix All Miscalibration: It applies the same correction to every class and every confidence level, so it implicitly assumes the miscalibration is uniform. It cannot correct class-specific miscalibration patterns that more flexible methods, such as Platt scaling fitted one-vs-rest (a logistic regression per class), might address.
  • Does Not Improve Accuracy: It is a post-hoc method. It cannot improve the model's fundamental discriminative power or correct systematic errors in its predictions.
  • Not a Source of New Uncertainty Estimates: It rescales the existing probabilities but does not produce separate measures of epistemic uncertainty (model uncertainty), as Bayesian Neural Networks or Deep Ensembles do.
05

Relationship to Other Techniques

Temperature scaling is a foundational block within a broader calibration and uncertainty toolkit.

  • Vs. Platt Scaling: Platt scaling fits a logistic regression to the logits, which offers more flexibility (slope & intercept) but risks overfitting on small validation sets. Temperature scaling is more constrained and more stable.
  • Complement to Ensembles: It is often applied to each member of a Deep Ensemble before averaging, calibrating the individual models first.
  • Precursor to Selective Classification: A well-calibrated confidence score from temperature scaling is crucial for selective classification (rejection option), where the model abstains on low-confidence inputs.
  • Used with Label Smoothing: Models trained with label smoothing (a regularization technique) often already produce better-calibrated logits, which temperature scaling can further refine.
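
The selective classification use mentioned above can be sketched as a simple confidence threshold applied to temperature-scaled probabilities. The temperature values, the threshold of 0.6, and the function name `predict_or_abstain` are hypothetical; in practice T would come from validation data and the threshold from the application's risk tolerance.

```python
import math

def softmax(logits, T=1.0):
    """Numerically stable softmax(logits / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def predict_or_abstain(logits, T, threshold):
    """Return (class_index, confidence), or (None, confidence) when abstaining."""
    probs = softmax(logits, T)
    best = max(range(len(probs)), key=lambda i: probs[i])
    confidence = probs[best]
    return (best if confidence >= threshold else None), confidence

logits = [2.0, 1.0, 0.1]  # hypothetical logits
print(predict_or_abstain(logits, T=1.0, threshold=0.6))  # uncalibrated: predicts
print(predict_or_abstain(logits, T=2.0, threshold=0.6))  # softened: abstains
```

With T = 2 the top probability drops below the threshold, so the same input that was accepted at T = 1 is now rejected.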
06

Practical Implementation Steps

Implementing temperature scaling involves a straightforward, three-step process.

  1. Train Model: Train your classifier as usual on the training set.
  2. Learn Temperature: On a held-out validation set (not used for training):
    • Collect the model's logits and the true labels.
    • Optimize the scalar temperature parameter T by minimizing the Negative Log-Likelihood (NLL) loss. This is typically a one-dimensional optimization problem.
  3. Apply at Inference: During deployment, divide all output logits by the learned T before applying the softmax function to obtain calibrated probabilities.
  • Critical Note: The test set must be completely unseen during both training and temperature learning to obtain an unbiased evaluation of calibration performance.
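
Steps 2 and 3 above can be sketched in plain Python. The validation data here is synthetic: labels are sampled from a "true" distribution, and the model's logits are inflated by a factor of 3 to simulate overconfidence. All function names are hypothetical. The golden-section search over log T is one possible one-dimensional optimizer, not a prescribed method.

```python
import math
import random

def log_prob_of_label(logits, label, T):
    """Log-probability of `label` under softmax(logits / T), computed stably."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return scaled[label] - log_norm

def mean_nll(all_logits, labels, T):
    """Average negative log-likelihood of the true labels at temperature T."""
    total = sum(log_prob_of_label(z, y, T) for z, y in zip(all_logits, labels))
    return -total / len(labels)

def fit_temperature(all_logits, labels, lo=0.05, hi=20.0, iters=60):
    """Step 2: golden-section search over log T within [lo, hi] (assumed bounds)."""
    phi = (math.sqrt(5) - 1) / 2
    a, b = math.log(lo), math.log(hi)
    c, d = b - phi * (b - a), a + phi * (b - a)
    fc = mean_nll(all_logits, labels, math.exp(c))
    fd = mean_nll(all_logits, labels, math.exp(d))
    for _ in range(iters):
        if fc < fd:  # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - phi * (b - a)
            fc = mean_nll(all_logits, labels, math.exp(c))
        else:        # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + phi * (b - a)
            fd = mean_nll(all_logits, labels, math.exp(d))
    return math.exp((a + b) / 2)

def calibrated_probs(logits, T):
    """Step 3: divide logits by the learned T, then apply softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def make_overconfident_validation_set(n=2000, n_classes=4, inflation=3.0, seed=0):
    """Synthetic stand-in for held-out logits and labels (illustration only)."""
    rng = random.Random(seed)
    all_logits, labels = [], []
    for _ in range(n):
        true_logits = [rng.gauss(0.0, 1.5) for _ in range(n_classes)]
        m = max(true_logits)
        weights = [math.exp(z - m) for z in true_logits]
        labels.append(rng.choices(range(n_classes), weights=weights)[0])
        all_logits.append([inflation * z for z in true_logits])  # overconfident
    return all_logits, labels

val_logits, val_labels = make_overconfident_validation_set()
T_star = fit_temperature(val_logits, val_labels)
print(f"learned T = {T_star:.2f}")
print(f"NLL at T=1: {mean_nll(val_logits, val_labels, 1.0):.3f}")
print(f"NLL at T*:  {mean_nll(val_logits, val_labels, T_star):.3f}")
```

On this synthetic data the fitted temperature should land near the inflation factor of 3, and the validation NLL should drop relative to T = 1. A real evaluation would then measure calibration on a separate test set, as the note above requires.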
POST-HOC CALIBRATION COMPARISON

Temperature Scaling vs. Other Calibration Methods

A comparison of common techniques used to adjust a model's predicted probabilities to better reflect its true empirical accuracy.

| Method / Feature | Temperature Scaling | Platt Scaling | Isotonic Regression | Bayesian Methods (e.g., MC Dropout, Ensembles) |
|---|---|---|---|---|
| Core Principle | Applies a single scalar divisor to all logits. | Fits a logistic regression to the model's scores. | Fits a non-decreasing (isotonic) function to map scores to probabilities. | Treats model parameters as distributions; estimates uncertainty via inference. |
| Parameters Learned | 1 (temperature, T) | 2 (slope & intercept) | Many (piecewise constant function) | Many (full approximate posterior over weights) |
| Computational Overhead | Very Low (< 1 sec) | Low (~1-5 sec) | Medium (~5-30 sec) | Very High (10-100x inference cost) |
| Data Requirements | Small validation set (~1k samples) | Small validation set (~1k samples) | Larger validation set (>5k samples) | Training/validation set; multiple forward passes |
| Calibration Guarantees | Improves calibration but not optimal for all distributions. | Optimal for binary classification under specific score distributions. | Non-parametric; can model any monotonic distortion. | Provides principled uncertainty estimates with theoretical guarantees. |
| Handles Multi-Class | | | | |
| Preserves Accuracy | | | | |
| Primary Use Case | Fast, simple calibration for modern neural networks. | Binary classification with SVMs or other scoring classifiers. | Complex, non-linear miscalibration patterns. | Safety-critical applications requiring full uncertainty decomposition. |
| Outputs Calibrated Probabilities | | | | |
| Estimates Epistemic Uncertainty | | | | |

TEMPERATURE SCALING

Frequently Asked Questions

Temperature scaling is a foundational technique in machine learning for calibrating a model's confidence scores. These questions address its core mechanics, applications, and relationship to broader concepts in uncertainty quantification.

Temperature scaling is a post-hoc calibration technique that adjusts a neural network's output softmax distribution by dividing the logits (the model's raw, pre-softmax scores) by a single learned scalar parameter called the temperature (T). The operation is defined as: softmax(logits / T). A temperature T > 1 softens the distribution, making it less confident (more uniform), while T < 1 sharpens it, making it more confident. The optimal temperature is learned on a separate validation set by minimizing a proper scoring rule like negative log-likelihood (NLL). It does not change the model's predicted class ranking (argmax), only the confidence probabilities assigned to each class.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.