Negative Log-Likelihood (NLL) is a proper scoring rule that measures the quality of a model's probabilistic predictions by penalizing low probability assigned to the correct class or outcome. It is calculated as the negative logarithm of the likelihood function, which is the probability the model assigns to the observed data. For a perfect, perfectly confident prediction, NLL is zero; it increases as the model's predicted probability for the true label decreases, with the penalty growing sharply for high-confidence errors. This mathematical property makes it a strictly proper scoring rule, incentivizing the model to output its true, well-calibrated confidence.
Negative Log-Likelihood (NLL)

What is Negative Log-Likelihood (NLL)?
Negative Log-Likelihood (NLL) is a fundamental loss function and evaluation metric in machine learning that quantifies the quality of a model's probabilistic predictions.
In practice, NLL serves as the standard loss function for training classification models like neural networks, where it is equivalent to cross-entropy loss with one-hot labels. Minimizing it directly optimizes the model's parameters to maximize the likelihood of the training data. Beyond training, NLL is a core metric for evaluating probabilistic predictions; a lower NLL on a held-out test set indicates the model's confidence scores are more reliable. In evaluation settings it is often called log loss, and it is related to Bayesian model selection, where the log marginal likelihood plays an analogous role.
Key Properties of NLL
Negative Log-Likelihood (NLL) is a fundamental loss function and evaluation metric. Its mathematical properties make it uniquely suited for training and assessing calibrated probabilistic models.
Proper Scoring Rule
NLL is a proper scoring rule, meaning its expected value is minimized when a model reports the true underlying probability distribution. This property incentivizes honest, well-calibrated predictions, as a model cannot achieve a lower expected loss by being overconfident or underconfident. It is the standard loss function for training classification models like logistic regression and neural networks with a softmax output layer.
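The property can be checked numerically for a binary outcome. The sketch below is illustrative only: the true probability of 0.7 and the helper name `expected_nll` are assumptions, not part of the original text. It scans a grid of reported probabilities and finds the one with the lowest expected NLL.

```python
import math

def expected_nll(true_p: float, reported_q: float) -> float:
    """Expected NLL when the outcome is 1 with probability true_p
    and the model reports probability reported_q for outcome 1."""
    return -(true_p * math.log(reported_q)
             + (1.0 - true_p) * math.log(1.0 - reported_q))

true_p = 0.7  # hypothetical true probability of the positive class
candidates = [i / 100 for i in range(1, 100)]  # reported probabilities 0.01 .. 0.99

# The reported probability with the lowest expected loss
best_q = min(candidates, key=lambda q: expected_nll(true_p, q))
print(best_q)
```

On this grid the minimizer coincides with the true probability: reporting anything higher (overconfident) or lower (underconfident) raises the expected loss.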
Decomposition into Calibration & Refinement
The NLL can be conceptually decomposed into two components: calibration loss and refinement loss.
- Calibration Loss: Measures how closely the predicted probabilities match the empirical frequencies of outcomes.
- Refinement Loss (or Sharpness): Measures the concentration of the predictive distributions; a model that makes decisive (high-confidence) correct predictions has good refinement.
A perfect model minimizes both, achieving low NLL through accurate and confident predictions.
Information-Theoretic Interpretation
NLL has a direct interpretation from information theory: it measures the cross-entropy between the true data distribution and the model's predicted distribution. In essence, it quantifies the average number of nats (or bits, if using log base 2) required to encode the true labels using the model's probability distribution. A lower NLL indicates the model's distribution is more efficient at describing the data.
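As a small illustration of the units involved, the following sketch converts a per-sample NLL from nats to bits. The function names and the example probability of 0.25 are assumptions chosen for the example.

```python
import math

def nll_nats(p_true: float) -> float:
    """NLL of one observation in nats (natural logarithm)."""
    return -math.log(p_true)

def nats_to_bits(nats: float) -> float:
    """Convert nats to bits by dividing by ln(2)."""
    return nats / math.log(2.0)

p = 0.25                      # hypothetical probability assigned to the true label
nats = nll_nats(p)            # about 1.386 nats
bits = nats_to_bits(nats)     # 2 bits: the cost of encoding one of four equally likely symbols
print(nats, bits)
```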
Sensitivity to Probability Extremes
Due to the logarithm, NLL heavily penalizes high-confidence errors. If a model assigns a probability near 0.0 to the correct class, the negative log of that tiny probability becomes a very large positive loss. This characteristic makes NLL an excellent tool for detecting and punishing overconfident miscalibration, which is critical for safety in high-stakes applications.
Comparison to Brier Score
While both NLL and the Brier Score are proper scoring rules, they emphasize different aspects of probabilistic predictions.
- NLL: Uses a logarithmic penalty, making it more sensitive to errors in predicted probabilities, especially near 0 or 1.
- Brier Score: Uses a squared-error penalty, making it more sensitive to changes in the middle of the probability range (e.g., from 0.5 to 0.6).
NLL is generally preferred for model training and comparison in classification, while the Brier Score is often used for evaluation and diagnostics due to its simpler decomposition.
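The difference in penalty shape can be seen by evaluating both scores on a single binary prediction as the probability assigned to the true class shrinks. This is a minimal sketch; the helper names and the probabilities in the loop are illustrative assumptions.

```python
import math

def nll(p_true: float) -> float:
    """Per-sample NLL given the probability assigned to the true class."""
    return -math.log(p_true)

def brier_binary(p_true: float) -> float:
    """Two-class Brier score for one sample: squared error summed over both classes."""
    p_other = 1.0 - p_true
    return (1.0 - p_true) ** 2 + (0.0 - p_other) ** 2

for p in (0.9, 0.5, 0.1, 0.01, 0.001):
    print(f"p_true={p:<6} NLL={nll(p):7.3f}  Brier={brier_binary(p):.3f}")
```

The Brier score for a sample stays bounded (at most 2 in this two-class form), while the NLL keeps growing as the probability on the true class approaches zero.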
Role in Calibration Assessment
Although NLL itself is a single number, tracking its value on a held-out validation set is a primary method for tuning calibration techniques like temperature scaling. The temperature parameter that minimizes the NLL on the calibration set is typically optimal. However, a low NLL does not guarantee perfect calibration on its own; it must be used alongside diagnostic tools like reliability diagrams and Expected Calibration Error (ECE) for a complete assessment.
NLL vs. Other Loss Functions
A feature and application comparison of Negative Log-Likelihood (NLL) with other common loss functions used in machine learning, highlighting their suitability for different model types and calibration objectives.
| Feature / Metric | Negative Log-Likelihood (NLL) | Cross-Entropy Loss | Mean Squared Error (MSE) | Focal Loss |
|---|---|---|---|---|
| Primary Use Case | Training & evaluating probabilistic classifiers | Training multi-class classifiers | Regression tasks, linear models | Training on imbalanced datasets |
| Outputs Calibrated Probabilities | | | | |
| Proper Scoring Rule | | | | |
| Directly Penalizes Overconfidence | | | | |
| Common Base for Post-Hoc Calibration (e.g., Temp Scaling) | | | | |
| Handles Class Imbalance Natively | | | | |
| Differentiable | | | | |
| Interpretation of Value | Negative log-likelihood of true labels | Divergence from true distribution | Average squared error | Weighted cross-entropy |
| Typical Model Architecture | Final softmax layer | Final softmax layer | Linear output layer | Final softmax layer |
Practical Applications of NLL
Negative Log-Likelihood (NLL) is a fundamental loss function and evaluation metric with critical applications across the machine learning lifecycle, from training to production monitoring.
Training Probabilistic Classifiers
NLL is the standard loss function for training models that output probability distributions, such as those using a final softmax layer. It directly penalizes the model for assigning low probability to the correct class, encouraging well-calibrated confidence scores.
- Core Mechanism: For a correct class with predicted probability \(p\), the loss is \(-\log(p)\). As \(p\) approaches 1, loss approaches 0; as \(p\) approaches 0, loss grows without bound.
- Example: In a 10-class image classification task, if the model assigns a probability of 0.9 to the correct 'cat' class, the NLL contribution for that sample is \(-\log(0.9) \approx 0.105\). If it instead assigns only 0.1 to the correct class, the loss is \(-\log(0.1) \approx 2.303\).
- Impact: Minimizing NLL during training inherently optimizes for both accuracy (by rewarding high \(p\) for correct classes) and calibration (by discouraging overconfident wrong predictions).
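The per-sample loss described above is usually computed from raw logits through a numerically stable log-softmax. The following is a minimal pure-Python sketch; `log_softmax`, `mean_nll`, and the toy logits are illustrative names and values, not a specific library's API.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]

def mean_nll(batch_logits, labels):
    """Average negative log-probability assigned to the true label."""
    total = 0.0
    for logits, y in zip(batch_logits, labels):
        total += -log_softmax(logits)[y]
    return total / len(labels)

# Logits chosen so the softmax gives 0.9 / 0.1 for the two classes
logits = [math.log(0.9), math.log(0.1)]
confident_correct = mean_nll([logits], [0])   # about 0.105
confident_wrong = mean_nll([logits], [1])     # about 2.303
print(confident_correct, confident_wrong)
```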
Benchmarking Model Calibration
As a proper scoring rule, NLL serves as a primary quantitative metric for evaluating the probabilistic predictions of a trained model on a held-out validation or test set. Lower NLL indicates better predictions overall, reflecting both calibration and sharpness rather than calibration alone.
- Comparison to Accuracy: Accuracy measures how often the top prediction is correct but ignores confidence. NLL evaluates the entire predicted distribution.
- Diagnostic Use: A model with high accuracy but a high (poor) NLL score is likely overconfident: it is frequently correct, but the errors it does make carry unjustifiably high certainty, which is risky for downstream decision-making.
- Standard Practice: In calibration research, NLL is reported alongside metrics like Expected Calibration Error (ECE) and Brier Score to provide a comprehensive view of predictive quality.
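The gap between accuracy and NLL can be illustrated with two hypothetical binary classifiers that make the same top-1 decisions but with different confidence. All numbers and names below are assumptions made up for the example.

```python
import math

def accuracy(p_true_list):
    """Binary case: the prediction is correct when the true class gets more than 0.5."""
    return sum(1 for p in p_true_list if p > 0.5) / len(p_true_list)

def mean_nll(p_true_list):
    """Average NLL given the probability each prediction assigned to the true class."""
    return sum(-math.log(p) for p in p_true_list) / len(p_true_list)

# Probability assigned to the TRUE class on four samples
moderate_model = [0.8, 0.8, 0.8, 0.4]          # one mild error
overconfident_model = [0.99, 0.99, 0.99, 0.01]  # one very confident error

print(accuracy(moderate_model), mean_nll(moderate_model))
print(accuracy(overconfident_model), mean_nll(overconfident_model))
```

Both models reach the same accuracy, but the single confident error dominates the second model's NLL.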
Comparing Post-Hoc Calibration Methods
NLL is the objective commonly used to fit parametric post-hoc calibration techniques such as Temperature Scaling and Platt Scaling, and a common criterion for comparing them against non-parametric methods such as Isotonic Regression.
- Calibration Set Optimization: A held-out calibration set (not used for training) is passed through the base model. The parameters of the calibration transform (e.g., the temperature scalar T) are optimized by minimizing the NLL on this set.
- Method Selection: The performance of different calibration algorithms is compared by evaluating the NLL on a separate validation set after applying each fitted method. The technique yielding the lowest NLL is typically preferred.
- Example: For Temperature Scaling, the single parameter T is tuned via gradient descent to minimize NLL, effectively 'softening' (T > 1) or 'sharpening' (T < 1) the model's output distribution.
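A minimal sketch of fitting the temperature on a calibration set follows. For simplicity it swaps gradient descent for a plain grid search over T, which is an assumption of this example; the function names, the grid range, and the toy logits are also illustrative.

```python
import math

def nll_at_temperature(batch_logits, labels, temperature):
    """Mean NLL after dividing every logit by the temperature."""
    total = 0.0
    for logits, y in zip(batch_logits, labels):
        scaled = [z / temperature for z in logits]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

def fit_temperature(batch_logits, labels, grid=None):
    """Pick the temperature on a grid that minimizes calibration-set NLL."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # 0.05 .. 10.0 (assumed search range)
    return min(grid, key=lambda t: nll_at_temperature(batch_logits, labels, t))

# Toy calibration set: the model is very confident in class 0 but right only 3 times out of 4
calib_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
calib_labels = [0, 0, 0, 1]

T = fit_temperature(calib_logits, calib_labels)
print(T, nll_at_temperature(calib_logits, calib_labels, 1.0),
      nll_at_temperature(calib_logits, calib_labels, T))
```

Because the toy model is overconfident, the fitted temperature comes out above 1, softening the distribution until its confidence is close to the 75% empirical accuracy.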
Evaluating Generative Language Models
For autoregressive Large Language Models (LLMs), NLL (often reported as perplexity, which is the exponential of the average NLL) is a core metric for evaluating language modeling proficiency and comparing model architectures.
- Mechanism: The model predicts the next token in a sequence. The NLL is computed over the entire sequence as the sum of losses for each token given its preceding context.
- Perplexity: Perplexity = \(\exp(\text{average NLL})\). A lower perplexity indicates the model is less 'surprised' by the text and assigns higher probability to natural language sequences.
- Application: Used to benchmark foundational language understanding, select optimal model checkpoints during training, and evaluate the impact of different training data or architectural choices.
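The relationship between sequence NLL and perplexity can be sketched in a few lines. The per-token log-probabilities below are made-up values standing in for a language model's outputs, and the function names are illustrative.

```python
import math

def sequence_nll(token_log_probs):
    """Total NLL of a sequence: negated sum of per-token log-probabilities."""
    return -sum(token_log_probs)

def perplexity(token_log_probs):
    """Exponential of the average per-token NLL."""
    return math.exp(sequence_nll(token_log_probs) / len(token_log_probs))

# Hypothetical model that assigns probability 0.25 to every observed token
token_log_probs = [math.log(0.25)] * 8
ppl = perplexity(token_log_probs)
print(sequence_nll(token_log_probs), ppl)
```

A model that always assigns 0.25 to the observed token has a perplexity of 4, which matches the intuition of choosing uniformly among four options at each step.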
Production Monitoring for Calibration Drift
In MLOps pipelines, tracking the NLL of production model inferences over time is a key signal for detecting calibration drift—a degradation in the model's confidence reliability due to data distribution shifts.
- Monitoring SLO: A sustained increase in production NLL, even if accuracy remains stable, signals that the model's confidence scores are becoming less trustworthy, which can undermine automated decision systems.
- Trigger for Retraining/Recalibration: An upward trend in NLL can serve as a trigger to collect new calibration data and reapply post-hoc calibration or to initiate full model retraining.
- Example: A fraud detection model may maintain high accuracy but see its NLL rise as fraud patterns evolve, indicating it is becoming overconfident in its (still correct) predictions, masking increased uncertainty.
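One way such monitoring might look is a rolling-window average of per-prediction NLL compared against a baseline. The sketch below is a hypothetical design, not a standard tool: the class name, window size, baseline, and tolerance are all assumptions, and it presumes ground-truth labels eventually arrive for production predictions.

```python
import math
from collections import deque

class RollingNLLMonitor:
    """Tracks mean NLL over the most recent labelled predictions."""

    def __init__(self, window_size, baseline_nll, tolerance):
        self.window = deque(maxlen=window_size)
        self.baseline_nll = baseline_nll   # e.g. NLL measured at deployment time
        self.tolerance = tolerance         # allowed increase before flagging drift

    def update(self, p_true):
        """Record one prediction given the probability it assigned to the true label."""
        self.window.append(-math.log(max(p_true, 1e-12)))  # clip to avoid log(0)
        return self.current_nll()

    def current_nll(self):
        return sum(self.window) / len(self.window)

    def drifted(self):
        """Flag drift only once the window is full and NLL exceeds baseline + tolerance."""
        is_full = len(self.window) == self.window.maxlen
        return is_full and self.current_nll() > self.baseline_nll + self.tolerance

monitor = RollingNLLMonitor(window_size=3, baseline_nll=0.2, tolerance=0.1)
for p in (0.9, 0.9, 0.9):
    monitor.update(p)
before = monitor.drifted()
for p in (0.5, 0.5, 0.5):
    monitor.update(p)
after = monitor.drifted()
print(before, after)
```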
Informing Bayesian Deep Learning
In Bayesian neural networks, which maintain a probability distribution over their weights rather than a single point estimate, the NLL is combined with a KL divergence term to form the Evidence Lower Bound (ELBO), the objective function for variational inference.
- Role in ELBO: The ELBO = Expected Log-Likelihood (negative NLL) - KL( approximate posterior || prior ). Maximizing the ELBO trains the network to explain the data well (high likelihood) while keeping its parameter distribution close to a prior.
- Uncertainty Quantification: The resulting model provides predictive uncertainty. NLL evaluated on this Bayesian model assesses how well its uncertainty estimates explain the observed data.
- Application: Critical in safety-sensitive domains like medical diagnosis or autonomous driving, where understanding model uncertainty is as important as the prediction itself.
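Written out, the ELBO relationship described above takes the following form. The notation is an assumption introduced for illustration: \(q_\phi(w)\) is the approximate posterior over weights \(w\), \(p(w)\) is the prior, and \(\mathcal{D}\) is the training data.

```latex
\mathrm{ELBO}(\phi)
  = \mathbb{E}_{q_\phi(w)}\!\left[\log p(\mathcal{D} \mid w)\right]
  - \mathrm{KL}\!\left(q_\phi(w) \,\|\, p(w)\right),
\qquad
-\mathrm{ELBO}(\phi)
  = \underbrace{\mathbb{E}_{q_\phi(w)}\!\left[-\log p(\mathcal{D} \mid w)\right]}_{\text{expected NLL}}
  + \mathrm{KL}\!\left(q_\phi(w) \,\|\, p(w)\right)
```

Minimizing the negative ELBO is therefore the same as minimizing the expected NLL plus a KL penalty that keeps the weight distribution close to the prior.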
Frequently Asked Questions
Negative Log-Likelihood (NLL) is a cornerstone metric for evaluating and training probabilistic models. These questions address its core mechanics, applications, and relationship to other key concepts in machine learning.
Negative Log-Likelihood (NLL) is a proper scoring rule that quantifies the quality of a model's probabilistic predictions by penalizing low probability assigned to the correct outcome. It works by taking the natural logarithm of the predicted probability for the true label and then negating it: NLL = -log(P(y_true | x)). A perfect model assigning a probability of 1.0 to the correct class has an NLL of 0, while incorrect or uncertain predictions yield higher, unbounded positive values. During training, minimizing NLL is equivalent to maximum likelihood estimation (MLE), pushing the model to increase the probability mass on the correct answers in the training data. Its logarithmic nature heavily penalizes high-confidence errors (e.g., predicting 0.99 for a wrong class), making it a sensitive measure of calibration.
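The equivalence with maximum likelihood estimation follows from the logarithm being monotonic. In the sketch below, \(\theta\) denotes the model parameters and \((x_i, y_i)\) the \(N\) training examples, which are assumed to be independent; these symbols are introduced here for illustration.

```latex
\hat{\theta}_{\mathrm{MLE}}
  = \arg\max_{\theta} \prod_{i=1}^{N} P_\theta(y_i \mid x_i)
  = \arg\max_{\theta} \sum_{i=1}^{N} \log P_\theta(y_i \mid x_i)
  = \arg\min_{\theta} \left( -\sum_{i=1}^{N} \log P_\theta(y_i \mid x_i) \right)
```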
Related Terms
Negative Log-Likelihood (NLL) is a core component of a broader evaluation and calibration toolkit. These related concepts define the metrics, methods, and frameworks for ensuring model confidence is trustworthy.
Proper Scoring Rule
A proper scoring rule is a function that measures the quality of probabilistic predictions, designed so that a forecaster maximizes their expected score by reporting their true subjective probability. NLL is a strictly proper scoring rule, meaning it uniquely incentivizes honest reporting of the model's best estimate. Other examples include the Brier score (mean squared error for probabilities) and the spherical score. Their mathematical properties make them essential for training and evaluating calibrated classifiers, as they directly penalize overconfidence and underconfidence.
Expected Calibration Error (ECE)
Expected Calibration Error (ECE) is a scalar summary metric that quantifies miscalibration by binning predictions based on confidence and computing the weighted average of the absolute difference between a model's average predicted confidence and its empirical accuracy within each bin.
- Calculation: Partition predictions into M bins (e.g., 0.0-0.1, 0.1-0.2). For each bin, compute average confidence and actual accuracy. ECE is the weighted sum of |confidence - accuracy| across bins.
- Purpose: Provides a single-number diagnostic. A perfectly calibrated model has an ECE of 0.
- Limitation: Depends on binning scheme and can mask finer-grained miscalibration. NLL, as a loss function, provides a continuous, unbinned measure during training.
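The calculation above can be sketched with equal-width bins as follows. The function name, the choice of 10 bins, and the toy predictions are assumptions for illustration; other binning schemes will give different values.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width confidence bins over [0, 1]."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes in the last bin
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the fraction of predictions it holds
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Four predictions at 95% confidence, only three of them correct
confidences = [0.95, 0.95, 0.95, 0.95]
correct = [True, True, True, False]
ece = expected_calibration_error(confidences, correct)
print(ece)
```

Here every prediction falls in the top bin, where average confidence is 0.95 and accuracy is 0.75, so the ECE is their gap of 0.20.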
Brier Score
The Brier score is a proper scoring rule for probabilistic predictions, defined as the mean squared error between the predicted probability vector and the one-hot encoded true label. For a single sample, it is calculated as the sum of squared differences between each predicted probability and its corresponding binary outcome.
- Interpretation: Lower scores are better, with 0 being perfect. It decomposes into calibration loss and refinement loss (sharpness).
- Comparison to NLL: Both are proper scores. The Brier score is bounded and more sensitive to probabilities near 0.5, while NLL heavily penalizes assigning extremely low probability to the correct class (-log(p) → ∞ as p → 0). NLL is more commonly used as a training loss for classification neural networks.
Post-Hoc Calibration
Post-hoc calibration refers to techniques applied to a trained model's outputs—without retraining the model itself—to improve the alignment between its predicted confidence scores and true empirical correctness. These methods use a held-out calibration set.
Common Methods:
- Temperature Scaling: Applies a single scalar (temperature) to soften or sharpen logits before softmax.
- Platt Scaling: Fits a logistic regression model to the logits (common for binary classification).
- Isotonic Regression: Fits a non-parametric, monotonically non-decreasing, piecewise constant function.
NLL is often used as the objective to optimize the parameters (e.g., the temperature) during this post-processing stage.
Calibration-Aware Training
Calibration-aware training integrates calibration objectives directly into the model training process, aiming to produce intrinsically well-calibrated models without needing post-hoc correction. This contrasts with applying calibration as a separate post-processing step.
Techniques include:
- Label Smoothing: Replaces hard one-hot labels with a smoothed distribution (e.g., 0.9 for true class, 0.1/(K-1) for others), which regularizes the model and reduces overconfidence.
- Focal Loss: Modifies standard cross-entropy (NLL) to down-weight well-classified examples, indirectly affecting calibration by focusing learning on harder samples.
- Explicit Regularization: Adding a penalty term to the NLL loss based on a calibration metric like MMCE (Maximum Mean Calibration Error).
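The first two techniques can be sketched for a single sample as follows. The smoothing scheme mirrors the one described above (the true class keeps 1 - epsilon and the remainder is spread over the other K-1 classes); the function names, epsilon of 0.1, and gamma of 2.0 are illustrative assumptions.

```python
import math

def smoothed_targets(true_class, num_classes, epsilon=0.1):
    """Label smoothing: 1 - epsilon on the true class, epsilon / (K - 1) elsewhere."""
    off_value = epsilon / (num_classes - 1)
    return [1.0 - epsilon if k == true_class else off_value
            for k in range(num_classes)]

def cross_entropy(targets, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0.0)

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample: NLL scaled by (1 - p_true) ** gamma."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

targets = smoothed_targets(true_class=0, num_classes=5)
probs = [0.6, 0.1, 0.1, 0.1, 0.1]
print(targets, cross_entropy(targets, probs))
print(focal_loss(0.9), -math.log(0.9))  # the well-classified sample is down-weighted
```

With gamma set to 0 the focal loss reduces to the plain NLL, which makes the down-weighting factor easy to see in isolation.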
Reliability Diagram
A reliability diagram is a visual diagnostic tool for assessing model calibration. It plots a model's average predicted confidence (x-axis) against its observed empirical accuracy (y-axis) across multiple confidence bins.
- Interpretation: Points on the diagonal (y=x) indicate perfect calibration. Points above the diagonal signify underconfidence (accuracy exceeds confidence). Points below signify overconfidence (confidence exceeds accuracy).
- Usage: Provides an intuitive, graphical complement to scalar metrics like ECE or NLL. It helps identify where in the confidence spectrum miscalibration occurs. The diagram is constructed from the same binned data used to compute ECE.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.