A proper scoring rule is a function that assigns a numerical score to a probabilistic forecast, where the expected score is optimized only when the forecaster reports their true belief about the likelihood of an event. This property, known as strict propriety, provides an incentive for honest reporting, making these rules foundational for training and evaluating calibrated models. Canonical examples include the Brier score for classification and the negative log-likelihood (NLL).
Proper Scoring Rule

What is a Proper Scoring Rule?
A proper scoring rule is a function that measures the quality of probabilistic predictions, incentivizing the forecaster to report their true subjective probability.
Proper scoring rules evaluate two key aspects of a probabilistic prediction: calibration, where predicted confidence matches empirical accuracy, and refinement or sharpness, which measures how decisively the forecast distinguishes between outcomes. In machine learning, they serve as both training loss functions and post-training evaluation metrics, directly linking model optimization to the goal of producing reliable, truthful confidence estimates.
Key Examples of Proper Scoring Rules
Proper scoring rules are the fundamental metrics for evaluating probabilistic predictions. These canonical examples each measure a different aspect of forecast quality while incentivizing honesty.
Brier Score
The Brier score is a proper scoring rule for binary and multi-class predictions, defined as the mean squared error between the predicted probability vector and the one-hot encoded true outcome. For binary classification with predicted probability (\hat{p}) and true outcome (y \in \{0,1\}), it is calculated as (BS = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)^2).
- Properties: It is strictly proper, meaning a forecaster minimizes their expected score only by reporting their true belief. It simultaneously measures calibration (alignment of confidence and accuracy) and refinement (sharpness of predictions).
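- Interpretation: For the binary form above, scores range from 0 to 1; the multi-class form, which sums squared errors over all classes, ranges from 0 to 2. In both cases 0 represents a perfect forecast, and a lower score indicates better predictive performance.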
- Use Case: The standard metric for evaluating the overall quality of probabilistic classifiers in weather forecasting and many machine learning benchmarks.
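The definition above can be sketched in a few lines of plain Python. The function names `brier_score` and `multiclass_brier_score` are illustrative, not taken from any library, and the multi-class version assumes the convention of summing squared errors over classes.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def multiclass_brier_score(prob_vectors, labels):
    """Mean over samples of the squared distance to the one-hot outcome.

    Squared errors are summed over classes (an assumed convention),
    so a single sample scores between 0 and 2.
    """
    total = 0.0
    for probs, label in zip(prob_vectors, labels):
        total += sum((p - (1.0 if k == label else 0.0)) ** 2
                     for k, p in enumerate(probs))
    return total / len(labels)


# Three binary forecasts: squared errors 0.04, 0.09, 0.01.
binary = brier_score([0.8, 0.3, 0.9], [1, 0, 1])

# Full confidence in the wrong class gives the multi-class maximum.
worst_multiclass = multiclass_brier_score([[1.0, 0.0, 0.0]], [2])
```

The binary example averages to roughly 0.047, while the fully confident wrong multi-class forecast reaches the upper bound of 2.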
Negative Log-Likelihood (NLL)
Negative Log-Likelihood (NLL), also known as the log score, is a strictly proper scoring rule that measures the quality of a probabilistic forecast by penalizing low probability assigned to the observed outcome. For a model predicting probability (\hat{p}(y|x)) for the true class (y), it is (NLL = -\log \hat{p}(y|x)).
- Properties: It is local, meaning the score depends only on the probability assigned to the actual outcome, not the entire distribution. It is the primary loss function used to train most modern probabilistic classifiers (e.g., via cross-entropy).
- Interpretation: A lower NLL indicates a better model. It is sensitive to extreme overconfidence; assigning near-zero probability to a correct event yields an extremely high (poor) penalty.
- Use Case: The foundational training objective and evaluation metric for classification models, from logistic regression to large neural networks.
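A minimal sketch of the NLL computation, assuming predictions are given as per-class probability lists. The function name and the clipping constant `eps` are illustrative choices, added only to avoid taking the logarithm of zero.

```python
import math


def negative_log_likelihood(prob_vectors, labels, eps=1e-12):
    """Average of -log(probability assigned to the true class)."""
    total = 0.0
    for probs, label in zip(prob_vectors, labels):
        # Only the true-class probability is used: this is the "local" property.
        # eps is an illustrative clipping value, not a standard.
        total += -math.log(max(probs[label], eps))
    return total / len(labels)


# Confident and right: small penalty.
confident_right = negative_log_likelihood([[0.9, 0.1]], [0])

# Confident and wrong: the true class got probability 0.001.
confident_wrong = negative_log_likelihood([[0.999, 0.001]], [1])
```

The second forecast is penalized about 65 times more heavily than the first, which illustrates the sensitivity to overconfidence described above.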
Logarithmic Score (General Case)
The logarithmic score is the generalization of NLL to full predictive distributions, not just classification. For a probabilistic forecast represented by a density (f) and an observed outcome (y), the score is (S(f, y) = \log f(y)).
- Properties: It is the premier example of a strictly proper local scoring rule. Its propriety is foundational to information theory; maximizing the expected log score is equivalent to minimizing the Kullback-Leibler divergence between the true and forecast distributions.
- Interpretation: Higher scores are better. For discrete probabilities the score is at most zero; for densities, which can exceed 1, it can be positive. It strongly discourages forecasts that are overprecise (too narrow) or that ignore plausible outcomes.
- Use Case: The standard for evaluating density forecasts in fields like econometrics, statistics, and Bayesian modeling, where predicting a full distribution is required.
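As an illustration for density forecasts, the sketch below evaluates the log score of a Gaussian predictive distribution. The Gaussian choice and the function name `gaussian_log_score` are assumptions made for the example; any density could be substituted.

```python
import math


def gaussian_log_score(mu, sigma, y):
    """log f(y) for a Normal(mu, sigma^2) predictive density."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (y - mu) / sigma
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z * z


# Same observation y = 2.0, two forecasts centered at 0.
well_spread = gaussian_log_score(mu=0.0, sigma=1.0, y=2.0)
overprecise = gaussian_log_score(mu=0.0, sigma=0.1, y=2.0)

# A narrow forecast that is also right can score above zero,
# because a density may exceed 1.
sharp_and_right = gaussian_log_score(mu=0.0, sigma=0.1, y=0.0)
```

The overprecise forecast scores far worse than the wider one when the observation falls outside its narrow range, while the same narrow forecast is rewarded when the observation lands at its center.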
Spherical Scoring Rule
The spherical scoring rule is a strictly proper scoring rule that measures the cosine similarity between the predicted probability vector and the one-hot vector representing the true outcome. For a prediction (\mathbf{p} = (p_1, \dots, p_k)) and true outcome index (j), the score is (S(\mathbf{p}, j) = \frac{p_j}{\|\mathbf{p}\|_2}), where (\|\mathbf{p}\|_2) is the Euclidean norm.
- Properties: It is bounded between 0 and 1, with 1 being a perfect prediction. Unlike the log score, it is less sensitive to extreme errors, providing a more 'gentle' penalty.
- Interpretation: Encourages forecasters to be honest but does not penalize incorrect low-probability assignments as severely as the logarithmic score. It emphasizes the relative ranking of probabilities.
- Use Case: Used in contexts where forecasters may be risk-averse to the extreme penalties of the log score, or where a bounded, interpretable score is preferred.
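The formula can be checked with a short sketch. The function name `spherical_score` is illustrative, and labels are assumed to be zero-based indices.

```python
import math


def spherical_score(probs, label):
    """Probability of the true class divided by the Euclidean norm of the forecast."""
    norm = math.sqrt(sum(p * p for p in probs))
    return probs[label] / norm


perfect = spherical_score([0.0, 1.0, 0.0], 1)      # all mass on the true class
uniform = spherical_score([1 / 3, 1 / 3, 1 / 3], 1)  # no information
worst = spherical_score([1.0, 0.0, 0.0], 1)        # all mass on a wrong class
```

The three cases give 1, 1/√3 (about 0.577), and 0, so even the worst possible forecast receives a finite score, unlike under the logarithmic score.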
Ranked Probability Score (RPS)
The Ranked Probability Score (RPS) is a strictly proper scoring rule for ordinal outcomes, where categories have a natural order (e.g., 'low', 'medium', 'high'). It generalizes the Brier score by penalizing forecasts less if probability mass is placed on categories close to the true outcome.
- Calculation: For cumulative forecast distribution (F) and cumulative true distribution (G), (RPS = \sum_{i=1}^{k} (F_i - G_i)^2). It measures the squared error between the cumulative distributions.
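- Properties: It respects the ordering of categories. A forecast of (0.1, 0.8, 0.1) for a 'medium' outcome scores better than (0.8, 0.1, 0.1), as the former places more mass on the correct category. The ordering matters even when the true category receives the same probability: for a 'high' outcome, (0.1, 0.8, 0.1) scores 0.82 while (0.8, 0.1, 0.1) scores 1.45, because the first forecast puts most of its mass on the adjacent category.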
- Use Case: The standard metric for evaluating probabilistic forecasts of ordered categorical variables, such as in weather forecasting (e.g., precipitation categories) or survey analysis (e.g., Likert scale responses).
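A small sketch of the unnormalized RPS as defined above, for a single forecast. The function name is illustrative, and categories are assumed to be indexed from 0 in their natural order.

```python
from itertools import accumulate


def ranked_probability_score(probs, label):
    """Sum of squared differences between cumulative forecast and cumulative outcome."""
    cum_forecast = list(accumulate(probs))
    # The cumulative outcome steps from 0 to 1 at the true category.
    cum_outcome = [1.0 if i >= label else 0.0 for i in range(len(probs))]
    return sum((f - g) ** 2 for f, g in zip(cum_forecast, cum_outcome))


# Categories: 0 = 'low', 1 = 'medium', 2 = 'high'. The outcome is 'high'.
# Both forecasts give the true category probability 0.1.
near_miss = ranked_probability_score([0.1, 0.8, 0.1], 2)
far_miss = ranked_probability_score([0.8, 0.1, 0.1], 2)
```

Here the forecast concentrated on 'medium' scores about 0.82 and the one concentrated on 'low' about 1.45, so the miss that lands closer to the true category is penalized less.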
Continuous Ranked Probability Score (CRPS)
The Continuous Ranked Probability Score (CRPS) is the generalization of the Ranked Probability Score (RPS) to continuous real-valued variables. It is a strictly proper scoring rule that compares a full predictive cumulative distribution function (CDF) against the observed value.
- Definition: (CRPS(F, y) = \int_{-\infty}^{\infty} (F(x) - \mathbb{1}\{x \geq y\})^2 \, dx), where (F) is the predictive CDF and (y) is the observed scalar. It can be interpreted as the integral of the Brier score for the binary event (\{Y \leq x\}) over all thresholds (x).
- Properties: It measures both calibration and sharpness of a predictive distribution. A lower CRPS is better. It reduces to the mean absolute error if the forecast is a deterministic point.
- Use Case: The dominant metric for evaluating probabilistic forecasts of continuous quantities, ubiquitous in fields like numerical weather prediction, energy load forecasting, and quantitative finance.
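When the forecast is a finite ensemble of samples, the integral can be evaluated through the known identity CRPS(F, y) = E|X - y| - ½ E|X - X'|, where X and X' are independent draws from F. The sketch below applies that identity to the empirical distribution of the samples; the function name `crps_ensemble` is illustrative, and the double loop is quadratic in the ensemble size.

```python
def crps_ensemble(samples, y):
    """CRPS of the empirical distribution of `samples` against observation `y`."""
    n = len(samples)
    # Mean absolute distance from the ensemble members to the observation.
    term_obs = sum(abs(x - y) for x in samples) / n
    # Mean absolute distance between all pairs of ensemble members.
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


# A single-member "ensemble" is a point forecast: CRPS equals the absolute error.
point = crps_ensemble([3.0], 5.0)

# Two members straddling the observation.
spread = crps_ensemble([4.0, 6.0], 5.0)
```

The point forecast scores 2.0, matching its absolute error, and the two-member ensemble scores 0.5, which agrees with integrating the squared CDF difference directly.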
How Proper Scoring Rules Work: The Incentive Property
A proper scoring rule's defining characteristic is its incentive property, which mathematically guarantees that a forecaster maximizes their expected score only by reporting their true subjective probability.
The incentive property is the formal mechanism that makes a scoring rule 'proper.' For any probabilistic forecast, the forecaster's expected score is calculated over the possible outcomes, weighted by the forecaster's own belief. A rule is strictly proper if this expected score is uniquely maximized (or uniquely minimized, for loss-oriented rules such as the Brier score and NLL) when the reported probability distribution matches the forecaster's true belief. This creates a game-theoretic alignment, eliminating the strategic advantage of dishonest reporting. Canonical examples include the Brier score for classification and negative log-likelihood for general probability estimation.
This property is foundational for model calibration and evaluation-driven development. During training, using a proper scoring rule as a loss function (like cross-entropy) pushes the model toward calibrated probabilities, although with finite data and overfitting the trained model can still be overconfident. In evaluation, it provides a trustworthy metric for comparing probabilistic models. Because the expected score is optimized only by the true probabilities, a better score on held-out data reflects more accurate and honest probability estimates, which is critical for uncertainty quantification and reliable decision-making under risk.
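The incentive property can be checked numerically for a single binary event. The sketch below compares the expected Brier score with the expected absolute-error score, which is a standard example of an improper rule. The belief of 0.7 and the 0.01 grid are arbitrary choices for the illustration.

```python
def expected_brier(report, belief):
    """Expected Brier score of reporting `report` when the event has probability `belief`."""
    return belief * (report - 1) ** 2 + (1 - belief) * report ** 2


def expected_absolute(report, belief):
    """Expected absolute-error score, an improper rule used as a contrast."""
    return belief * (1 - report) + (1 - belief) * report


belief = 0.7
grid = [i / 100 for i in range(101)]

# Both scores are losses, so the forecaster wants the smallest expected value.
best_brier = min(grid, key=lambda q: expected_brier(q, belief))
best_absolute = min(grid, key=lambda q: expected_absolute(q, belief))
```

Under the Brier score the best report is the belief itself, 0.7. Under the absolute-error score the best report is 1.0, so a forecaster is rewarded for exaggerating, which is exactly what propriety rules out.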
Comparison of Common Proper Scoring Rules
A feature comparison of canonical proper scoring rules used to evaluate the quality of probabilistic predictions, highlighting their mathematical properties and typical use cases.
| Feature / Property | Brier Score | Logarithmic Score (NLL) | Spherical Score |
|---|---|---|---|
| Mathematical Form (Binary) | 1/N Σ (p_i - y_i)² | -1/N Σ [y_i log(p_i) + (1-y_i)log(1-p_i)] | p_i^(y_i) (1-p_i)^(1-y_i) / √[p_i² + (1-p_i)²] |
| Properness Guarantee | Yes | Yes | Yes |
| Strictly Proper | Yes | Yes | Yes |
| Local Scoring Rule | No | Yes | No |
| Sensitive to Extreme Errors | No | Yes | No |
| Interpretation | Mean Squared Error | Negative Log-Likelihood | Cosine Similarity to True Vector |
| Range of Values | [0, 1] | [0, ∞) | [0, 1] |
| Common Application | General classification calibration | Training loss & model comparison | Probability forecast evaluation |
| Differentiable | Yes | Yes | Yes |
| Decomposable into Components | Reliability − Resolution + Uncertainty | | |
| Handles Multi-Class | Yes | Yes | Yes |
| Penalty for Overconfidence | Quadratic | Extreme (log(0)→∞) | Moderate |
Frequently Asked Questions
A proper scoring rule is a fundamental concept in probabilistic machine learning that measures the quality of a model's predicted probabilities. These rules are 'proper' because they are designed to incentivize a forecaster to report their true, honest belief, making them essential for training and evaluating well-calibrated models.
A proper scoring rule is a function that assigns a numerical score to a probabilistic prediction, where a lower score indicates a better prediction, and the score is minimized only when the forecaster reports their true subjective probability distribution. This property ensures the rule incentivizes honesty, making it a cornerstone for evaluating and training models where calibrated uncertainty is critical. The two most canonical examples are the Brier score for classification and the negative log-likelihood (NLL) for general probabilistic forecasting.
Related Terms
Proper scoring rules are a foundational concept within model calibration. These related terms define the specific metrics, methods, and frameworks used to measure and achieve accurate probabilistic predictions.
Brier Score
The Brier score is a proper scoring rule that measures the mean squared error between predicted probabilities and true binary outcomes. It simultaneously evaluates calibration (how well confidence matches accuracy) and refinement (the sharpness of the predictions). A lower score indicates better performance.
- Example: If a binary classifier predicts rain with 80% confidence and it does rain, the Brier score contribution is (0.8 - 1)^2 = 0.04.
Negative Log-Likelihood (NLL)
Negative Log-Likelihood (NLL) is a proper scoring rule that measures the quality of a model's probabilistic predictions by penalizing low probability assigned to the correct class. It is the standard loss function for training probabilistic classifiers and serves as a fundamental evaluation metric.
- Mechanism: NLL is calculated as -log(p(y_true | x)), where p is the model's predicted probability for the true label. Full confidence in the correct label yields an NLL of 0, while incorrect, high-confidence predictions are heavily penalized.
Expected Calibration Error (ECE)
Expected Calibration Error (ECE) is a scalar summary metric that quantifies miscalibration. It works by:
- Binning predictions into M intervals based on predicted confidence (e.g., [0.0, 0.1], [0.1, 0.2]).
- Calculating the absolute difference between the average confidence and the empirical accuracy within each bin.
- Averaging these differences, weighted by the number of samples in each bin.
A lower ECE indicates better calibration, with 0 representing perfect alignment.
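The three steps can be sketched as follows, assuming equal-width bins and inputs given as a list of top-class confidences plus a parallel list of 0/1 correctness flags. The function name and the default of 10 bins are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Sample-weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Four predictions at 0.95 confidence (3 correct), two at 0.55 (1 correct).
ece = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95, 0.55, 0.55],
    [1, 1, 1, 0, 1, 0],
)
```

In this example the high-confidence bin has a gap of 0.20 with weight 4/6 and the other bin a gap of 0.05 with weight 2/6, giving an ECE of 0.15.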
Post-Hoc Calibration
Post-hoc calibration refers to techniques applied to a trained model's outputs—without retraining—to improve probability calibration. A separate calibration set is used to fit a simple mapping function. Common methods include:
- Temperature Scaling: Divides the logits by a single learned scalar (the temperature) before the softmax, softening or sharpening the predicted probabilities without changing the predicted class.
- Platt Scaling: Fits a logistic regression model to the outputs of a binary classifier.
- Isotonic Regression: Fits a non-parametric, monotonically non-decreasing, piecewise constant function.
These methods correct for the overconfidence common in modern neural networks.
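As one illustration, temperature scaling can be sketched by choosing the temperature that minimizes NLL on a calibration set. The version below uses a simple grid search for clarity; implementations more commonly fit the temperature with a gradient-based optimizer. The function names, the candidate grid, and the toy calibration data are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fit_temperature(logit_rows, labels, candidates):
    """Grid search for the temperature with the lowest calibration-set NLL."""
    def nll(t):
        losses = (-math.log(softmax(row, t)[y]) for row, y in zip(logit_rows, labels))
        return sum(losses) / len(labels)
    return min(candidates, key=nll)


# Hypothetical calibration set: the model is ~98% confident each time,
# but only three of its four predictions are right.
logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
labels = [0, 0, 0, 1]

temperature = fit_temperature(logits, labels, [0.5 + 0.1 * i for i in range(100)])
calibrated = softmax([4.0, 0.0], temperature)
```

The fitted temperature is greater than 1, which softens the roughly 98% confidence to about 75%, matching the observed accuracy on this toy set.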
Conformal Prediction
Conformal prediction is a distribution-free framework for generating statistically valid prediction sets (for classification) or intervals (for regression) with a guaranteed marginal coverage probability, assuming the calibration and test data are exchangeable. It provides rigorous uncertainty quantification for any underlying model.
- Process: Computes a nonconformity score for each example in a calibration set, then takes a quantile of those scores as a threshold, so that the resulting prediction sets contain the true label with a user-specified probability (e.g., 90%). It is a powerful method for achieving risk-controlling model outputs.
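A minimal sketch of split conformal prediction for classification, assuming the nonconformity score is one minus the probability assigned to the true class (one common choice among several). The function names and the toy calibration data are hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Quantile of the scores 1 - p(true class) on the calibration set."""
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: include every class
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All classes whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


# Hypothetical calibration set: probability the model gave to the true class.
true_class_probs = [0.9, 0.8, 0.7, 0.6, 0.95, 0.85, 0.3, 0.4, 0.2]
cal_probs = [[p, 1.0 - p] for p in true_class_probs]
cal_labels = [0] * len(true_class_probs)

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
uncertain = prediction_set([0.5, 0.35, 0.15], threshold)
confident = prediction_set([0.9, 0.07, 0.03], threshold)
```

With a target coverage of 80%, the threshold is about 0.7, so the ambiguous forecast yields a two-class set while the confident forecast yields a single class. The set size is how the method expresses uncertainty.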
Calibration-Aware Training
Calibration-aware training integrates calibration objectives directly into the model training process, aiming to produce intrinsically well-calibrated models. This contrasts with post-hoc methods. Techniques include:
- Label Smoothing: Replaces hard 0/1 labels with smoothed targets (e.g., 0.9/0.1), preventing overconfidence.
- Focal Loss: Down-weights loss for well-classified examples, indirectly mitigating overconfidence on easy samples.
- Bayesian Methods: Treats model parameters as distributions, naturally capturing predictive uncertainty.
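As a small illustration of the first technique, label smoothing can be sketched as mixing the one-hot target with a uniform distribution. The function name and the choice of `epsilon` are illustrative; with two classes and `epsilon=0.2` this reproduces the 0.9/0.1 targets mentioned above.

```python
def smooth_labels(label, num_classes, epsilon=0.1):
    """Target = (1 - epsilon) * one_hot + epsilon / num_classes."""
    off = epsilon / num_classes
    return [(1.0 - epsilon) + off if k == label else off
            for k in range(num_classes)]


# Two classes, true class index 1: the hard target [0, 1] becomes about [0.1, 0.9].
targets = smooth_labels(label=1, num_classes=2, epsilon=0.2)
```

Training against these softened targets with cross-entropy means the loss is no longer minimized by predicting probability 1 for the labeled class, which is how the technique discourages overconfidence.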

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.