Label smoothing is a regularization technique that replaces hard, one-hot encoded target labels with a weighted mixture of the original label and a uniform distribution over all classes. This prevents the model from becoming overconfident by penalizing extremely large logits for the correct class, which reduces overfitting and improves model calibration. It is commonly applied to the cross-entropy loss function and is a form of output distribution regularization that discourages the model from assigning a probability of exactly 1.0 to any single class.
Label Smoothing

What is Label Smoothing?
Label smoothing is a regularization technique used in machine learning, particularly for classification tasks, to improve model generalization and calibration by modifying the target labels during training.
The technique introduces a smoothing parameter, epsilon (ε), which controls the amount of smoothing. In the most common formulation, a probability mass of ε is spread uniformly across all K classes, so the correct class receives 1 - ε + ε/K and every other class receives ε/K. A variant spreads ε over only the K - 1 incorrect classes, leaving the correct class with exactly 1 - ε. Either way, this acts as a form of confidence penalty, encouraging the model to learn more robust and generalizable features rather than memorizing the training data. It is closely related to concepts in confidence scoring and uncertainty quantification, as it teaches the model to express a baseline level of doubt, which tends to make its predicted probabilities more reliable indicators of true likelihood.
Key Benefits of Label Smoothing
Softening the training targets is a simple adjustment, but it provides several key advantages for training more robust and reliable neural networks.
Improves Model Calibration
Label smoothing directly combats overconfidence, a common pathology where neural networks output extremely high probabilities (e.g., 0.99) for the predicted class, even when uncertain. By preventing the model from assigning all probability mass to a single label, smoothed labels encourage the network to output more calibrated probabilities. A calibrated model's predicted confidence score better reflects its true likelihood of being correct, which is critical for decision-making under uncertainty and downstream tasks like selective classification.
Enhances Generalization
By acting as a form of regularization, label smoothing reduces the model's tendency to overfit to the training data. Hard labels encourage the model to drive the logit for the correct class toward infinity, which can make the decision boundaries overly sharp and sensitive to noise. The uniform component of the smoothed target penalizes overly large logits, promoting smoother decision boundaries. This often improves test accuracy, especially in low-data regimes or with noisy labels, and can help on out-of-distribution (OOD) data.
Reduces Label Noise Sensitivity
Real-world datasets often contain incorrect labels (label noise). Training with hard one-hot labels on noisy data pushes the model to memorize these errors, harming generalization. Label smoothing provides a more noise-robust objective by implicitly telling the model that labels are not absolute. In expectation, it is equivalent to replacing each label with a uniformly drawn class with probability ε, which makes the model less likely to overfit to specific, potentially erroneous, training examples. This builds resilience into the learning process.
Mitigates the Logit Gap Problem
The logit gap is the difference between the logit of the correct class and the largest incorrect logit. Training with cross-entropy and hard targets keeps pushing this gap wider, because the loss only approaches zero as the gap goes to infinity. An oversized logit gap can make the model brittle and overconfident. With smoothed targets, the loss is minimized at a finite gap, so the optimizer has no incentive to grow the logits without bound. This tends to give more stable optimization, which is particularly beneficial for training very deep networks or transformers.
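The size of that finite gap can be worked out directly. Cross-entropy against a fixed target is minimized when the softmax output equals the target. The sketch below assumes the formulation in which every class receives ε/K and the true class receives 1 - ε + ε/K; the function names are illustrative, not from any library.

```python
import math

def optimal_logit_gap(num_classes, epsilon):
    """Logit gap (true class minus any other class) that minimizes smoothed cross-entropy.

    Assumes targets of 1 - eps + eps/K for the true class and eps/K elsewhere.
    """
    if epsilon <= 0.0:
        return math.inf  # hard labels: the loss has no finite minimizer
    true_p = 1.0 - epsilon + epsilon / num_classes
    other_p = epsilon / num_classes
    return math.log(true_p / other_p)

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

gap = optimal_logit_gap(num_classes=10, epsilon=0.1)
# Logits with exactly this gap reproduce the smoothed target distribution.
probs = softmax([gap] + [0.0] * 9)
print(round(gap, 3), round(probs[0], 3))  # 4.511 0.91
```

For K = 10 and ε = 0.1 the optimal gap is log(91), roughly 4.5, whereas with ε = 0 no finite gap is optimal.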
Serves as a Prior for Human Ambiguity
For many tasks, especially in natural language processing and computer vision, the "correct" label is not always absolute. There can be legitimate ambiguity. For example, an image might be borderline between two classes, or a sentence could have multiple valid interpretations. Hard labels ignore this reality. Label smoothing mixes a uniform distribution over all classes into the target, which loosely plays the role of a prior acknowledging that other classes might be plausible. This better reflects the inherent uncertainty in many real-world labeling tasks, although a uniform mixture cannot express which specific classes are easily confused.
Complements Other Confidence Techniques
Label smoothing is a foundational technique that works synergistically with other methods for confidence scoring and uncertainty quantification. It produces better-calibrated base probabilities, which improves the effectiveness of post-hoc methods like temperature scaling and Platt scaling. It also provides a more sensible starting point for Bayesian Neural Networks (BNNs) and Deep Ensembles. When used within a Retrieval-Augmented Generation (RAG) system, a well-calibrated generator model leads to more reliable composite confidence scores.
Label Smoothing vs. Hard Label Training
A technical comparison of the standard hard label training paradigm and the label smoothing regularization technique, focusing on their impact on model calibration, generalization, and training dynamics.
| Feature / Metric | Hard Label Training | Label Smoothing |
|---|---|---|
| Target Label Distribution | One-hot encoded (e.g., [0, 0, 1, 0]) | Softened mixture (e.g., [0.025, 0.025, 0.925, 0.025] for ε=0.1 and K=4) |
| Primary Objective | Minimize cross-entropy loss for exact class match | Minimize cross-entropy loss against a softened target distribution |
| Effect on Logits | Encourages logit for true class to approach +∞, others to -∞ | Penalizes excessively large logit gaps, preventing overconfident outputs |
| Model Calibration | Often leads to overconfident, poorly calibrated predictions | Typically improves calibration, aligning confidence with accuracy |
| Generalization | Can overfit to noisy labels and memorize training data | Acts as a regularizer, often improving test accuracy and robustness |
| Smoothing Parameter (ε) | Not applicable (ε = 0) | Typically set between 0.05 and 0.2 (e.g., 0.1) |
| Resilience to Label Noise | Lower; tends to memorize mislabeled examples | Higher; less prone to fitting mislabeled examples |
| Gradient with Respect to Logits (softmax - target) | Vanishes only as the logit gap goes to infinity, so logits keep growing | Vanishes at a finite logit gap, so logits stop growing |
| Common Use Cases | Baseline training, tasks with extremely clean labels | Improving calibration, training with potentially noisy labels, distillation |
Practical Considerations & Implementation
Label smoothing is a regularization technique applied during training to improve model calibration and generalization. Its implementation involves specific mathematical adjustments and trade-offs.
Mathematical Formulation
Label smoothing replaces a one-hot encoded target vector (e.g., [0, 0, 1, 0]) with a weighted mixture. For a classification task with K classes, the smoothed target vector y_smooth is:
y_smooth = (1 - ε) * y_onehot + (ε / K) * 1
Here 1 is the all-ones vector of length K and ε (epsilon) is the smoothing hyperparameter, typically a small value like 0.1. The true class c therefore receives 1 - ε + ε/K and every other class receives ε/K. This distributes a small probability mass uniformly across all classes, preventing the model from becoming overconfident by pushing logits to extreme values.
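As a concrete check of the formula, the following minimal sketch builds a smoothed target for a single example in plain Python. The helper name `smooth_labels` is illustrative only.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Return (1 - epsilon) * one_hot + epsilon / K as a plain list."""
    uniform = epsilon / num_classes
    target = [uniform] * num_classes      # epsilon / K for every class
    target[true_class] += 1.0 - epsilon   # plus (1 - epsilon) on the true class
    return target

target = smooth_labels(true_class=2, num_classes=4, epsilon=0.1)
print([round(p, 3) for p in target])  # [0.025, 0.025, 0.925, 0.025]
```

The entries still sum to 1, so the result remains a valid probability distribution.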
Hyperparameter Tuning (ε)
The smoothing strength ε is the critical hyperparameter.
- Typical Range: 0.05 to 0.2. A common default is 0.1.
- Effects:
  - Low ε (e.g., 0.01): Minimal regularization; model may still become overconfident.
  - High ε (e.g., 0.3): Can over-regularize, making the model too uncertain and potentially harming discriminative power.
- Tuning Strategy: Treat ε like other regularization parameters (e.g., dropout rate). Perform a grid search on a validation set, monitoring both accuracy and calibration metrics like Expected Calibration Error (ECE).
Impact on Loss Function & Logits
Label smoothing directly modifies the cross-entropy loss. With hard labels, the loss incentivizes infinite logits for the correct class. Smoothing changes this objective:
- The model is rewarded less for pushing the correct-class logit ever higher, and it is penalized when it drives the probability of the incorrect classes toward zero.
- This results in a finite optimal logit gap and a softer softmax output distribution.
- Practical Effect: The model's learned representations often become more compact and better separated in the latent space, as it's not forced to maximize margins excessively.
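One way to see this change in the objective is to split the smoothed loss into two terms: (1 - ε) times the usual hard-label negative log-likelihood, plus ε times the cross-entropy against a uniform target. The sketch below checks that identity numerically, assuming the all-class formulation (ε/K on every class); the function names are illustrative.

```python
import math

def log_softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def smoothed_cross_entropy(logits, true_class, epsilon):
    """Cross-entropy against the target (1 - eps) * one_hot + eps / K."""
    log_p = log_softmax(logits)
    k = len(logits)
    target = [epsilon / k] * k
    target[true_class] += 1.0 - epsilon
    return -sum(t * lp for t, lp in zip(target, log_p))

def decomposed_loss(logits, true_class, epsilon):
    """The same loss written as hard-label NLL plus a uniform-target term."""
    log_p = log_softmax(logits)
    hard_nll = -log_p[true_class]
    uniform_ce = -sum(log_p) / len(log_p)
    return (1.0 - epsilon) * hard_nll + epsilon * uniform_ce

logits = [8.0, 0.0, -8.0]
direct = smoothed_cross_entropy(logits, 0, 0.1)
split = decomposed_loss(logits, 0, 0.1)
print(abs(direct - split) < 1e-12)  # True
```

The uniform-target term grows as any class probability approaches zero, which is what makes extreme logits costly under smoothing.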
When to Use It
Label smoothing is particularly beneficial in scenarios involving:
- Noisy Labels: Training datasets with annotation errors or ambiguity. Smoothing acts as a regularizer against overfitting to incorrect labels.
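- Teacher Models in Knowledge Distillation (with caution): Soft targets carry more information (dark knowledge) than hard labels, but results for smoothed teachers are mixed. Published experiments have reported that a teacher trained with label smoothing can convey less about which classes resemble each other, sometimes producing a weaker student, so validate this choice empirically.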
- Models Prone to Overconfidence: Large neural networks, especially those trained with cross-entropy without other strong regularizers, frequently exhibit poor calibration. Smoothing is a simple, effective countermeasure.
- Long Training Runs: Because it tempers the training objective directly, it can sometimes reduce the need for very early stopping.
Trade-offs and Limitations
While useful, label smoothing is not a universal solution.
- Potential for Underfitting: Excessive smoothing (high ε) can limit peak performance (top-1 accuracy) on clean datasets by discouraging the model from becoming sufficiently discriminative.
- Not a Substitute for Data Quality: It helps with label noise but cannot compensate for fundamentally poor or non-representative data.
- Interaction with Other Techniques: Its effect can be complementary or redundant with other regularizers like dropout, weight decay, or data augmentation. Their combined strength may need adjustment.
- Task Specificity: Most beneficial for closed-set classification. Its utility for regression, dense prediction (e.g., segmentation), or language modeling is less standard and requires careful adaptation.
Implementation in Code
Implementing label smoothing is straightforward in frameworks like PyTorch or TensorFlow. The core step is modifying the target labels before computing the loss.
PyTorch Example:
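```python
import torch
import torch.nn.functional as F


def smooth_one_hot(labels, num_classes, epsilon=0.1):
    """Convert integer labels of shape (N,) to smoothed targets of shape (N, num_classes).

    The true class receives 1 - epsilon; epsilon is split evenly over the
    other num_classes - 1 classes, so each row sums to 1.
    """
    smooth_labels = torch.full(
        (labels.size(0), num_classes),
        epsilon / (num_classes - 1),
        device=labels.device,
    )
    smooth_labels.scatter_(1, labels.unsqueeze(1), 1.0 - epsilon)
    return smooth_labels


# Usage during the training loop (model, inputs, and targets come from that loop)
logits = model(inputs)  # Model outputs, shape (N, num_classes)
smoothed_targets = smooth_one_hot(targets, num_classes=10, epsilon=0.1)

# Option 1: KL divergence between the smoothed targets and the predictions
loss = F.kl_div(F.log_softmax(logits, dim=1), smoothed_targets, reduction='batchmean')

# Option 2: cross-entropy with class-probability targets (PyTorch 1.10 or later)
loss = F.cross_entropy(logits, smoothed_targets)
```
Key Point: Ensure the loss function is compatible with soft targets. Before PyTorch 1.10, cross_entropy accepted only class indices as the target; from 1.10 onward it also accepts class probabilities, and it offers a built-in label_smoothing argument that takes the original integer labels directly. Use one of the two options shown, not both; they differ only by a constant (the entropy of the targets) and yield the same gradients. Note that this helper spreads ε over the K - 1 incorrect classes, whereas the formula above and the built-in label_smoothing argument spread it over all K classes.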
Frequently Asked Questions
Label smoothing is a regularization technique used primarily in classification tasks to prevent models from becoming overconfident. It works by softening the hard, one-hot encoded target labels, which can improve model calibration and generalization.
Label smoothing is a regularization technique that replaces hard, one-hot encoded target labels with a weighted mixture of the original label and a uniform distribution over all classes. Instead of assigning a probability of 1.0 to the correct class and 0.0 to all others, it assigns a high probability (e.g., about 0.9) to the correct class and spreads the remaining small probability mass (e.g., about 0.1) evenly across the classes. This prevents the model from becoming overconfident by discouraging it from pushing logits to extreme values, which reduces overfitting and often improves calibration error and generalization to unseen data.
Related Terms
Label smoothing is a foundational technique within the broader field of model confidence and uncertainty quantification. The following terms are essential for understanding its purpose, its alternatives, and the metrics used to evaluate its effectiveness.
Calibration Error
Calibration error measures the discrepancy between a model's predicted confidence scores and its actual empirical accuracy. A perfectly calibrated model's confidence of 90% should correspond to being correct 90% of the time. Label smoothing directly targets calibration by preventing models from becoming overconfident (assigning probabilities near 1.0) on training data, which often leads to poor calibration on unseen data. The primary metrics are:
- Expected Calibration Error (ECE): A scalar summary statistic calculated by binning predictions by confidence and averaging the absolute difference between average confidence and accuracy within each bin.
- Reliability Diagram: The visual plot upon which ECE is based, showing binned confidence vs. observed accuracy.
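The binning procedure behind ECE can be sketched in a few lines. This is a minimal illustration using equal-width bins, which is one common convention; the function name and the toy inputs are made up for the example.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width confidence bins.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Toy data: four predictions at 95% confidence, three of them correct.
ece = expected_calibration_error([0.95] * 4, [True, True, True, False])
print(round(ece, 3))  # 0.2
```

Here the model claims 95% confidence but is right 75% of the time, so the single occupied bin contributes a gap of 0.2.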
Negative Log-Likelihood (NLL)
Negative Log-Likelihood (NLL), or log loss, is the standard proper scoring rule used to train most classification models. It is defined as -log(p(y_true)), where p(y_true) is the probability the model assigns to the correct class. This function heavily penalizes confident but incorrect predictions. Label smoothing generalizes this objective to a cross-entropy against a softened target: the weight on the correct class drops from a hard 1 to a slightly smaller value (e.g., 0.9), and the remaining weight is placed on the other classes. This discourages the model from driving the logits for the correct class to extreme values, acting as a form of regularization against overfitting to noisy labels.
Temperature Scaling
Temperature scaling is a post-hoc calibration method applied after a model is trained. It uses a single learned scalar parameter T (the "temperature") to soften the model's output softmax distribution: softmax(logits / T). T is typically fit on a held-out validation set and then applied at inference. Unlike label smoothing, which is applied during training, temperature scaling is a lightweight fix that leaves the model weights and the predicted class unchanged.
- Comparison: Both techniques "soften" outputs. Label smoothing does so by altering the training target, while temperature scaling adjusts the inference-time activation. Temperature scaling is often used to calibrate models already trained with or without label smoothing.
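The softmax(logits / T) operation itself is small enough to show in full. The sketch below only applies a given temperature; fitting T on validation data is omitted, and the function name and example logits are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Compute softmax(logits / T); T > 1 softens the output, T < 1 sharpens it."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.0]
original = softmax_with_temperature(logits, temperature=1.0)
softened = softmax_with_temperature(logits, temperature=2.0)
print(softened[0] < original[0])  # True: the top probability shrinks
```

Dividing by a positive T never changes which class has the highest probability, so accuracy is unaffected and only the confidence values move.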
Selective Classification
Selective classification, or classification with a rejection option, is a paradigm where a model is allowed to abstain from making a prediction when its confidence is below a predefined threshold. This creates a critical trade-off between coverage (the fraction of samples predicted on) and risk (the error rate on those predictions).
- Connection to Label Smoothing: Well-calibrated confidence scores, which label smoothing helps achieve, are a prerequisite for effective selective classification. If a model is overconfident, its confidence scores cannot reliably be used to decide when to abstain.
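The coverage and risk trade-off can be made concrete with a short sketch. The function name, the threshold values, and the toy predictions below are all made up for illustration.

```python
def coverage_and_risk(confidences, correct, threshold):
    """Coverage and selective risk when the model abstains below `threshold`."""
    if not confidences:
        return 0.0, 0.0
    accepted = [hit for conf, hit in zip(confidences, correct) if conf >= threshold]
    coverage = len(accepted) / len(confidences)
    if not accepted:
        return 0.0, 0.0  # the model abstained on everything, so it made no errors
    risk = sum(1 for hit in accepted if not hit) / len(accepted)
    return coverage, risk

confidences = [0.99, 0.90, 0.60, 0.55]
correct = [True, True, False, True]
strict = coverage_and_risk(confidences, correct, threshold=0.8)
lenient = coverage_and_risk(confidences, correct, threshold=0.5)
print(strict, lenient)  # (0.5, 0.0) (1.0, 0.25)
```

Raising the threshold lowers coverage and, when confidence tracks correctness, lowers risk as well; with an overconfident model that second effect weakens.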
Knowledge Distillation
Knowledge distillation is a model compression technique where a small "student" model is trained to mimic the soft probability outputs of a larger, more accurate "teacher" model. The key insight is that the teacher's softened class probabilities (e.g., a cat is 0.9 cat, 0.05 dog, 0.05 horse) contain more information than hard labels.
- Relationship to Label Smoothing: Label smoothing can be seen as a primitive form of distillation in which the "teacher" is fixed and uninformative: it always predicts a uniform distribution, and its output is mixed with the one-hot label. Both techniques train against soft targets to improve generalization, but only a real teacher's probabilities carry information about which classes resemble each other.
Out-of-Distribution (OOD) Detection
Out-of-distribution detection is the task of identifying inputs that are statistically different from the training data distribution. A critical failure mode is when models make overconfident predictions on OOD data.
- Role of Label Smoothing: By regularizing the model away from extreme confidence on in-distribution data, label smoothing can indirectly improve OOD detection. A model trained with label smoothing is less likely to assign a near-1.0 probability to an OOD sample, making its high confidence slightly more meaningful and leaving room for other OOD detection methods (like maximum softmax probability) to be more effective.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.