Numerical stability is the property of an algorithm to produce results that are not disproportionately sensitive to small perturbations in its input data or to the rounding errors inherent in finite-precision arithmetic. In machine learning, particularly during mixed precision inference, an unstable computation can cause underflow (values becoming zero), overflow (values exceeding the maximum representable number), or catastrophic accumulation of rounding error, leading to invalid outputs like NaN (Not a Number) or Inf (infinity).
Numerical Stability

What is Numerical Stability?
Numerical stability is a foundational concept in computational mathematics and machine learning, critical for ensuring reliable results when performing calculations with finite precision.
Maintaining stability is paramount when using reduced formats like FP16 or INT8, which have limited dynamic range. Techniques such as loss scaling for gradients, careful initialization, and the use of BFloat16 (which preserves exponent range) are employed to mitigate risks. The goal is to achieve the performance benefits of lower precision—reduced memory bandwidth and faster computation—without introducing numerical instability that degrades model accuracy or causes runtime failures.
Key Threats to Numerical Stability
Numerical stability in mixed precision computing refers to the avoidance of problematic conditions that can degrade or invalidate model outputs when using reduced precision formats like FP16, BF16, or INT8.
Underflow
Underflow occurs when a computation produces a non-zero result that is smaller than the smallest positive number representable in the chosen floating-point format, causing it to be rounded to zero. This is a critical threat in mixed precision inference, especially when using FP16, which has a limited dynamic range.
- Primary Risk: Gradients or activation values in deep networks can vanish, halting learning or causing dead neurons during training. In inference, it can lead to a complete loss of signal in certain layers.
- Example: In FP16, the smallest positive normalized number is approximately 6.1e-5, and the smallest positive subnormal number is approximately 6.0e-8. A value like 1.0e-9 lies below both and underflows to zero.
- Mitigation: Using formats with a larger exponent range (like BF16), applying loss scaling during training, or strategically casting sensitive operations to higher precision.
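The FP16 thresholds can be checked without special hardware. The following is a minimal sketch that relies on the `e` (half precision) format code of Python's standard `struct` module; `to_fp16` is a hypothetical helper name introduced here for illustration, not a framework API.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

FP16_MIN_NORMAL = 2.0 ** -14      # about 6.10e-5
FP16_MIN_SUBNORMAL = 2.0 ** -24   # about 5.96e-8

# A value far below the smallest subnormal rounds to zero.
underflowed = to_fp16(1.0e-9)

# The product of two representable FP16 values can also underflow.
a = to_fp16(1.0e-4)
b = to_fp16(1.0e-4)
product_fp16 = to_fp16(a * b)     # result stored in FP16: the signal is lost
product_wide = a * b              # result kept in higher precision: about 1e-8

print(underflowed, product_fp16, product_wide)
```

Keeping the product in a wider format, as in the last line, is the "cast sensitive operations to higher precision" mitigation in its simplest form.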
Overflow
Overflow happens when a computation yields a result larger than the maximum finite number representable in the format, causing it to be replaced with infinity (Inf) or the maximum value. This is destructive, as subsequent operations with Inf propagate errors.
- Primary Risk: Exploding gradients during training or nonsensical activation values (such as Inf or NaN) during inference, which can corrupt an entire batch.
- Example: The maximum finite value in FP16 is 65504. A matrix multiplication in a large model can easily produce intermediate values exceeding this limit.
- Mitigation: Normalization techniques (e.g., LayerNorm), gradient clipping, using BF16 (which matches FP32's exponent range), or implementing careful numerical scaling in sensitive operations like softmax.
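Softmax is a convenient illustration of the scaling mitigation. The sketch below simulates FP16 storage with Python's `struct` module and compares a naive softmax against the common max-subtraction variant. The helper `to_fp16` and both function names are assumptions made for this example, and mapping out-of-range values to infinity mimics IEEE 754 behaviour rather than any specific accelerator.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest FP16 value; out-of-range inputs become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct raises instead of saturating, so map the overflow to infinity.
        return math.copysign(math.inf, x)

def softmax_naive_fp16(logits):
    """Exponentiate directly: exp(12) already exceeds the FP16 maximum."""
    exps = [to_fp16(math.exp(v)) for v in logits]
    total = 0.0
    for e in exps:
        total = to_fp16(total + e)
    return [to_fp16(e / total) for e in exps]

def softmax_stable_fp16(logits):
    """Subtract the maximum first so every exponent is <= 0."""
    peak = max(logits)
    exps = [to_fp16(math.exp(v - peak)) for v in logits]
    total = 0.0
    for e in exps:
        total = to_fp16(total + e)
    return [to_fp16(e / total) for e in exps]

logits = [10.0, 11.0, 12.0]
naive = softmax_naive_fp16(logits)
stable = softmax_stable_fp16(logits)
print(naive)
print(stable)
```

The two functions are mathematically equivalent. Only the stable variant keeps every intermediate value inside the FP16 range, while the naive one produces Inf and then NaN.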
Catastrophic Cancellation
Catastrophic cancellation is the severe loss of significant digits that occurs when subtracting two nearly equal floating-point numbers. The result has high relative error, amplifying any small pre-existing errors in the operands.
- Primary Risk: Pervasive in statistical computations, variance calculations, and some activation functions. In mixed precision, the reduced significand (mantissa) bits of formats like FP16 exacerbate the problem.
- Example: Calculating variance as E[x^2] - (E[x])^2 can lead to catastrophic cancellation if the two terms are very close.
- Mitigation: Using numerically stable algorithms (e.g., Welford's algorithm for variance), performing sensitive subtraction operations in higher precision (FP32), and restructuring mathematical expressions to avoid direct subtraction of large, similar numbers.
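The variance example can be reproduced even in double precision when the data sit on a large offset. The sketch below compares the naive formula with Welford's single-pass update. The function names and the sample data are illustrative assumptions, and the effect would appear at much smaller offsets in FP16 or FP32.

```python
def variance_naive(values):
    """Population variance as E[x^2] - (E[x])^2; prone to cancellation."""
    n = len(values)
    total = 0.0
    total_sq = 0.0
    for x in values:
        total += x
        total_sq += x * x
    mean = total / n
    return total_sq / n - mean * mean

def variance_welford(values):
    """Population variance via Welford's running mean and sum of squares."""
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for count, x in enumerate(values, start=1):
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return m2 / len(values)

# Small spread (true population variance 22.5) on top of a large offset.
data = [1.0e9 + 4.0, 1.0e9 + 7.0, 1.0e9 + 13.0, 1.0e9 + 16.0]

print("naive:  ", variance_naive(data))
print("welford:", variance_welford(data))
```

Welford's update only ever subtracts values of similar magnitude to the spread itself, so it avoids forming the two huge, nearly equal terms that the naive formula subtracts.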
Excessive Rounding Error
Excessive rounding error is the accumulation of small inaccuracies introduced every time a real number is rounded to fit a finite-precision floating-point or integer format. In low-precision inference, each operation contributes more error.
- Primary Risk: Non-linear error propagation through deep networks, leading to drifted outputs and degraded accuracy. This is the fundamental source of quantization error.
- Mechanism: Rounding can be biased (e.g., truncation) or unbiased (e.g., round-to-nearest). The reduced mantissa bits in FP16/BF16 or the limited integer range in INT8 increase the magnitude of each rounding step.
- Mitigation: Using stochastic rounding during training (less common in inference), per-channel quantization for finer granularity, and quantization-aware training (QAT) to allow the model to learn robust representations that tolerate rounding.
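Accumulated rounding error is easiest to see in a long sum. The sketch below adds 4096 ones with an accumulator that is rounded to FP16 after every step, and compares it with a wider accumulator. FP16 is simulated with Python's `struct` module, and the helper names are assumptions made for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def sum_with_fp16_accumulator(values):
    """Round the running total to FP16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc

def sum_with_wide_accumulator(values):
    """Keep FP16 inputs but accumulate in the wider native float."""
    acc = 0.0
    for v in values:
        acc += to_fp16(v)
    return acc

ones = [1.0] * 4096
print(sum_with_fp16_accumulator(ones))   # stalls once the total reaches 2048
print(sum_with_wide_accumulator(ones))   # 4096.0
```

Above 2048 the spacing between adjacent FP16 values is 2, so adding 1 lands exactly halfway between two representable values and round-to-nearest-even returns the old total. Each individual rounding is small, but the sum as a whole is off by a factor of two.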
Ill-Conditioned Problems
An ill-conditioned problem is one where a small change in the input leads to a large change in the output. When solved with finite-precision arithmetic, these problems are highly sensitive to rounding errors, making them unstable.
- Primary Risk: Common in linear algebra operations fundamental to neural networks, such as matrix inversion or solving linear systems. Low precision dramatically amplifies this inherent sensitivity.
- Example: Computing the inverse of a matrix with a high condition number. The result in FP16 can be wildly inaccurate compared to the FP32 result.
- Mitigation: Preconditioning the problem (transforming it to a better-conditioned form), using double-precision (FP64) for critical, small linear algebra subroutines, and employing iterative refinement techniques that use higher precision to correct low-precision results.
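A two-by-two system is enough to show how low precision interacts with poor conditioning. In the sketch below, the matrix, the right-hand side, and the helper names are illustrative assumptions. The system is nearly singular, and rounding its entries to FP16 (simulated with Python's `struct` module) makes it exactly singular.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve a 2x2 linear system with Cramer's rule; None if singular."""
    det = a11 * a22 - a12 * a21
    if det == 0.0:
        return None
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - b1 * a21) / det)

# Nearly singular system whose exact solution is x = (1, 1).
matrix = (1.0, 1.0, 1.0, 1.0001)
rhs = (2.0, 2.0001)

solution_fp64 = solve_2x2(*matrix, *rhs)

# Rounding the inputs to FP16 turns 1.0001 into 1.0, so the determinant is 0.
matrix_fp16 = tuple(to_fp16(v) for v in matrix)
rhs_fp16 = tuple(to_fp16(v) for v in rhs)
solution_fp16 = solve_2x2(*matrix_fp16, *rhs_fp16)

print(solution_fp64)
print(solution_fp16)
```

The FP16 spacing near 1.0 is about 0.001, which is larger than the 0.0001 difference that distinguishes the two rows. The information that made the system solvable is lost before any arithmetic is performed.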
Non-Associativity of Floating-Point Arithmetic
Floating-point addition and multiplication are not associative because each operation rounds its result. The order of operations can therefore change the result: in general, (a + b) + c ≠ a + (b + c). This breaks the assumption, relied on when parallelizing reductions, that terms can be regrouped freely.
- Primary Risk: Causes non-determinism and reproducibility issues in parallel computing. Reductions (sums) over large tensors, which are common in dot products, batch normalization, and loss calculation, can produce different results depending on thread scheduling or hardware.
- Impact in Mixed Precision: The effect is more pronounced in lower precision formats where rounding is more aggressive.
- Mitigation: Using deterministic algorithms for reductions (often at a performance cost), employing higher-precision accumulators for critical sums (e.g., using FP32 to accumulate FP16 products), and implementing reproducible reduction patterns.
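The effect of ordering can be shown in a few lines. The sketch below first regroups a three-term sum in ordinary double precision, then sums the same list in two different orders with an accumulator rounded to FP16 (simulated with Python's `struct` module). The helper names are assumptions made for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def reduce_fp16(values):
    """Left-to-right sum with the running total rounded to FP16."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + v)
    return acc

# Regrouping changes the result even in double precision.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right, left, right)

# In FP16 the order of a reduction matters much more.
values = [2048.0, 1.0, 1.0, 1.0, 1.0]
large_first = reduce_fp16(values)
small_first = reduce_fp16(list(reversed(values)))
print(large_first, small_first)
```

When the large term comes first, each later 1.0 is absorbed by rounding. When the small terms are combined first, their sum is large enough to survive. A parallel reduction that changes its grouping between runs can land on either result.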
How Numerical Stability is Maintained
Numerical stability in mixed precision computing refers to the avoidance of problematic conditions like underflow, overflow, or excessive rounding error that can degrade or invalidate model outputs when using reduced precision formats.
Numerical stability is maintained through a combination of algorithmic techniques and hardware-aware software design. Core methods include loss scaling to prevent gradient underflow during training, careful selection of formats like BFloat16 (BF16) that preserve exponent range, and the use of master weights in FP32 during optimization. Software frameworks implement automatic mixed precision (AMP) to manage precision casting and scaling dynamically, preventing catastrophic value collapse in sensitive operations like softmax or layer normalization.
Stability is further supported by quantization-aware training (QAT), which simulates precision loss during learning, and by selectively casting critical operations to higher precision. Dynamic range analysis during calibration for static quantization limits clipping of outlier activations. At the system level, operator fusion (which can avoid rounding intermediate results between fused operations) and hardware-supported symmetric quantization for integer math reduce cumulative rounding error and help keep outputs consistent despite the reduced bit-width.
Numerical Format Comparison & Stability Profile
A comparison of common numerical formats used in mixed precision inference, highlighting their bit allocation, dynamic range, and primary stability risks.
| Feature / Metric | FP32 (Full) | BF16 / FP16 (Half) | INT8 (Quantized) |
|---|---|---|---|
| Total Bits | 32 | 16 | 8 |
| Exponent Bits | 8 | 8 (BF16) / 5 (FP16) | N/A |
| Mantissa (Significand) Bits | 23 | 7 (BF16) / 10 (FP16) | N/A |
| Dynamic Range (approx. base 10, normalized values) | 1e-38 to 3e38 | BF16: ~1e-38 to 3e38; FP16: ~6e-5 to 6.5e4 | Determined by Scale Factor |
| Primary Stability Risk | Minimal. High precision reduces rounding error. | BF16: Low risk of overflow, higher rounding error. FP16: High risk of underflow/overflow (gradient vanishing/exploding). | High risk of clipping and accumulated quantization error distorting outputs. |
| Typical Use Case | Baseline training & high-precision reference. | BF16: Training & inference on modern accelerators (e.g., TPUs, Ampere+ GPUs). FP16: Inference where dynamic range is managed. | Production inference on integer-optimized hardware (CPU/GPU/Edge TPU). |
| Memory Footprint (vs. FP32) | 1x (Baseline) | 0.5x | 0.25x |
| Hardware Throughput (Relative) | 1x | 2-8x on supported Tensor/Matrix Cores | 2-4x on integer ALUs |
| Requires Calibration | No | No | Yes |
| Common Stability Mitigation | N/A | Loss scaling (FP16), Master weights in FP32 (BF16/FP16 training). | Per-channel quantization, fine-tuning (QAT), careful calibration dataset selection. |
Frequently Asked Questions
Numerical stability is a foundational concern in mixed precision computing, ensuring that reduced precision arithmetic does not lead to catastrophic failures like underflow, overflow, or excessive rounding error that invalidate model outputs.
Numerical stability refers to the property of an algorithm where small changes or errors in input data or intermediate computations do not cause disproportionately large or catastrophic deviations in the final output. In the context of mixed precision inference, it specifically concerns the avoidance of problematic conditions—such as underflow, overflow, catastrophic cancellation, and excessive rounding error—when using reduced precision formats like FP16 or INT8. An unstable computation can cause a model to produce NaN (Not a Number) or inf (infinity) values, or simply degrade accuracy to unusable levels, even if the high-precision version of the model is perfectly valid. Ensuring stability is a core engineering challenge when deploying models with quantization and mixed precision to achieve latency and cost benefits.
Related Terms
Numerical stability is a foundational concern in mixed precision computing, interacting with several key techniques and hardware features. These related concepts define the ecosystem for managing precision, error, and performance.
Quantization Error
Quantization error is the numerical discrepancy introduced when converting a value from a higher to a lower precision format. It is the primary source of accuracy degradation in reduced-precision inference.
- Sources: Rounding error and clipping error when values fall outside the representable range of the quantized format.
- Impact: Errors can accumulate through successive layers, potentially leading to significant output divergence. Managing this error is the core challenge of quantization techniques like QAT and per-channel quantization.
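The two error sources can be separated in a small symmetric INT8 round trip. The sketch below assumes a single per-tensor scale derived from a calibration maximum of 1.0. The function names, the calibration value, and the sample inputs are all illustrative assumptions.

```python
def quantize_int8(values, scale):
    """Symmetric INT8 quantization: round to nearest, then clip to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Map integer codes back to real values."""
    return [q * scale for q in q_values]

calibration_max = 1.0            # assumed largest magnitude seen in calibration
scale = calibration_max / 127    # real-valued width of one integer step

values = [0.3, -0.25, 0.003, 3.0]   # the last entry is an outlier
codes = quantize_int8(values, scale)
restored = dequantize(codes, scale)
errors = [abs(v - r) for v, r in zip(values, restored)]

for v, q, e in zip(values, codes, errors):
    print(f"value={v:+.4f}  code={q:+4d}  abs_error={e:.5f}")
```

For in-range values the error is pure rounding error and stays within half a step (`scale / 2`). The outlier is clipped to the largest code, so its error is set by how far it lies outside the calibrated range rather than by the step size.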
Loss Scaling
Loss scaling is a critical technique for maintaining numerical stability during mixed precision training. It prevents gradient underflow in FP16.
- Mechanism: The loss value is multiplied by a scale factor (e.g., 1024) before backpropagation. This shifts tiny gradient values into a representable range for FP16.
- Process: Gradients are unscaled before the optimizer updates the weights. Frameworks like PyTorch AMP automate dynamic loss scaling by monitoring gradients for overflow.
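The mechanism can be simulated without a training loop. Multiplying the loss by a constant multiplies every gradient by the same constant, so the sketch below scales the gradients directly. FP16 is emulated with Python's `struct` module. The gradient values and helper name are assumptions made for illustration, while the scale factor of 1024 is taken from the example above.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

LOSS_SCALE = 1024.0
true_grads = [1.0e-8, 2.0e-8]    # too small for FP16 to represent

# Without scaling, the gradients are stored in FP16 and vanish.
unscaled_fp16 = [to_fp16(g) for g in true_grads]

# With scaling, the stored values are 1024 times larger and survive.
scaled_fp16 = [to_fp16(g * LOSS_SCALE) for g in true_grads]

# Unscaling happens in higher precision before the optimizer step.
recovered = [g / LOSS_SCALE for g in scaled_fp16]

print(unscaled_fp16)
print(recovered)
```

The recovered gradients are not exact, because the scaled values are still rounded to FP16, but they keep their sign and approximate magnitude instead of collapsing to zero.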
Underflow & Overflow
Underflow and overflow are catastrophic numerical instability events caused by exceeding the dynamic range of a floating-point format.
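- Underflow: Occurs when a computation produces a non-zero result smaller than the smallest positive normalized number the format can represent (~6.10e-5 for FP16). With gradual underflow, such values are first stored as subnormals with reduced precision (down to ~5.96e-8 for FP16). Anything smaller, or any subnormal on hardware that flushes subnormals to zero, becomes zero, destroying gradient information.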
- Overflow: Occurs when a result exceeds the format's maximum finite value (e.g., 65504 for FP16), resulting in positive or negative infinity, which propagates and corrupts all subsequent computations.
BFloat16 (BF16)
BFloat16 (BF16) is a 16-bit floating-point format explicitly designed for numerical stability in deep learning. It addresses the limited range of standard FP16.
- Key Feature: Uses an 8-bit exponent (same as FP32) but a truncated 7-bit mantissa. This preserves the dynamic range of FP32, drastically reducing the risk of overflow/underflow compared to FP16.
- Trade-off: The reduced mantissa precision increases rounding error, but neural networks are generally more tolerant to this than to range limitations. It is widely supported on modern AI accelerators (TPUs, NVIDIA Ampere+ GPUs).
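The range-versus-precision trade-off can be illustrated by emulating both 16-bit formats. The sketch below assumes BF16 can be modelled as an FP32 value rounded to its upper 16 bits with round-to-nearest-even, and uses Python's `struct` module for FP16. Both helpers are hypothetical names, and NaN handling is deliberately omitted.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest FP16 value; out-of-range inputs become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def to_bf16(x: float) -> float:
    """Emulate BF16 by rounding FP32 bits to the upper 16 (NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that will be dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

for value in (1.0e-9, 1.0e5, 1.001):
    print(f"{value:>8g}  fp16={to_fp16(value)!r:<22} bf16={to_bf16(value)!r}")
```

The small and large values survive only in BF16, which shows its wider range. The value 1.001 is resolved only by FP16, which shows its extra mantissa bits.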
Automatic Mixed Precision (AMP)
Automatic Mixed Precision (AMP) is a software-level automation that manages numerical stability while enabling performance gains. It abstracts the complexity of manual precision casting and loss scaling.
- Function: It automatically selects operations to run in FP16/BF16 (for speed) and FP32 (for stability), particularly for reduction operations sensitive to precision.
- Stability Management: Integrates dynamic loss scaling and master weight maintenance in FP32 to ensure stable training. Tools like PyTorch's torch.cuda.amp and TensorFlow's tf.keras.mixed_precision implement AMP.
Numerical Conditioning
Numerical conditioning refers to the sensitivity of a mathematical problem (or a neural network layer) to small perturbations in its input. Poor conditioning exacerbates quantization error.
- In Practice: Operations like matrix inversion, or layers whose weights span a very wide range of magnitudes, can be ill-conditioned. Quantizing them can lead to disproportionately large output errors.
- Mitigation: Techniques like per-channel quantization, quantization-aware training (QAT), and layer normalization help improve the conditioning of a network for lower precision execution.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.