Mixed-Precision Training

Mixed-precision training is a neural network training technique that uses lower-precision data types for most operations to accelerate computation and reduce memory usage, while maintaining higher precision for critical operations to preserve numerical stability and model accuracy.

MODEL COMPRESSION TECHNIQUES

What is Mixed-Precision Training?

Mixed-precision training is a computational optimization technique that uses multiple numerical precisions during neural network training to accelerate computation and reduce memory usage.

Mixed-precision training uses lower-precision data types such as 16-bit floating point (FP16 or BFloat16) for most tensor operations during neural network training, which speeds up computation and reduces memory usage. Critical steps, such as weight updates, master weight storage, and numerically sensitive reductions, stay in 32-bit precision (FP32) to preserve numerical stability and final model accuracy. This hybrid approach, often automated by framework features like NVIDIA's Automatic Mixed Precision, allows modern hardware like GPUs and TPUs to perform more operations per second and fit larger models or batch sizes into available memory.

For FP16, the technique's core mechanism is loss scaling, which prevents gradient underflow: the loss is multiplied by a scale factor before the backward pass so that small gradient values remain representable, and the gradients are divided by the same factor before the weight update. For TinyML deployment, the same precision-management ideas carry over to post-training quantization and quantization-aware training, where models are trained with simulated quantization to produce weights optimized for efficient INT8 inference on microcontrollers. This bridges the gap between training-time efficiency and the extreme precision reduction required for execution on memory-constrained edge devices.

MIXED-PRECISION TRAINING

Key Components & Data Types

Mixed-precision training accelerates neural network training by strategically using lower-precision numerical formats for most operations while maintaining higher precision for critical steps to ensure stability.

FP32 (Single Precision)

The standard 32-bit floating-point format, used as the baseline for neural network training. It provides a wide dynamic range and high numerical precision, crucial for maintaining stable gradient updates and accumulating small weight changes.

Primary Role: Master copy of weights, loss scaling, and critical accumulation operations.
Bit Layout: 1 sign bit, 8 exponent bits, 23 fraction bits.
Dynamic Range: Approximately ±1.18e-38 to ±3.4e38.
Key Use Case: Storing the master weights that accumulate the gradient updates, so that small updates are not lost to FP16 rounding.

FP16 (Half Precision)

A 16-bit floating-point format that halves the memory footprint and bandwidth requirements compared to FP32, enabling faster computation on modern hardware like NVIDIA Tensor Cores.

Primary Role: Forward pass, backward pass, and storing activations.
Bit Layout: 1 sign bit, 5 exponent bits, 10 fraction bits.
Dynamic Range: Approximately ±5.96e-8 to ±65504.
Key Challenge: Limited range can cause gradient underflow, where small gradient values become zero, halting learning.
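The range limits above can be observed directly. The following is a minimal sketch that uses Python's standard struct module (format code "e" is IEEE 754 half precision) to round values to FP16; the helper name to_fp16 and the sample values are illustrative assumptions, not part of any framework.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range; model that as inf.
        return float("inf") if x > 0 else float("-inf")


tiny_gradient = 1e-8  # below FP16's smallest subnormal (~5.96e-8)
large_value = 1e5     # above FP16's largest finite value (65504)

print(to_fp16(tiny_gradient))  # underflows to 0.0
print(to_fp16(large_value))    # overflows to inf
print(to_fp16(0.1))            # close to 0.1, but not exact
```

A gradient that rounds to zero contributes nothing to the weight update, which is the underflow problem that loss scaling addresses.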

BFloat16 (Brain Float)

A 16-bit format designed by Google that uses the same 8-bit exponent as FP32 but truncates the mantissa to 7 bits. This preserves the dynamic range of FP32, making it more robust for deep learning than FP16.

Primary Role: Preferred alternative to FP16 in many modern frameworks (e.g., PyTorch, TensorFlow).
Bit Layout: 1 sign bit, 8 exponent bits, 7 fraction bits.
Dynamic Range: Matches FP32 (±~1e-38 to ±~3e38).
Advantage: Much lower risk of overflow/underflow compared to FP16, often eliminating the need for loss scaling.
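The range-versus-precision trade-off between the two 16-bit formats can be sketched in plain Python. The helper to_bf16 below is a hypothetical emulation that keeps only the top 16 bits of the FP32 encoding; it truncates for simplicity, whereas real hardware typically rounds to nearest.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value (x must be within FP16 range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Emulate BFloat16 by zeroing the low 16 bits of the FP32 encoding.

    Simplification: this truncates instead of rounding to nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


tiny = 1e-8                  # small gradient-like value
fine_step = 1.0 + 2.0 ** -9  # needs 9 fraction bits to represent

# Range: BFloat16 keeps the tiny value, FP16 flushes it to zero.
print(to_bf16(tiny), to_fp16(tiny))

# Precision: FP16 (10 fraction bits) keeps the step, BFloat16 (7 bits) loses it.
print(to_fp16(fine_step), to_bf16(fine_step))
```

This is why BFloat16 often works without loss scaling but can be less precise than FP16 for values that are already well within range.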

Loss Scaling

A critical technique to prevent gradient underflow in FP16 training. The loss value is multiplied by a large scaling factor (e.g., 128, 1024) before backpropagation, shifting the tiny FP16 gradients into a representable range.

Process: Scaled Loss = Loss * Scale Factor. Gradients are automatically scaled by the same factor via the chain rule.
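Weight Update: Gradients are un-scaled (divided by the scale factor) in FP32 before they are used to update the master FP32 weights.
Dynamic Scaling: Implementations such as NVIDIA's APEX AMP adjust the scale factor automatically, lowering it (and skipping the step) when gradients overflow to inf/NaN and raising it again after a run of overflow-free steps.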
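As a small numeric illustration of static loss scaling, the sketch below rounds a gradient to FP16 with and without a scale factor of 1024. The helper to_fp16 and the gradient value are assumptions chosen for the demonstration; Python floats stand in for the FP32 arithmetic used when unscaling.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value (x must be within FP16 range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


true_grad = 1e-8     # gradient magnitude too small for FP16
loss_scale = 1024.0  # static scale factor, as in the examples above

# Without scaling, the FP16 gradient underflows to zero.
unscaled_fp16 = to_fp16(true_grad)

# Scaling the loss scales the gradient by the same factor (chain rule),
# which moves it into FP16's representable range.
scaled_fp16 = to_fp16(true_grad * loss_scale)

# Unscaling happens in higher precision, before the weight update.
recovered = scaled_fp16 / loss_scale

print(unscaled_fp16)  # 0.0
print(recovered)      # close to 1e-8
```

The recovered value is not exact, because the scaled gradient still passes through FP16 rounding, but it is no longer zero.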

Master Weights

A full-precision (FP32) copy of the model's parameters maintained during mixed-precision training. The FP16 weights used for computation are a cast-down copy of these master weights.

Purpose: Provides a high-precision accumulator for weight updates. Small gradient updates, which may be lost in FP16, are preserved in FP32.
Update Cycle:
1. Forward/backward passes use FP16 weights.
2. Gradients are calculated in FP16 (and scaled).
3. Gradients are used to update the FP32 master weights.
4. Updated master weights are cast back to FP16 for the next iteration.
Benefit: Limits the long-term loss of information that repeated low-precision rounding would otherwise cause.
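The update cycle can be illustrated with a single weight. In this sketch, to_fp16 and to_fp32 are hypothetical helpers that round Python floats to each format, and the fixed update of 1e-4 per step is an assumed value standing in for learning rate times gradient.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


update = 1e-4  # assumed learning_rate * gradient, smaller than FP16 spacing near 1.0
steps = 100

# FP16-only training: each update is rounded away, so the weight never moves.
w_fp16_only = to_fp16(1.0)
for _ in range(steps):
    w_fp16_only = to_fp16(w_fp16_only - update)

# Master-weight training: updates accumulate in FP32, then get cast down.
master_w = to_fp32(1.0)
for _ in range(steps):
    master_w = to_fp32(master_w - update)
w_fp16_copy = to_fp16(master_w)  # working copy for the next forward pass

print(w_fp16_only)  # still 1.0
print(master_w)     # about 0.99
print(w_fp16_copy)  # nearest FP16 value to the master weight
```

Near 1.0, adjacent FP16 values are roughly 5e-4 to 1e-3 apart, so an update of 1e-4 is below half a step and disappears unless it is accumulated at higher precision.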

Tensor Cores & Hardware

Specialized processing units on modern GPUs (e.g., NVIDIA's Volta architecture and later) that perform matrix operations much faster in mixed precision. They are the primary hardware driver for the speedup of mixed-precision training.

Operation: Perform matrix multiply-accumulate operations in the form D = A * B + C, where A and B are FP16 matrices, and C and D can be FP16 or FP32.
Speedup: Peak theoretical matrix throughput is roughly 8x that of FP32 on standard CUDA cores for Volta-generation GPUs and higher on later generations; realized end-to-end training speedups are usually smaller.
Software Integration: Accessed via frameworks like PyTorch (torch.cuda.amp) and TensorFlow, which automatically cast operations to use Tensor Cores where possible.
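Why the accumulator precision in D = A * B + C matters can be shown with a software emulation. This is only a sketch of the numerics, not of how Tensor Cores are programmed; the helpers to_fp16 and to_fp32 and the all-ones vectors are assumptions chosen to make the effect easy to see.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


n = 4096
a = [to_fp16(1.0)] * n  # FP16 inputs
b = [to_fp16(1.0)] * n

acc_fp16 = 0.0  # accumulator rounded to FP16 after every add
acc_fp32 = 0.0  # accumulator rounded to FP32 after every add
for a_i, b_i in zip(a, b):
    product = a_i * b_i  # exact here, since both inputs are 1.0
    acc_fp16 = to_fp16(acc_fp16 + product)
    acc_fp32 = to_fp32(acc_fp32 + product)

print(acc_fp16)  # stalls at 2048.0: FP16 cannot represent 2049
print(acc_fp32)  # 4096.0
```

With an FP16 accumulator the sum stops growing once the running total is large relative to each addend, which is why accumulating FP16 products in FP32 is the common choice for training.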

TECHNIQUE

How Mixed-Precision Training Works

Mixed-precision training is a computational strategy that accelerates neural network training and reduces memory consumption by using multiple numerical precisions.

Mixed-precision training uses lower-precision data types, primarily 16-bit floating point (FP16 or BFloat16), for most tensor operations during neural network training to gain speed and memory efficiency. It maintains numerical stability by keeping a master copy of weights in 32-bit floating point (FP32) and, for FP16, using loss scaling to prevent gradient underflow. On GPUs with tensor cores, this can substantially increase training throughput and roughly halve the memory used by activations compared to standard FP32 training; the FP32 master weights and optimizer state limit the overall memory savings.

The technique follows a defined workflow. The forward pass runs with FP16 weights and activations, and the loss is multiplied by a scale factor before the backward pass, so the resulting FP16 gradients carry the same factor and small values stay representable. The gradients are then unscaled and used to update the FP32 master weights, which are cast back to FP16 for the next iteration. This management of precision prevents the gradient underflow that FP16 arithmetic is prone to while maximizing hardware utilization, making it foundational for training large models.
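The workflow can be traced end to end on a one-parameter model. The sketch below assumes a toy model pred = w * x with squared-error loss, emulates FP16 and FP32 rounding with hypothetical helpers, and uses assumed values for the input, learning rate, and loss scale; it is an illustration of the steps, not a framework implementation.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def train_step(master_w, x, y, lr, loss_scale):
    """One mixed-precision step for pred = w * x, loss = 0.5 * (pred - y)**2."""
    # 1. Cast the FP32 master weight down and run the forward pass in FP16.
    w16, x16 = to_fp16(master_w), to_fp16(x)
    err16 = to_fp16(to_fp16(w16 * x16) - to_fp16(y))
    # 2. Backward pass in FP16 on the scaled loss: d(scale * loss)/dw.
    dpred16 = to_fp16(loss_scale * err16)
    grad16 = to_fp16(dpred16 * x16)
    # 3. Unscale in FP32, then update the FP32 master weight.
    grad32 = to_fp32(to_fp32(grad16) / loss_scale)
    new_master = to_fp32(master_w - lr * grad32)
    return new_master, grad16


# Tiny input, so the true gradient (about 1e-8) is below FP16's range.
w_plain, g_plain = train_step(1.0, x=1e-4, y=0.0, lr=1000.0, loss_scale=1.0)
w_scaled, g_scaled = train_step(1.0, x=1e-4, y=0.0, lr=1000.0, loss_scale=1024.0)

print(g_plain, w_plain)    # gradient underflows, weight unchanged
print(g_scaled, w_scaled)  # gradient survives, weight moves by about 1e-5
```

Without scaling, the step silently does nothing; with scaling, the same FP16 arithmetic produces a usable update.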

NUMERICAL REPRESENTATION

Precision Formats: FP32 vs. FP16 vs. BFloat16

Comparison of floating-point formats used in mixed-precision training for neural networks, detailing their bit-level structure, numerical range, and suitability for training and inference.

Feature	FP32 (Single Precision)	FP16 (Half Precision)	BFloat16 (Brain Float)
Total Bits	32 bits	16 bits	16 bits
Sign Bits	1 bit	1 bit	1 bit
Exponent Bits	8 bits	5 bits	8 bits
Mantissa (Significand) Bits	23 bits	10 bits	7 bits
Dynamic Range (approx.)	1.2e-38 to 3.4e+38	5.96e-8 to 6.55e+4	1.2e-38 to 3.4e+38
Smallest Positive Normalized	1.175494e-38	6.103516e-5	1.175494e-38
Primary Use Case	Baseline training, critical ops	Mixed-precision training, inference	Mixed-precision training, inference
Memory Footprint (vs. FP32)	100% (Baseline)	50%	50%
Exponent Range Match to FP32	N/A (Baseline)	No	Yes
Gradient Underflow Risk	Very Low	High (requires loss scaling)	Low (similar to FP32)
Hardware Support (Modern GPUs/TPUs)	Yes	Yes (NVIDIA Volta and later)	Yes (TPUs, NVIDIA A100 and later)
Typical Inference Target	Server/Cloud	Edge/Embedded	Edge/Embedded, Cloud
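The range rows in the table follow from the exponent and mantissa bit counts. As a rough check, the sketch below derives them with the standard IEEE 754-style formulas; the function name format_stats and the dictionary layout are illustrative choices.

```python
def format_stats(exp_bits: int, frac_bits: int) -> dict:
    """Derive range limits of a binary floating-point format from its bit counts."""
    bias = 2 ** (exp_bits - 1) - 1
    return {
        # Largest finite value: all-ones fraction at the largest normal exponent.
        "max_finite": (2.0 - 2.0 ** -frac_bits) * 2.0 ** bias,
        # Smallest positive normalized value.
        "min_normal": 2.0 ** (1 - bias),
        # Smallest positive subnormal value.
        "min_subnormal": 2.0 ** (1 - bias - frac_bits),
    }


formats = {"FP32": (8, 23), "FP16": (5, 10), "BFloat16": (8, 7)}
for name, (exp_bits, frac_bits) in formats.items():
    stats = format_stats(exp_bits, frac_bits)
    print(name, {key: f"{value:.6e}" for key, value in stats.items()})
```

Note that the FP16 lower bound of 5.96e-8 quoted in the table is the smallest subnormal, while the FP32 and BFloat16 lower bounds are the smallest normalized values.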

MIXED-PRECISION TRAINING

Framework Implementation & Tooling

Mixed-precision training is implemented via specialized software libraries and hardware support to accelerate neural network training while managing numerical stability. This section details the key frameworks, data types, and optimization techniques that enable this critical performance enhancement.

Automatic Mixed Precision (AMP)

Automatic Mixed Precision (AMP) is a feature in deep learning frameworks like PyTorch and TensorFlow that automates the use of mixed-precision training. It handles the casting of tensors between precision formats (e.g., FP32 and FP16) and the scaling of loss gradients to prevent underflow.

PyTorch AMP: Uses torch.cuda.amp with GradScaler to manage gradient scaling and automatic casting.
TensorFlow AMP: Enabled by setting a global tf.keras.mixed_precision.Policy (for example via tf.keras.mixed_precision.set_global_policy('mixed_float16')), which controls the compute and variable dtypes of layers.
Core Function: Automatically selects which operations run in FP16 for speed and which remain in FP32 for stability, such as softmax or layer normalization.

BFloat16 & FP16 Data Types

The choice of lower-precision format is central to mixed-precision training. FP16 (float16) and BFloat16 (Brain Floating Point) are the two primary 16-bit formats.

FP16: Uses a 5-bit exponent and 10-bit mantissa. Offers a smaller dynamic range (~6e-8 to 65504, including subnormals), making it susceptible to underflow/overflow. Requires careful gradient scaling.
BFloat16: Uses an 8-bit exponent (matching FP32) and a 7-bit mantissa. Preserves the dynamic range of FP32, making it more robust for training without extensive scaling. It is the preferred format on modern AI accelerators like Google TPUs and NVIDIA A100+ GPUs.
Hardware Support: Modern GPUs and NPUs provide dedicated tensor cores that perform matrix operations much faster in these lower precisions: FP16 on NVIDIA's Volta architecture and later, BFloat16 on A100 and later.

Loss Scaling

Loss scaling is needed in practice when training with FP16 to prevent gradient underflow. Small-magnitude gradients can fall below the minimum representable value of FP16 (≈5.96e-8), becoming zero and halting learning for the affected weights.

Mechanism: The loss value is multiplied by a scaling factor (e.g., 1024) before backpropagation. This shifts gradients into the FP16 representable range.
Backward Pass: The scaled gradients flow backward through the network.
Optimizer Step: Before the optimizer updates the weights, the gradients are unscaled (divided by the same factor) to correct the magnitude.
Dynamic Scaling: AMP implementations adjust the scale factor throughout training, reducing it when gradients overflow and increasing it again after a period of stable steps.
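The dynamic scaling logic can be sketched as a small state machine. The class below is a hypothetical, framework-independent illustration; its name, method, and parameters are assumptions, with defaults chosen to resemble commonly used values (halve on overflow, double after a fixed number of clean steps).

```python
import math


class DynamicLossScaler:
    """Illustrative dynamic loss scaler (not a real framework class)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, grads) -> bool:
        """Inspect scaled gradients; return True if the optimizer step should run."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow: shrink the scale and skip this step.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # A long clean run suggests there is headroom to scale up.
            self.scale *= self.growth_factor
            self.good_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
print(scaler.update([0.5, float("inf")]), scaler.scale)  # skipped, scale halves
print(scaler.update([0.5, 0.25]), scaler.scale)          # applied, scale unchanged
print(scaler.update([0.5, 0.25]), scaler.scale)          # applied, scale doubles
```

Skipping the step on overflow keeps inf/NaN values out of the master weights, at the cost of occasionally wasting an iteration.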

Framework Integration & APIs

Mixed-precision training is deeply integrated into major ML frameworks, providing both automatic and manual control APIs.

PyTorch: The torch.autocast context manager and GradScaler are the core APIs. autocast regions automatically cast operations to FP16 where safe.
TensorFlow: The mixed_precision API allows setting a global policy ('mixed_float16') and offers LossScaleOptimizer to wrap standard optimizers.
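JAX: Has no built-in autocast context; precision is usually controlled through explicit dtype arguments (for example the dtype and param_dtype arguments of Flax modules), with jax.config.update('jax_default_matmul_precision', 'bfloat16') setting the default matmul precision. jax.experimental.enable_x64 governs 64-bit types rather than mixed precision. It is often used in conjunction with libraries like Flax or Haiku.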
Manual Casting: Advanced users can manually cast specific model sections (e.g., embedding tables) to higher precision while keeping others in lower precision.
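For PyTorch, the autocast and GradScaler APIs named above combine roughly as follows. This is a minimal sketch that assumes PyTorch is installed; the model, data, and hyperparameters are placeholders, and it falls back to BFloat16 autocast without loss scaling when no GPU is available. Newer PyTorch releases also expose the scaler as torch.amp.GradScaler and may emit a deprecation warning for the torch.cuda.amp spelling used here.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption: FP16 with loss scaling on GPU, BFloat16 without scaling on CPU.
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
use_scaler = device == "cuda"

torch.manual_seed(0)
model = nn.Linear(16, 4).to(device)  # parameters stay in FP32
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_scaler)

inputs = torch.randn(8, 16, device=device)
targets = torch.randn(8, 4, device=device)

for _ in range(3):
    optimizer.zero_grad()
    with torch.autocast(device_type=device, dtype=amp_dtype):
        outputs = model(inputs)  # the linear layer runs in amp_dtype
        # Compute the loss in FP32 for numerical safety.
        loss = nn.functional.mse_loss(outputs.float(), targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, skips the step on inf/NaN
    scaler.update()                # adjusts the scale factor

print(outputs.dtype, model.weight.dtype, loss.item())
```

When the scaler is disabled, scale, step, and update reduce to a plain backward pass and optimizer step, so the same loop works for both configurations.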

Master Weights & Optimizer State

To maintain accuracy, mixed-precision training often keeps a copy of weights in full FP32 precision, known as master weights.

Purpose: All weight updates are performed in FP32 to preserve small gradient contributions, then the FP32 master weight is copied down to FP16/BFloat16 for the forward pass.
Optimizer State: Optimizer momentum and variance terms (e.g., for Adam) are also typically stored in FP32, doubling the memory required for the optimizer state compared to pure FP16 training.
Memory Trade-off: While activations and gradients use lower precision, the master weights and optimizer state can become the memory bottleneck for very large models, leading to techniques like ZeRO (Zero Redundancy Optimizer).
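The memory trade-off can be made concrete with per-parameter arithmetic. The sketch below assumes an Adam-style optimizer and the layout described in this section (FP16 working weights and gradients plus FP32 master weights and optimizer state); real frameworks differ in what they keep, and activation memory is deliberately excluded.

```python
BYTES = {"fp16": 2, "fp32": 4}


def model_state_bytes_per_param(mixed_precision: bool) -> int:
    """Bytes of weights, gradients, and Adam state per parameter (no activations)."""
    if mixed_precision:
        return (
            BYTES["fp16"]        # FP16 working weights
            + BYTES["fp16"]      # FP16 gradients
            + BYTES["fp32"]      # FP32 master weights
            + 2 * BYTES["fp32"]  # Adam momentum and variance in FP32
        )
    # Pure FP32: weights, gradients, momentum, variance.
    return 4 * BYTES["fp32"]


num_params = 1_000_000_000  # assumed model size: one billion parameters
for mixed in (False, True):
    total_gb = model_state_bytes_per_param(mixed) * num_params / 1e9
    label = "mixed precision" if mixed else "pure FP32"
    print(f"{label}: {total_gb:.0f} GB of model state")
```

Under these assumptions both setups need 16 bytes per parameter for model state, so the memory savings of mixed precision come mainly from activations, and the model state is what sharding techniques such as ZeRO target.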

Compiler & Hardware-Level Support

Ultimate performance gains are realized through compiler optimizations and dedicated hardware.

NVIDIA Tensor Cores: Specialized units in GPUs that perform matrix multiply-accumulate operations on FP16/BFloat16 inputs, delivering peak theoretical throughput roughly 8x (Volta) to 16x (A100) that of FP32 on standard CUDA cores, depending on GPU generation.
Compiler Passes: Frameworks like XLA (Accelerated Linear Algebra) for TensorFlow/JAX and Torch-TensorRT fuse operations and generate hardware-specific kernels that maximize the use of tensor cores.
Kernel Fusion: Compilers fuse consecutive operations (e.g., convolution, bias add, ReLU) into a single kernel, reducing memory transfers and keeping data in fast, lower-precision registers.
CPU/ARM Support: Libraries like oneDNN enable mixed-precision inference and training on CPUs using AVX-512 and ARM's SVE instructions.

MIXED-PRECISION TRAINING

Frequently Asked Questions

Mixed-precision training is a critical technique for training large neural networks efficiently. These questions address its core mechanisms, benefits, and practical implementation details for developers and engineers.

Mixed-precision training is a computational technique that uses lower-precision numerical formats (like FP16 or BFloat16) for most operations during neural network training to accelerate computation and reduce memory usage, while maintaining certain critical operations in higher precision (like FP32) to preserve model stability and final accuracy. It works by performing forward and backward passes with 16-bit tensors, which halves the memory bandwidth those tensors need and can leverage specialized hardware like NVIDIA Tensor Cores for faster matrix multiplications. A master copy of the weights is kept in FP32 to accumulate small gradient updates accurately. After each backward pass, gradients are used to update the FP32 master weights, which are then copied back (cast down) to 16-bit for the next iteration. With FP16, a loss scaling step is also applied to prevent gradient values from underflowing to zero; BFloat16 usually does not need it.

PRECISION & COMPRESSION

Related Terms

Mixed-precision training intersects with several core techniques for optimizing neural networks. These related concepts focus on managing numerical formats, computational efficiency, and memory usage throughout the model lifecycle.

Quantization

Quantization is a model compression technique that reduces the numerical precision of a neural network's weights and activations. It converts parameters from high-precision floating-point formats (like 32-bit FP32) to lower-precision integers (like 8-bit INT8) or floats (like FP16). This shrinks model size, reduces memory bandwidth, and can accelerate inference, especially on hardware with optimized integer units. It is a critical downstream step for deploying models trained with mixed precision.

BFloat16

BFloat16 (Brain Floating Point 16) is a 16-bit numeric format designed for machine learning. Its key feature is preserving the 8-bit exponent from FP32, matching its dynamic range, while truncating the mantissa to 7 bits. This makes it highly suitable for mixed-precision training, as gradients are far less exposed to the underflow and overflow common with standard FP16. It is natively supported on modern AI accelerators (e.g., TPUs, NVIDIA A100+ GPUs) for efficient computation.

Quantization-Aware Training (QAT)

Quantization-Aware Training is a technique where quantization error is simulated during the training process. The forward pass uses fake-quantized weights and activations (often in INT8), but backward passes and weight updates occur in full precision (FP32). This allows the model to adapt its parameters to the expected precision loss, resulting in higher accuracy when the model is later converted for integer-only inference. It bridges the gap between mixed-precision training and highly efficient low-precision deployment.
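The fake-quantization step can be sketched in a few lines. The example below assumes symmetric per-tensor INT8 quantization with a scale derived from the largest weight magnitude, and a straight-through estimator for the backward pass; the function names and sample weights are hypothetical, and real QAT toolkits offer more elaborate schemes.

```python
def fake_quantize(x: float, scale: float, qmin: int = -127, qmax: int = 127) -> float:
    """Quantize x to an integer grid and map it straight back to a float."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale


def ste_grad(x: float, upstream: float, scale: float,
             qmin: int = -127, qmax: int = 127) -> float:
    """Straight-through estimator: pass the gradient through unless x was clipped."""
    return upstream if qmin * scale <= x <= qmax * scale else 0.0


weights = [0.5, -1.27, 0.003, 0.8149]
scale = max(abs(w) for w in weights) / 127  # symmetric per-tensor scale

quantized = [fake_quantize(w, scale) for w in weights]
print(quantized)  # each value snapped to a multiple of the scale

# The backward pass treats rounding as the identity function.
print(ste_grad(0.5, upstream=0.1, scale=scale))  # gradient passes through
print(ste_grad(5.0, upstream=0.1, scale=scale))  # clipped value gets zero gradient
```

Because the forward pass sees the rounded values while the full-precision weights keep receiving gradients, the model can learn parameters that tolerate the INT8 grid.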

Model Compression

Model compression is an umbrella term for techniques that reduce a neural network's computational footprint and memory requirements. Key methods include:

Quantization: Reducing numerical precision.
Pruning: Removing redundant weights or neurons.
Knowledge Distillation: Training a smaller student model to mimic a larger teacher.

Mixed-precision training is often a precursor, enabling the training of large models that are later compressed for deployment on resource-constrained devices like microcontrollers.

INT8 Inference

INT8 inference is the execution of a neural network using 8-bit integer arithmetic for both weights and activations. It is a common target for post-training quantization of models initially trained with mixed or full precision. INT8 provides a 4x reduction in model size and memory bandwidth compared to FP32, and significant speedups on hardware with optimized integer vector units. Achieving accurate INT8 inference often relies on the numerical stability provided by higher-precision training phases.

Loss Scaling

Loss scaling is a critical technique used in FP16 mixed-precision training to prevent gradient underflow; BFloat16 usually does not need it. Since gradient values can fall below the smallest value representable in FP16, they may become zero. To mitigate this, the loss value is multiplied by a scale factor (e.g., 128, 1024) before backpropagation. This shifts gradients into a representable range. The scaled gradients are then unscaled (divided by the same factor) before they are applied to the optimizer's FP32 master weights, preserving numerical stability.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
