Glossary

Automatic Mixed Precision (AMP)

Automatic Mixed Precision (AMP) is a software feature that dynamically selects appropriate numerical precisions for different operations to accelerate neural network training and inference while maintaining numerical stability.

MIXED PRECISION INFERENCE

What is Automatic Mixed Precision (AMP)?

A software-level technique for accelerating neural network training and inference by automatically selecting optimal numerical precisions for different operations.

Automatic Mixed Precision (AMP) is a feature in deep learning frameworks like PyTorch (torch.cuda.amp) and TensorFlow (tf.keras.mixed_precision) that dynamically manages numerical precision to accelerate computation. It automatically casts certain operations to lower-precision formats like FP16 or BF16 to leverage faster hardware execution, while keeping other operations in higher precision like FP32 to preserve numerical stability. This reduces memory bandwidth and increases computational throughput without requiring manual type casting by the developer.

AMP has two core components: a casting policy that decides, operation by operation, which precision to use, and a gradient scaler that prevents FP16 gradient underflow during training. The casting policy applies in both training and inference: compute-intensive operations (e.g., matrix multiplications in linear layers) run in lower precision, while numerically sensitive operations (e.g., reductions, softmax) stay in higher precision. The result is usually close to FP32 accuracy at noticeably lower latency and cost on supported hardware such as NVIDIA GPUs with Tensor Cores.

Core Mechanisms of AMP

Automatic Mixed Precision (AMP) is a software feature that dynamically selects numerical precisions for different operations to accelerate inference while managing numerical stability. Its core mechanisms automate the complex trade-offs between speed and accuracy.

Precision Casting & Operator Selection

AMP's primary mechanism is the automatic insertion of precision casting operations (e.g., float32 to float16) into the model's computational graph. It uses a predefined operator whitelist/blacklist to decide which operations are safe to run in lower precision (FP16/BF16) and which must remain in high precision (FP32) for stability.

Whitelisted Ops: Convolutions, matrix multiplications. These are compute-bound and benefit massively from the 2-8x throughput of Tensor Cores/Matrix Cores.
Blacklisted Ops: Reductions, exponentiation, logarithms. These are often numerically sensitive and stay in FP32.
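Gray-listed Ops: Operations that are neither clearly safe nor clearly unsafe in lower precision. They typically follow the precision of their inputs or of neighboring operations rather than forcing a cast.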

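The casting is performed by the framework once mixed precision is enabled, for example inside PyTorch's torch.cuda.amp.autocast() context manager or under a tf.keras.mixed_precision policy. The developer opts in once, and the framework handles the per-operation casts that would otherwise require manual calls such as .half() and .float().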
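The list-based decision described above can be sketched in a few lines. This is only an illustration: the op names, the two sets, and the choose_dtype helper are hypothetical, and the fallback rule for unlisted ops (follow the widest input type) is an assumption about one reasonable policy, not a copy of any framework's actual lists.

```python
# Hypothetical op lists for illustration; real frameworks maintain their own.
LOWER_PRECISION_OPS = {"conv2d", "matmul", "linear"}
FP32_OPS = {"sum", "exp", "log", "softmax"}


def choose_dtype(op_name, input_dtypes):
    """Pick a compute dtype for one op under a simple allow/deny policy."""
    if op_name in LOWER_PRECISION_OPS:
        return "float16"
    if op_name in FP32_OPS:
        return "float32"
    # Unlisted ops: follow the widest input so no precision is silently lost.
    return "float32" if "float32" in input_dtypes else "float16"


# Decide the precision for three ops that each receive mixed-dtype inputs.
plan = {
    op: choose_dtype(op, ["float16", "float32"])
    for op in ["matmul", "softmax", "add"]
}
```

Here plan maps matmul to float16, softmax to float32, and the unlisted add to float32 because one of its inputs is already FP32.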

Loss Scaling for Gradient Stability

During training, including fine-tuning, a critical mechanism prevents gradient underflow. FP16 cannot represent magnitudes below roughly 6e-8 (its smallest subnormal value), so small backpropagated gradients flush to zero and the corresponding weights stop learning.

AMP employs dynamic loss scaling:

The forward pass loss is multiplied by a scale factor (e.g., 2^16).
Gradients are scaled up proportionally, keeping them within FP16's representable range.
After the backward pass, gradients are unscaled before the optimizer step.
The system monitors for gradient overflow (inf/NaN). If overflow is detected, the optimizer step is skipped, and the scale factor is reduced for the next iteration.

This mechanism is essential for maintaining training stability when using FP16, making it a cornerstone of frameworks like PyTorch AMP and TensorFlow's mixed precision policies.
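The effect of loss scaling can be reproduced with Python's standard library alone, since the struct module can round a value to FP16. The sketch below is a toy model, not a framework implementation: to_fp16 and DynamicLossScaler are hypothetical names, and the growth interval of 3 clean steps is chosen only to keep the example short.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest FP16 value (overflow becomes inf)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


class DynamicLossScaler:
    """Toy scaler: halve the scale on overflow, double it after clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=3):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, found_overflow):
        """Return True if the optimizer step should run, False to skip it."""
        if found_overflow:
            self.scale /= 2.0
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps == self.growth_interval:
            self.scale *= 2.0
            self._clean_steps = 0
        return True


true_grad = 1e-8                       # too small for FP16 on its own
unscaled_fp16 = to_fp16(true_grad)     # flushes to zero
scaler = DynamicLossScaler()
scaled_fp16 = to_fp16(true_grad * scaler.scale)  # survives in FP16
recovered = scaled_fp16 / scaler.scale           # unscale in higher precision
```

Without scaling the gradient is lost entirely; with a scale of 2^16 it is stored in FP16 and recovered to within FP16 rounding error after unscaling.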

Master Weights in Optimizer State

To ensure convergence accuracy, AMP maintains a copy of model parameters in full FP32 precision, known as master weights. The mechanism works as follows:

Forward/Backward Pass: Conducted in FP16/BF16 for speed.
Optimizer State: The optimizer (e.g., Adam, SGD) stores and updates the FP32 master weights.
Weight Update: Gradients (unscaled after loss scaling) are applied to the master weights in FP32 precision, preserving update fidelity.
Copy Down: Before the next forward pass, the updated FP32 master weights are cast down to FP16/BF16 for the model's working weights.

This decoupling allows the compute-intensive forward/backward passes to leverage low-precision speed, while the critical weight update retains high-precision numerical stability. It is a key differentiator from naive FP16 training, where weights are stored and updated directly in FP16 and small updates are rounded away.
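A small numerical experiment shows why the FP32 copy matters. The sketch below assumes a single weight starting at 1.0 and a constant, hypothetical update of 1e-4 per step; to_fp16 and to_fp32 are illustrative helpers that round through the struct module.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest FP16 value (overflow becomes inf)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


update = 1e-4   # hypothetical learning_rate * gradient, constant for clarity
steps = 1000

w_fp16_only = to_fp16(1.0)   # weight stored and updated in FP16
w_master = to_fp32(1.0)      # FP32 master weight
for _ in range(steps):
    # FP16 spacing near 1.0 is about 9.8e-4, so adding 1e-4 rounds back to 1.0.
    w_fp16_only = to_fp16(w_fp16_only + update)
    # The FP32 master weight is fine-grained enough to absorb the update.
    w_master = to_fp32(w_master + update)

# Copy down: the working weight for the next forward pass.
w_working = to_fp16(w_master)
```

After 1000 steps the FP16-only weight has not moved from 1.0, while the master weight has reached roughly 1.1 and the FP16 working copy reflects that progress.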

Hardware Kernel Dispatch & Tensor Cores

AMP's performance gains are realized through hardware-aware kernel dispatch. When AMP casts tensors to FP16/BF16, it enables the framework's backend (e.g., CUDA, cuDNN, oneDNN) to select highly optimized, low-precision kernels.

On NVIDIA GPUs with Tensor Cores and AMD GPUs with Matrix Cores, this triggers the use of specialized arithmetic units that perform mixed-precision matrix operations with drastically higher throughput:

FP16 matrix multiply with FP32 accumulate.
BF16 support on Ampere architecture and later.
INT8 via separate quantization workflows.

AMP selects the data types, but it does not reshape tensors. Tensor Core kernels generally run most efficiently when matrix dimensions meet the hardware's alignment preferences (for FP16, multiples of 8 is the commonly cited guideline), so layer and batch sizes still affect how much of the peak throughput is reached.
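The value of "FP16 matrix multiply with FP32 accumulate" can be illustrated in software. The sketch below emulates a dot product of 4096 ones, rounding the running sum either to FP16 or to FP32 after each addition; the helpers are illustrative and only approximate what the hardware does.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest FP16 value (overflow becomes inf)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


a = [1.0] * 4096
b = [1.0] * 4096

acc_fp16 = 0.0
acc_fp32 = 0.0
for x, y in zip(a, b):
    product = to_fp16(x) * to_fp16(y)          # FP16 inputs
    acc_fp16 = to_fp16(acc_fp16 + product)     # accumulate in FP16
    acc_fp32 = to_fp32(acc_fp32 + product)     # accumulate in FP32
```

The FP16 accumulator stalls at 2048, because above that value FP16 can only represent even integers and adding 1 rounds back down. The FP32 accumulator reaches the exact answer, 4096.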

Numerical Safety Guards & Promotions

AMP incorporates automatic numerical safety guards to prevent instability. These are rules that temporarily promote operations back to FP32 to avoid overflow, underflow, or excessive rounding error.

Common promotion triggers include:

Reduction operations (sum, mean) across large tensors, where FP16's limited range can overflow.
Normalization operations (LayerNorm, Softmax) where exponentiation can cause overflow in FP16.
Certain arithmetic sequences known to cause catastrophic precision loss.

These promotions are handled transparently by the framework's autocast logic. The framework also inserts the casts needed at the boundaries between promoted and lower-precision regions, so that each operation receives inputs of the type it expects.
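The first two triggers are easy to reproduce. The sketch below is illustrative only: it computes a mean of 100 values of 1000.0 with an FP16 running sum and again with an FP32 running sum, and evaluates exp(12) as a stand-in for the exponentiation inside softmax. The helper names are hypothetical.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest FP16 value (overflow becomes inf)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


values = [1000.0] * 100

# Reduction kept in FP16: the running sum passes 65504 and becomes inf.
sum_fp16 = 0.0
for v in values:
    sum_fp16 = to_fp16(sum_fp16 + v)
mean_fp16 = sum_fp16 / len(values)

# Reduction promoted to FP32: the sum fits, and the mean fits back into FP16.
sum_fp32 = 0.0
for v in values:
    sum_fp32 = to_fp32(sum_fp32 + v)
mean_promoted = to_fp16(sum_fp32 / len(values))

# Exponentiation: exp(12) is about 162755, beyond FP16's maximum of 65504.
exp_fp16 = to_fp16(math.exp(12.0))
```

The final mean (1000.0) is representable in FP16; only the intermediate sum is not, which is why promoting just the reduction is enough.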

Integration with Quantization & Calibration

For inference optimization, AMP can be combined with post-training quantization (PTQ) pipelines. In such a setup, a calibration phase collects the data needed for quantization parameters while AMP manages the floating-point precision of the forward passes.

Calibration Forward Pass: AMP runs in inference mode, using FP16/BF16 for most layers to speed up the calibration process.
Data Range Collection: Statistics (min/max) for activations are collected in their runtime precision (FP16/BF16) or are promoted to FP32 for accuracy, depending on the quantization scheme.
Transition to INT8: The calibrated model can then be quantized to INT8. The set of layers that tolerated FP16/BF16 can inform where quantize/dequantize (Q/DQ) nodes are placed for runtimes such as TensorRT or ONNX Runtime, although sensitivity to INT8 still has to be validated separately.

This makes AMP a foundational tool for building multi-stage precision reduction pipelines, bridging pure FP32 models to highly optimized INT8 deployments.
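The min/max statistics mentioned above feed directly into the quantization scale. As a minimal sketch, assuming symmetric per-tensor INT8 quantization (one common scheme among several) and a made-up list of calibration activations, the arithmetic looks like this; all function names are hypothetical.

```python
def calibrate_scale(activations):
    """Symmetric per-tensor scale from the largest observed magnitude."""
    max_abs = max(abs(a) for a in activations)
    return max_abs / 127.0


def quantize(x, scale):
    """Map a float to an INT8 code in [-127, 127]."""
    q = round(x / scale)
    return max(-127, min(127, q))


def dequantize(q, scale):
    """Map an INT8 code back to an approximate float."""
    return q * scale


# Made-up activations collected during a calibration forward pass.
calibration_activations = [-2.54, 0.1, 1.0, 0.5]

scale = calibrate_scale(calibration_activations)
codes = [quantize(a, scale) for a in calibration_activations]
restored = [dequantize(q, scale) for q in codes]
```

With these inputs the scale is about 0.02, so the largest magnitude maps to -127 and every in-range value is restored to within half a quantization step.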

How Does Automatic Mixed Precision Work?

Automatic Mixed Precision (AMP) is a framework-level optimization, implemented either as a graph rewrite or as per-operation dispatch depending on the framework, that automatically casts tensors between FP32 (single precision) and lower-precision formats like FP16 or BF16. It distinguishes operations that benefit from the speed and memory savings of lower precision from those that need FP32 for numerical stability, such as reductions or computations involving small-magnitude gradients. This automation eliminates the need for manual precision annotations, reducing developer overhead and the risk of underflow or overflow that can degrade model accuracy.

During execution, AMP typically employs loss scaling to prevent gradient values from vanishing when using FP16. It also leverages hardware support for mixed precision, such as NVIDIA Tensor Cores, which perform matrix operations much faster in reduced precision. The primary goal is to optimize the latency-accuracy trade-off, achieving near-FP32 accuracy with significantly higher throughput and lower memory consumption, which is critical for cost-effective model serving architectures and on-device inference.

AUTOMATIC MIXED PRECISION (AMP)

Framework Implementations and Usage

Automatic Mixed Precision (AMP) is a software feature, commonly implemented in frameworks like PyTorch and TensorFlow, that automatically selects appropriate numerical precisions for different operations to accelerate training and inference while managing numerical stability.

PyTorch AMP (torch.cuda.amp)

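PyTorch's native AMP implementation provides a context manager (autocast) and a gradient scaler (GradScaler). Recent releases also expose both under the device-generic torch.amp namespace.

Autocast Context: Automatically casts eligible operations to a lower-precision dtype within its scope (FP16 by default on CUDA, BF16 if requested), while keeping others in FP32 for stability.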
GradScaler: Applies loss scaling to prevent gradient underflow in FP16, then unscales gradients before the optimizer step.
Usage Pattern: Wraps the forward pass in autocast, computes loss, calls scaler.scale(loss).backward(), then scaler.step(optimizer) and scaler.update().
Benefit: Enables Tensor Core usage on NVIDIA GPUs, often providing 1.5x to 3x training speedups with minimal code changes.

TensorFlow Mixed Precision API

TensorFlow's tf.keras.mixed_precision API offers a policy-based approach to enable AMP.

Policy Setting: A global Policy (e.g., 'mixed_float16') dictates the default layer dtype and variable dtype.
Layer & Variable Casting: Layers automatically cast inputs to their compute dtype; float32 variables are maintained for numeric stability.
Loss Scaling: Integrated via a LossScaleOptimizer wrapper that manages scaling and unscaling of gradients.
Model Build & Training: The policy is set before model construction. The framework handles most precision decisions, though the final output layer (e.g., a softmax activation) is commonly overridden to float32 by passing dtype='float32' to that layer.

NVIDIA Apex (Legacy PyTorch AMP)

NVIDIA's Apex library was the original third-party AMP solution for PyTorch, offering multiple optimization levels.

Opt Levels: Provided O0 (pure FP32), O1 (mixed precision via patched functions, with FP32 weights), O2 (FP16 model with FP32 master weights), and O3 (pure FP16). O1 was the recommended safe default.
Dynamic Loss Scaling: Included robust, dynamic loss scaling algorithms.
Evolution: Apex's core AMP functionality was officially integrated into PyTorch as torch.cuda.amp. Apex is now considered legacy, with PyTorch's native implementation being the standard.

JAX Automatic Mixed Precision

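JAX has no direct equivalent of an autocast context manager. Mixed precision is usually expressed by choosing dtypes explicitly or through a policy library built on JAX's primitives.

Explicit Dtypes: Arrays and parameters carry their dtype, so computation can be cast to BF16 or FP16 while parameters stay in FP32.
Parallel Contexts: Because dtypes are ordinary array properties, mixed precision composes with JAX's parallel computation transforms such as pjit.
Custom Gradients: Precision policies can be defined for forward and backward passes using custom gradient transformations.
Compiler Integration: Casts are compiled by XLA together with the rest of the program during JIT compilation, which allows them to be fused with neighboring operations.
Library Support: High-level libraries like Flax and Haiku build on JAX's primitives to offer mixed precision training utilities, for example separate dtypes for parameters and computation.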

Inference-Specific Implementations

For inference, AMP is often integrated into dedicated optimization engines that perform static graph analysis.

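TensorRT: Selects among the precisions the builder is allowed to use (e.g., FP32, FP16, INT8) during graph optimization and kernel selection. FP16 is enabled through a builder setting, while INT8 additionally needs dynamic-range information, typically from a calibration step or from explicit Q/DQ nodes.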
ONNX Runtime: Applies graph-level transformations to insert cast operations and select high-performance kernels for different precisions based on the execution provider.
Core Concept: Inference AMP is typically static; precision choices are baked into the optimized model graph after analysis, minimizing runtime overhead.

Key Implementation Patterns

All AMP implementations share common architectural patterns to manage the precision trade-off.

Operator Whitelist/Blacklist: Frameworks maintain lists of operations safe for FP16 (e.g., convolutions, matmuls) and those that require FP32 (e.g., reductions, exponent-based functions).
Master Weights: An FP32 copy of the parameters (the master weights) is kept for the weight update, and optimizer state such as momentum is typically also held in FP32, even when gradients are computed in FP16.
Automatic Cast Insertion: The core of AMP is the automatic insertion of cast operations (e.g., float32 -> float16 for inputs, float16 -> float32 for sensitive outputs) into the computational graph.
Numerical Safety Nets: Techniques like loss scaling and maintaining FP32 copies for specific operations are universal safeguards against underflow and overflow.

NUMERICAL FORMAT COMPARISON

Precision Formats in AMP: FP32, FP16, and BF16

A technical comparison of the three primary floating-point formats used in Automatic Mixed Precision (AMP) for deep learning inference, detailing their bit-level structure, hardware utilization, and suitability for different operations.

Feature / Metric	FP32 (Single-Precision)	FP16 (Half-Precision)	BF16 (Brain Float 16)
Total Bits	32	16	16
Sign Bits	1	1	1
Exponent Bits	8	5	8
Mantissa (Fraction) Bits	23	10	7
Dynamic Range (approx., normal values)	1.18e-38 to 3.40e+38	6.10e-5 to 65504 (subnormals down to 5.96e-8)	1.18e-38 to 3.39e+38
Memory Footprint (vs FP32)	100% (Baseline)	50%	50%
Typical Hardware Throughput	1x (Baseline)	2-8x on Tensor Cores	2-8x on Tensor Cores
Primary Use Case in AMP	Master weights, sensitive ops (e.g., softmax)	Activations, gradients, most GEMM ops	Activations, gradients, most GEMM ops
Risk of Underflow/Overflow	Very Low	High (small exponent)	Low (matches FP32 exponent)
Numerical Stability	Highest	Requires loss scaling	High (inherits FP32 range)
Hardware Support	Universal	Modern GPUs (Pascal+), NPUs	Modern GPUs (Ampere+), TPUs, some CPUs
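The range figures in the table follow directly from the exponent and mantissa bit counts. As a quick check, the sketch below derives them from the standard IEEE-style layout (bias = 2^(exponent bits - 1) - 1); the format_limits helper is a hypothetical name used only here.

```python
def format_limits(exponent_bits, mantissa_bits):
    """Derive range limits of a binary floating-point format from its layout."""
    bias = 2 ** (exponent_bits - 1) - 1
    return {
        # Largest finite value: all mantissa bits set, largest usable exponent.
        "max": (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias,
        # Smallest positive normal value.
        "min_normal": 2.0 ** (1 - bias),
        # Smallest positive subnormal value.
        "min_subnormal": 2.0 ** (1 - bias - mantissa_bits),
        # Spacing between 1.0 and the next representable value.
        "epsilon": 2.0 ** -mantissa_bits,
    }


fp32 = format_limits(exponent_bits=8, mantissa_bits=23)
fp16 = format_limits(exponent_bits=5, mantissa_bits=10)
bf16 = format_limits(exponent_bits=8, mantissa_bits=7)
```

FP16 tops out at exactly 65504 with a smallest subnormal of 2^-24 (about 5.96e-8), while BF16 shares FP32's smallest normal value and pays for it with a coarser epsilon of 2^-7.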

Frequently Asked Questions

Automatic Mixed Precision (AMP) is a critical software technique for accelerating neural network training and inference by strategically using lower-precision numerical formats. This FAQ addresses its core mechanisms, benefits, and practical implementation details.

Automatic Mixed Precision (AMP) is a software feature, implemented in frameworks like PyTorch (torch.cuda.amp) and TensorFlow (tf.keras.mixed_precision), that automatically selects appropriate numerical precisions for different operations within a neural network to accelerate computation while managing numerical stability. It applies per-operation rules to decide which operations can safely use FP16 or BF16 (half precision) and which require FP32 (single precision) to prevent issues like underflow, overflow, or excessive rounding error. Key mechanisms include:

Automatic Casting: The framework inserts precision casting operations to convert tensors to lower precision for eligible compute-intensive ops (like matrix multiplications) and back to higher precision for sensitive ops (like reductions).
Loss Scaling: To prevent gradient underflow, AMP automatically applies loss scaling, multiplying the loss by a factor before backpropagation to keep gradient values in a representable range for FP16, then unscaling them before the optimizer step.
Operator Whitelist/Blacklist: Frameworks maintain internal lists identifying which operators are numerically safe (whitelist) or unsafe (blacklist/fallback list) for reduced precision.

Related Terms

Automatic Mixed Precision (AMP) operates within a broader ecosystem of techniques and hardware designed to optimize inference through numerical precision management. The following terms are foundational to understanding its implementation and trade-offs.

Mixed Precision Inference

The overarching computational strategy of using different numerical formats (e.g., FP16, BF16, INT8) within a single model during execution. The goal is to optimize memory bandwidth, computational speed, and energy efficiency.

Core Principle: Assign higher precision (e.g., FP32) to operations sensitive to numerical error (like small gradient accumulations) and lower precision to bulk compute operations (like matrix multiplications).
Hardware Synergy: Maximizes the throughput of specialized units like NVIDIA Tensor Cores or AMD Matrix Cores.
Contrast with AMP: Mixed Precision Inference is the goal; AMP is an automated methodology to achieve it.

Quantization

A model compression technique that reduces the bit-width of a neural network's weights and activations. It is a key enabler for mixed precision inference, especially for integer formats.

Purpose: Decreases model size and memory footprint, reduces memory bandwidth pressure, and allows the use of faster, lower-power integer arithmetic units.
Primary Types: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).
Common Targets: Converting from FP32 to INT8 (4x size reduction) or FP16/BF16 (2x size reduction).
Relation to AMP: AMP may automatically apply quantization-like casting (e.g., to FP16), but dedicated quantization techniques are more aggressive, targeting INT8 and below.

BFloat16 (BF16)

A 16-bit floating-point format designed specifically for deep learning. It preserves the 8-bit exponent of FP32, matching its dynamic range, but truncates the mantissa (significand).

Key Advantage: Greatly reduces the risk of numerical overflow/underflow compared to FP16, making it more robust for training and inference without complex loss scaling.
Hardware Support: Native support on modern AI accelerators (e.g., NVIDIA A100+, AMD MI200+, Google TPUs, Intel Xeon CPUs with AMX).
AMP Role: A preferred target precision for AMP in frameworks and on hardware where it is available, as it offers a safer speed-up than FP16.
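Because BF16 is FP32 with the low 16 mantissa bits removed, the conversion can be sketched with integer bit manipulation. The example below is a minimal illustration using round-to-nearest-even; to_bf16 is a hypothetical helper, and NaN, infinity, and values that would round past the largest finite BF16 value are deliberately not handled.

```python
import struct


def to_bf16(x):
    """Round a value to BF16 precision and return it as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # FP32 bit pattern
    # Round to nearest even on the 16 bits that are about to be dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def to_fp16(x):
    """Round a Python float to the nearest FP16 value (overflow becomes inf)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


tiny = 1e-30
huge = 3.0e38

tiny_bf16, tiny_fp16 = to_bf16(tiny), to_fp16(tiny)   # BF16 keeps it, FP16 gives 0.0
huge_bf16, huge_fp16 = to_bf16(huge), to_fp16(huge)   # BF16 keeps it, FP16 gives inf
```

Both extreme values survive in BF16 with a relative error below 2^-7, whereas FP16 flushes the small one to zero and overflows on the large one.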

Loss Scaling

A critical technique in mixed precision training (often managed by AMP) to prevent gradient underflow when using FP16. It is less relevant for inference-only AMP.

Mechanism: The forward pass loss value is multiplied by a scale factor (e.g., 1024). This scaling propagates to the gradients during backpropagation, keeping them in a representable range for FP16.
Optimizer Step: Gradients are unscaled before the weight update to maintain the correct magnitude.
Automatic in AMP: Frameworks like PyTorch AMP (torch.cuda.amp.GradScaler) dynamically adjust this scale factor to find its optimal value during training.

Numerical Stability

The property of a computational system to produce accurate, well-behaved outputs despite the rounding errors, limited range, and precision loss inherent in floating-point arithmetic, especially at lower bit-widths.

Risks in Low Precision: Underflow (values becoming zero), overflow (values becoming infinity), and excessive rounding error.
AMP's Challenge: A primary function of AMP is to automatically manage this stability—for example, by keeping certain operations in FP32 (like reductions) or applying loss scaling—to prevent divergence or accuracy collapse.
Engineering Trade-off: The central balance in mixed precision is between the performance gains of lower precision and the preservation of numerical stability.

Hardware Support for Mixed Precision

The specialized silicon and instruction sets in modern processors designed to execute low-precision operations with maximal throughput and energy efficiency. AMP leverages this support.

Key Components:
- Tensor Cores/Matrix Cores: Dedicated units for mixed-precision matrix multiply-accumulate operations (e.g., FP16 input, FP32 accumulation).
- Integer Arithmetic Logic Units (ALUs): High-throughput units for INT8/INT4 operations.
Examples:
- NVIDIA: Tensor Cores in Volta architecture and later.
- AMD: Matrix Cores in CDNA architecture (MI series).
- Intel: Advanced Matrix Extensions (AMX) in Xeon CPUs.
Implication for AMP: Without this hardware, casting to lower precision may offer no speed benefit or could even slow down computation.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
