Automatic Mixed Precision (AMP) is a feature in deep learning frameworks like PyTorch (torch.cuda.amp) and TensorFlow (tf.keras.mixed_precision) that dynamically manages numerical precision to accelerate computation. It automatically casts certain operations to lower-precision formats like FP16 or BF16 to leverage faster hardware execution, while keeping other operations in higher precision like FP32 to preserve numerical stability. This reduces memory bandwidth and increases computational throughput without requiring manual type casting by the developer.
Automatic Mixed Precision (AMP)

What is Automatic Mixed Precision (AMP)?
A software-level technique for accelerating neural network training and inference by automatically selecting optimal numerical precisions for different operations.
The core mechanism involves two components: a policy-based casting system, used in both training and inference, and a gradient scaler that prevents underflow during training. During inference, AMP applies lower precision to compute-intensive operations (e.g., matrix multiplications in linear layers) while keeping sensitive operations (e.g., reductions, softmax) in higher precision. This automation aims for a near-optimal latency-accuracy trade-off, reducing inference cost and latency on supported hardware such as NVIDIA GPUs with Tensor Cores.
Core Mechanisms of AMP
Automatic Mixed Precision (AMP) is a software feature that dynamically selects numerical precisions for different operations to accelerate inference while managing numerical stability. Its core mechanisms automate the complex trade-offs between speed and accuracy.
Precision Casting & Operator Selection
AMP's primary mechanism is the automatic insertion of precision casting operations (e.g., float32 to float16) into the model's computational graph. It uses a predefined operator whitelist/blacklist to decide which operations are safe to run in lower precision (FP16/BF16) and which must remain in high precision (FP32) for stability.
- Whitelisted Ops: Convolutions, matrix multiplications. These are compute-bound and benefit massively from the 2-8x throughput of Tensor Cores/Matrix Cores.
- Blacklisted Ops: Reductions, exponentiation, logarithms. These are often numerically sensitive and stay in FP32.
- Gray-listed Ops: Conditional, handled on a case-by-case basis.
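The casting is performed automatically by the framework's autocast mechanism (e.g., the torch.cuda.amp.autocast() context manager in PyTorch or a tf.keras.mixed_precision policy in TensorFlow), eliminating manual per-tensor casts such as .half() or tf.cast() calls scattered through the model code.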
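The list-based selection described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any framework's real implementation: the op names, the two sets, and the `choose_precision` helper are all hypothetical, and the "follow the widest input" rule for gray-listed ops is an assumption made for the example.

```python
from enum import Enum


class Precision(Enum):
    LOW = "float16"
    HIGH = "float32"


# Hypothetical op lists, for illustration only; real frameworks keep
# their own, much longer, internal lists.
LOW_PRECISION_OPS = {"conv2d", "matmul", "linear"}
HIGH_PRECISION_OPS = {"sum", "exp", "log", "softmax"}


def choose_precision(op_name, input_precisions):
    """Pick the precision an op should run in under this toy policy."""
    if op_name in LOW_PRECISION_OPS:
        return Precision.LOW
    if op_name in HIGH_PRECISION_OPS:
        return Precision.HIGH
    # Gray-listed op (assumption): run in the widest precision among inputs.
    if Precision.HIGH in input_precisions:
        return Precision.HIGH
    return Precision.LOW


matmul_choice = choose_precision("matmul", [Precision.HIGH, Precision.HIGH])
softmax_choice = choose_precision("softmax", [Precision.LOW])
add_choice = choose_precision("add", [Precision.LOW, Precision.HIGH])
```

Here `matmul_choice` is low precision even though its inputs arrive in FP32, `softmax_choice` is promoted to FP32, and the unlisted `add` follows its widest input.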
Loss Scaling for Gradient Stability
During training or fine-tuning, a critical mechanism prevents gradient underflow. When gradients are computed in FP16, small values (below ~6e-8, the smallest FP16 subnormal) flush to zero.
AMP employs dynamic loss scaling:
- The forward pass loss is multiplied by a scale factor (e.g., 2^16).
- Gradients are scaled up proportionally, keeping them within FP16's representable range.
- After the backward pass, gradients are unscaled before the optimizer step.
- The system monitors for gradient overflow (inf/NaN). If overflow is detected, the optimizer step is skipped, and the scale factor is reduced for the next iteration. If no overflow occurs for a set number of consecutive steps, the scale factor is increased again.
This mechanism is essential for maintaining training stability when using FP16, making it a cornerstone of frameworks like PyTorch AMP and TensorFlow's mixed precision policies.
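The dynamic loss scaling steps listed above can be sketched as a small, framework-independent class. This is a simplified illustration rather than the code of PyTorch's GradScaler or TensorFlow's loss-scale optimizer; the class name, method names, and default values (`backoff`, `growth`, `growth_interval`) are assumptions chosen for the example.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler (illustrative, not a framework API)."""

    def __init__(self, init_scale=2.0 ** 16, backoff=0.5, growth=2.0,
                 growth_interval=2000):
        self.scale = init_scale
        self.backoff = backoff
        self.growth = growth
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss):
        # Multiply the loss so that backpropagated gradients are scaled too.
        return loss * self.scale

    def unscale_or_skip(self, scaled_grads):
        """Return unscaled gradients, or None if the step must be skipped."""
        if any(not math.isfinite(g) for g in scaled_grads):
            # Overflow: skip this optimizer step and shrink the scale.
            self.scale *= self.backoff
            self._good_steps = 0
            return None
        unscaled = [g / self.scale for g in scaled_grads]
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A run of clean steps: try a larger scale next time.
            self.scale *= self.growth
            self._good_steps = 0
        return unscaled


scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
skipped = scaler.unscale_or_skip([float("inf")])  # overflow, scale halves
scale_after_overflow = scaler.scale
first = scaler.unscale_or_skip([4.0])
second = scaler.unscale_or_skip([8.0])            # second clean step, scale doubles
```

In a real training loop the returned gradients would be handed to the optimizer, and a `None` result would mean "skip the update for this iteration".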
Master Weights in Optimizer State
To ensure convergence accuracy, AMP maintains a copy of model parameters in full FP32 precision, known as master weights. The mechanism works as follows:
- Forward/Backward Pass: Conducted in FP16/BF16 for speed.
- Optimizer State: The optimizer (e.g., Adam, SGD) stores and updates the FP32 master weights.
- Weight Update: Gradients (unscaled after loss scaling) are applied to the master weights in FP32 precision, preserving update fidelity.
- Copy Down: Before the next forward pass, the updated FP32 master weights are cast down to FP16/BF16 for the model's working weights.
This decoupling allows the compute-intensive forward/backward passes to leverage low-precision speed, while the critical weight update retains high-precision numerical stability. It is a key differentiator from naively casting the entire model to FP16, where small weight updates can be lost to rounding.
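A small numeric experiment shows why the FP32 master copy matters. The sketch below simulates FP16 and FP32 rounding with Python's `struct` module rather than using a real framework; the helper names and the update size of 1e-4 are hypothetical values picked for illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


update = 1e-4  # hypothetical learning_rate * gradient, constant for simplicity
steps = 1000

w_fp16_only = to_fp16(1.0)  # weight stored and updated purely in FP16
w_master = to_fp32(1.0)     # FP32 master weight

for _ in range(steps):
    # Pure FP16: 1.0 - 1e-4 rounds straight back to 1.0, so the update is lost.
    w_fp16_only = to_fp16(w_fp16_only - update)
    # Master weight: the same update is small but representable in FP32.
    w_master = to_fp32(w_master - update)

# "Copy down": the working weight for the next forward pass.
w_working = to_fp16(w_master)
```

After 1000 steps the FP16-only weight has not moved from 1.0, while the master weight has reached roughly 0.9 and its FP16 working copy follows it.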
Hardware Kernel Dispatch & Tensor Cores
AMP's performance gains are realized through hardware-aware kernel dispatch. When AMP casts tensors to FP16/BF16, it enables the framework's backend (e.g., CUDA, cuDNN, oneDNN) to select highly optimized, low-precision kernels.
On NVIDIA GPUs with Tensor Cores and AMD GPUs with Matrix Cores, this triggers the use of specialized arithmetic units that perform mixed-precision matrix operations with drastically higher throughput:
- FP16 matrix multiply with FP32 accumulate.
- BF16 support on Ampere architecture and later.
- INT8 via separate quantization workflows.
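These low-precision kernels are most effective when tensor shapes and memory layouts meet the alignment requirements of the hardware units (NVIDIA's guidance for FP16, for example, recommends matrix dimensions that are multiples of 8). AMP changes only the data type, so choosing shapes that satisfy these requirements remains the developer's responsibility, and any operator fusion comes from the backend or compiler rather than from AMP itself.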
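The benefit of "FP16 multiply with FP32 accumulate" can be illustrated numerically. The sketch below is a pure-Python simulation of the rounding behaviour, not a model of how Tensor Cores are actually implemented; `to_fp16`, `to_fp32`, and `dot` are hypothetical helpers written for this example.

```python
import struct


def to_fp16(x):
    """Round to the nearest half-precision value (inputs here stay in range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(a, b, round_accumulator):
    """Dot product with FP16 inputs and a configurable accumulator precision."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(x) * to_fp16(y)  # product of two FP16 values
        acc = round_accumulator(acc + product)
    return acc


a = [1.0] * 4096
b = [1.0] * 4096

fp16_accumulate = dot(a, b, to_fp16)  # running sum stalls once it reaches 2048
fp32_accumulate = dot(a, b, to_fp32)  # running sum reaches the exact 4096
```

With an FP16 accumulator the running sum stops growing at 2048, because 2049 is not representable and rounds back down; the FP32 accumulator returns the exact result.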
Numerical Safety Guards & Promotions
AMP incorporates automatic numerical safety guards to prevent instability. These are rules that temporarily promote operations back to FP32 to avoid overflow, underflow, or excessive rounding error.
Common promotion triggers include:
- Reduction operations (sum, mean) across large tensors, where FP16's limited range can overflow.
- Normalization operations (LayerNorm, Softmax) where exponentiation can cause overflow in FP16.
- Certain arithmetic sequences known to cause catastrophic precision loss.
These promotions are handled transparently by the framework's autocast logic. The system may also insert checkpointing casts to ensure intermediate values between promoted and non-promoted regions are correctly typed, maintaining the integrity of the computational graph.
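To make the softmax promotion concrete, the sketch below compares a naive softmax evaluated entirely in simulated FP16 with one computed in higher precision and cast down only at the end. It is an illustration under stated assumptions: FP16 is simulated with `struct`, Python's 64-bit floats stand in for FP32, the helper names are hypothetical, and the naive version deliberately omits the usual max-subtraction trick.

```python
import math
import struct


def to_fp16(x):
    """Round to half precision, mapping out-of-range values to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def softmax_fp16_naive(logits):
    """Every intermediate value is rounded to FP16."""
    exps = [to_fp16(math.exp(to_fp16(v))) for v in logits]
    total = 0.0
    for e in exps:
        total = to_fp16(total + e)
    return [to_fp16(e / total) for e in exps]


def softmax_promoted(logits):
    """Intermediates stay in higher precision; only the output is FP16."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [to_fp16(e / total) for e in exps]


logits = [12.0, 10.0, 0.0]
naive = softmax_fp16_naive(logits)   # exp(12) exceeds 65504, giving inf / inf
promoted = softmax_promoted(logits)  # finite probabilities that sum to about 1
```

The naive path produces a NaN for the largest logit because exp(12) overflows FP16, while the promoted path yields a valid distribution.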
Integration with Quantization & Calibration
For inference optimization, AMP often integrates with post-training quantization (PTQ) pipelines. The mechanism involves a calibration phase where AMP manages precision during data collection for quantization parameters.
- Calibration Forward Pass: AMP runs in inference mode, using FP16/BF16 for most layers to speed up the calibration process.
- Data Range Collection: Statistics (min/max) for activations are collected in their runtime precision (FP16/BF16) or are promoted to FP32 for accuracy, depending on the quantization scheme.
- Smooth Transition to INT8: The calibrated model can then be quantized to INT8. AMP's precision casting graph serves as a blueprint for where quantize/dequantize (Q/DQ) nodes should be inserted in formats like TensorRT or ONNX Runtime.
This makes AMP a foundational tool for building multi-stage precision reduction pipelines, bridging pure FP32 models to highly optimized INT8 deployments.
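The calibration idea above, collecting min/max statistics and deriving INT8 parameters from them, can be sketched as follows. This is a minimal example assuming symmetric per-tensor quantization; the `MinMaxObserver` class and the helper functions are hypothetical and do not correspond to the TensorRT or ONNX Runtime APIs.

```python
class MinMaxObserver:
    """Tracks the running min and max of activations seen during calibration."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, activations):
        for v in activations:
            self.min_val = min(self.min_val, v)
            self.max_val = max(self.max_val, v)

    def scale(self):
        # Symmetric scheme (assumption): map the largest magnitude to 127.
        # An all-zero tensor would give a zero scale; not handled here.
        max_abs = max(abs(self.min_val), abs(self.max_val))
        return max_abs / 127.0


def quantize(x, scale):
    """Map a real value to a clamped INT8 code."""
    return max(-127, min(127, round(x / scale)))


def dequantize(q, scale):
    """Map an INT8 code back to an approximate real value."""
    return q * scale


observer = MinMaxObserver()
for batch in ([-0.5, 0.25], [1.0, -2.0]):  # stand-in calibration activations
    observer.observe(batch)

s = observer.scale()
q_max = quantize(2.0, s)
q_clamped = quantize(10.0, s)  # outside the calibrated range, so it saturates
```

Values outside the calibrated range saturate at the INT8 limits, which is why the calibration data should be representative of real inputs.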
How Does Automatic Mixed Precision Work?
Automatic Mixed Precision (AMP) is a framework-level optimization that automatically casts tensors between FP32 (single-precision) and lower-precision formats like FP16 or BF16 as a model executes. It identifies operations that benefit from speed and memory savings at lower precision and those requiring FP32 for numerical stability, such as reductions or computations involving small-magnitude gradients. This automation eliminates the need for manual precision annotations, reducing developer overhead and minimizing the risk of underflow or overflow that can degrade model accuracy.
During execution, AMP typically employs loss scaling to prevent gradient values from vanishing when using FP16. It also leverages hardware support for mixed precision, such as NVIDIA Tensor Cores, which perform matrix operations much faster in reduced precision. The primary goal is to optimize the latency-accuracy trade-off, achieving near-FP32 accuracy with significantly higher throughput and lower memory consumption, which is critical for cost-effective model serving architectures and on-device inference.
Framework Implementations and Usage
Automatic Mixed Precision (AMP) is a software feature, commonly implemented in frameworks like PyTorch and TensorFlow, that automatically selects appropriate numerical precisions for different operations to accelerate training and inference while managing numerical stability.
Inference-Specific Implementations
For inference, AMP is often integrated into dedicated optimization engines that perform static graph analysis.
- TensorRT: When reduced precision is enabled at build time, selects per-layer precision (e.g., FP16) during graph optimization and kernel selection. INT8 additionally requires a calibration step, or explicit quantize/dequantize nodes, to determine activation ranges.
- ONNX Runtime: Applies graph-level transformations to insert cast operations and select high-performance kernels for different precisions based on the execution provider.
- Core Concept: Inference AMP is typically static; precision choices are baked into the optimized model graph after analysis, minimizing runtime overhead.
Key Implementation Patterns
All AMP implementations share common architectural patterns to manage the precision trade-off.
- Operator Whitelist/Blacklist: Frameworks maintain lists of operations safe for FP16 (e.g., convolutions, matmuls) and those that require FP32 (e.g., reductions, exponent-based functions).
- Master Weights: An FP32 copy of the parameters (master weights), along with optimizer states such as momentum, is often kept in FP32 for stability, even when gradients are computed in FP16.
- Automatic Cast Insertion: The core of AMP is the automatic insertion of cast operations (e.g., float32 -> float16 for inputs, float16 -> float32 for sensitive outputs) into the computational graph.
- Numerical Safety Nets: Techniques like loss scaling and maintaining FP32 copies for specific operations are universal safeguards against underflow and overflow.
Precision Formats in AMP: FP32, FP16, and BF16
A technical comparison of the three primary floating-point formats used in Automatic Mixed Precision (AMP) for deep learning inference, detailing their bit-level structure, hardware utilization, and suitability for different operations.
| Feature / Metric | FP32 (Single-Precision) | FP16 (Half-Precision) | BF16 (Brain Float 16) |
|---|---|---|---|
| Total Bits | 32 | 16 | 16 |
| Sign Bits | 1 | 1 | 1 |
| Exponent Bits | 8 | 5 | 8 |
| Mantissa (Fraction) Bits | 23 | 10 | 7 |
| Dynamic Range (approx., normal values) | 1.18e-38 to 3.40e+38 | 6.10e-5 (5.96e-8 with subnormals) to 65504 | 1.18e-38 to 3.39e+38 |
| Memory Footprint (vs FP32) | 100% (Baseline) | 50% | 50% |
| Typical Hardware Throughput | 1x (Baseline) | 2-8x on Tensor Cores | 2-8x on Tensor Cores |
| Primary Use Case in AMP | Master weights, sensitive ops (e.g., softmax) | Activations, gradients, most GEMM ops | Activations, gradients, most GEMM ops |
| Risk of Underflow/Overflow | Very Low | High (small exponent) | Low (matches FP32 exponent) |
| Numerical Stability | Highest | Requires loss scaling | High (inherits FP32 range) |
| Hardware Support | Universal | Modern GPUs (Pascal+), NPUs | Modern GPUs (Ampere+), TPUs, some CPUs |
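The range figures in the table follow directly from the exponent and mantissa bit counts. The short sketch below derives them for an IEEE-754-style binary format; the function name `float_format_limits` is hypothetical, and the formulas assume the standard exponent bias with the top exponent code reserved for inf/NaN.

```python
def float_format_limits(exp_bits, mantissa_bits):
    """Return (max_normal, min_normal, min_subnormal) for a binary float format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_normal = (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    min_subnormal = 2.0 ** (1 - bias - mantissa_bits)
    return max_normal, min_normal, min_subnormal


fp32_limits = float_format_limits(exp_bits=8, mantissa_bits=23)
fp16_limits = float_format_limits(exp_bits=5, mantissa_bits=10)
bf16_limits = float_format_limits(exp_bits=8, mantissa_bits=7)

for name, limits in (("FP32", fp32_limits), ("FP16", fp16_limits),
                     ("BF16", bf16_limits)):
    max_n, min_n, min_sub = limits
    print(f"{name}: max={max_n:.4g} min_normal={min_n:.4g} "
          f"min_subnormal={min_sub:.4g}")
```

The comparison makes the trade-off visible: BF16 shares FP32's exponent width and therefore its range, while FP16 spends those bits on mantissa precision instead.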
Frequently Asked Questions
Automatic Mixed Precision (AMP) is a critical software technique for accelerating neural network training and inference by strategically using lower-precision numerical formats. This FAQ addresses its core mechanisms, benefits, and practical implementation details.
Automatic Mixed Precision (AMP) is a software feature, implemented in frameworks like PyTorch (torch.cuda.amp) and TensorFlow (tf.keras.mixed_precision), that automatically selects appropriate numerical precisions for different operations within a neural network to accelerate computation while managing numerical stability. It works by performing a graph-level analysis to identify which operations can safely use FP16 or BF16 (half-precision) and which require FP32 (full-precision) to prevent issues like underflow, overflow, or excessive quantization error. Key mechanisms include:
- Automatic Casting: The framework inserts precision casting operations to convert tensors to lower precision for eligible compute-intensive ops (like matrix multiplications) and back to higher precision for sensitive ops (like reductions).
- Loss Scaling: To prevent gradient underflow, AMP automatically applies loss scaling, multiplying the loss by a factor before backpropagation to keep gradient values in a representable range for FP16, then unscaling them before the optimizer step.
- Operator Whitelist/Blacklist: Frameworks maintain internal lists identifying which operators are numerically safe (whitelist) or unsafe (blacklist/fallback list) for reduced precision.
Related Terms
Automatic Mixed Precision (AMP) operates within a broader ecosystem of techniques and hardware designed to optimize inference through numerical precision management. The following terms are foundational to understanding its implementation and trade-offs.
Mixed Precision Inference
The overarching computational strategy of using different numerical formats (e.g., FP16, BF16, INT8) within a single model during execution. The goal is to optimize memory bandwidth, computational speed, and energy efficiency.
- Core Principle: Assign higher precision (e.g., FP32) to operations sensitive to numerical error (like small gradient accumulations) and lower precision to bulk compute operations (like matrix multiplications).
- Hardware Synergy: Maximizes the throughput of specialized units like NVIDIA Tensor Cores or AMD Matrix Cores.
- Contrast with AMP: Mixed Precision Inference is the goal; AMP is an automated methodology to achieve it.
Quantization
A model compression technique that reduces the bit-width of a neural network's weights and activations. It is a key enabler for mixed precision inference, especially for integer formats.
- Purpose: Decreases model size and memory footprint, reduces memory bandwidth pressure, and allows the use of faster, lower-power integer arithmetic units.
- Primary Types: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).
- Common Targets: Converting from FP32 to INT8 (4x size reduction) or FP16/BF16 (2x size reduction).
- Relation to AMP: AMP may automatically apply quantization-like casting (e.g., to FP16), but dedicated quantization techniques are more aggressive, targeting INT8 and below.
BFloat16 (BF16)
A 16-bit floating-point format designed specifically for deep learning. It preserves the 8-bit exponent of FP32, matching its dynamic range, but truncates the mantissa (significand).
- Key Advantage: Greatly reduces the risk of numerical overflow/underflow compared to FP16, making it more robust for training and inference without complex loss scaling.
- Hardware Support: Native support on modern AI accelerators (e.g., NVIDIA A100+, AMD MI200+, Google TPUs, Intel Xeon CPUs with AMX).
- AMP Role: A preferred target precision for AMP in frameworks and on hardware where it is available, as it offers a safer speed-up than FP16.
Loss Scaling
A critical technique in mixed precision training (often managed by AMP) to prevent gradient underflow when using FP16. It is less relevant for inference-only AMP.
- Mechanism: The forward pass loss value is multiplied by a scale factor (e.g., 1024). This scaling propagates to the gradients during backpropagation, keeping them in a representable range for FP16.
- Optimizer Step: Gradients are unscaled before the weight update to maintain the correct magnitude.
- Automatic in AMP: Frameworks like PyTorch AMP (torch.cuda.amp.GradScaler) dynamically adjust this scale factor to find its optimal value during training.
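The underflow problem and its fix can be seen with a single number. The sketch below simulates FP16 storage with Python's `struct` module; the gradient value of 1e-8 and the scale of 1024 are illustrative assumptions, and `to_fp16` is a hypothetical helper rather than a framework function.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


true_grad = 1e-8      # hypothetical gradient, below FP16's smallest subnormal
scale_factor = 1024.0

# Without scaling, the gradient is flushed to zero when stored in FP16.
unscaled_fp16 = to_fp16(true_grad)

# With scaling, the stored value is representable (as an FP16 subnormal).
scaled_fp16 = to_fp16(true_grad * scale_factor)

# Unscaling happens in higher precision, before the weight update.
recovered = scaled_fp16 / scale_factor
relative_error = abs(recovered - true_grad) / true_grad
```

The unscaled gradient becomes exactly zero, while the scaled path recovers the original value to within a fraction of a percent.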
Numerical Stability
The property of a computational system to produce correct, well-behaved outputs despite the rounding errors, limited range, and precision loss inherent in floating-point arithmetic, especially at lower bit-widths.
- Risks in Low Precision: Underflow (values becoming zero), overflow (values becoming infinity), and excessive quantization error.
- AMP's Challenge: A primary function of AMP is to automatically manage this stability—for example, by keeping certain operations in FP32 (like reductions) or applying loss scaling—to prevent divergence or accuracy collapse.
- Engineering Trade-off: The central balance in mixed precision is between the performance gains of lower precision and the preservation of numerical stability.
Hardware Support for Mixed Precision
The specialized silicon and instruction sets in modern processors designed to execute low-precision operations with maximal throughput and energy efficiency. AMP leverages this support.
- Key Components:
  - Tensor Cores/Matrix Cores: Dedicated units for mixed-precision matrix multiply-accumulate operations (e.g., FP16 input, FP32 accumulation).
  - Integer Arithmetic Logic Units (ALUs): High-throughput units for INT8/INT4 operations.
- Examples:
  - NVIDIA: Tensor Cores in Volta architecture and later.
  - AMD: Matrix Cores in CDNA architecture (MI series).
  - Intel: Advanced Matrix Extensions (AMX) in Xeon CPUs.
- Implication for AMP: Without this hardware, casting to lower precision may offer no speed benefit or could even slow down computation.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.