Glossary

Mixed Precision Inference

Terms related to using different numerical formats (e.g., FP16, BF16, INT8) within a single inference pass. Intended audience: ML engineers and hardware architects.

Mixed Precision Inference

Mixed precision inference is a computational technique that uses different numerical data types (e.g., FP16, BF16, INT8) within a single model during execution to optimize memory usage, computational speed, and energy efficiency without significantly compromising accuracy.

Quantization

Quantization is a model compression technique that reduces the numerical precision of a neural network's weights and activations (e.g., from 32-bit floating-point to 8-bit integers) to decrease model size and accelerate inference.

Post-Training Quantization (PTQ)

Post-training quantization is a process that converts a pre-trained model to a lower precision format using a calibration dataset, without requiring any retraining, to reduce its memory footprint and computational cost.

Quantization-Aware Training (QAT)

Quantization-aware training is a method where a model is trained or fine-tuned with simulated quantization operations, allowing it to learn to compensate for the precision loss and typically achieve higher accuracy than post-training quantization.

BFloat16 (BF16)

BFloat16 is a 16-bit floating-point format that preserves the dynamic range of a standard 32-bit float (FP32) by using the same 8-bit exponent, at the cost of a shorter 7-bit mantissa. This trade-off makes it well suited to deep learning training and inference on hardware that supports it.
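
As a rough illustration of the shared exponent, the sketch below approximates BF16 by keeping only the top 16 bits of a value's FP32 encoding. The helper name `to_bf16` is hypothetical, and truncation is a simplifying assumption: real implementations often round to nearest instead.

```python
import struct

def to_bf16(x: float) -> float:
    """Approximate BF16 by truncating the FP32 encoding to its top 16 bits.

    Assumption: truncation (round toward zero) stands in for the
    round-to-nearest behaviour that real implementations often use.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # FP32 bit pattern
    bits &= 0xFFFF0000  # keep sign, 8 exponent bits, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

big = to_bf16(1e38)        # far outside FP16's range, still finite here
coarse = to_bf16(3.14159)  # the short mantissa keeps only a few digits
print(big, coarse)
```

The large value stays finite because the exponent field is unchanged, while the second value shows how much mantissa precision is given up.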

FP16 (Half-Precision)

FP16, or half-precision floating-point, is a 16-bit numerical format (5-bit exponent, 10-bit mantissa) that reduces memory bandwidth and can accelerate computation on supported hardware. Its dynamic range is smaller than that of FP32 or BF16 (the largest finite value is 65,504), so values risk numerical underflow or overflow.
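
The range limits can be checked with Python's `struct` module, whose `e` format code packs IEEE 754 half-precision values. This is a minimal sketch; the helper name `to_fp16` is hypothetical. Note one difference from hardware: `struct` raises `OverflowError` for out-of-range inputs, whereas an FP16 arithmetic unit would typically produce infinity.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny = to_fp16(1e-8)        # below the smallest FP16 subnormal: becomes 0.0
largest = to_fp16(65504.0)  # the largest finite FP16 value survives intact

try:
    to_fp16(1e5)            # beyond FP16's representable range
    overflowed = False
except OverflowError:       # struct refuses to pack out-of-range values
    overflowed = True

print(tiny, largest, overflowed)
```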

INT8 Quantization

INT8 quantization is a technique that represents model weights and activations using 8-bit integers, typically offering a 4x reduction in model size and memory bandwidth compared to FP32, enabling faster inference on integer-optimized hardware.

Calibration (Quantization)

Calibration in quantization is the process of analyzing a sample dataset (calibration dataset) to determine the optimal scaling factors and zero-point values for converting floating-point tensors to a lower-bit integer representation.
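
One simple way to picture calibration is min/max range tracking over a few sample batches. The sketch below assumes an unsigned 8-bit affine scheme; the function name `calibrate_minmax` and the sample values are hypothetical, and production tools may use other statistics than the raw minimum and maximum.

```python
def calibrate_minmax(batches, num_bits=8):
    """Derive an unsigned affine (scale, zero_point) pair from min/max statistics."""
    qmin, qmax = 0, 2**num_bits - 1
    # Include 0.0 in the range so that real zero maps to an exact integer.
    lo = min(0.0, min(min(batch) for batch in batches))
    hi = max(0.0, max(max(batch) for batch in batches))
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against an all-zero tensor
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

# Hypothetical calibration data: observed range is [-1.0, 3.0].
calibration_batches = [[-1.0, 0.2, 0.9], [0.5, 3.0, -0.4]]
scale, zero_point = calibrate_minmax(calibration_batches)
print(scale, zero_point)
```

With these values the observed range of width 4.0 is spread over 255 integer steps, and the zero-point lands at the integer that represents real 0.0.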

Per-Tensor vs. Per-Channel Quantization

Per-tensor quantization applies a single scale and zero-point value to an entire tensor, while per-channel quantization uses separate values for each channel (e.g., each output channel of a weight tensor), offering finer granularity and often better accuracy.
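
The accuracy difference shows up when channels differ greatly in magnitude. The following sketch, using symmetric INT8 and a made-up two-channel weight tensor (all names and values are illustrative assumptions), compares the reconstruction error of the small-magnitude channel under each scheme.

```python
def quantize_dequantize(values, scale):
    """Symmetric INT8 round trip: quantize, clamp, and map back to floats."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def peak(values):
    return max(abs(v) for v in values)

def worst_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))

# Hypothetical weights: two output channels of very different magnitude.
weights = [
    [0.01, -0.02, 0.015],
    [5.0, -3.0, 4.0],
]

# Per-tensor: one scale shared by every channel.
tensor_scale = peak([v for row in weights for v in row]) / 127
per_tensor = [quantize_dequantize(row, tensor_scale) for row in weights]

# Per-channel: each output channel gets its own scale.
per_channel = [quantize_dequantize(row, peak(row) / 127) for row in weights]

small = weights[0]
error_per_tensor = worst_error(small, per_tensor[0])
error_per_channel = worst_error(small, per_channel[0])
print(error_per_tensor, error_per_channel)
```

Under the shared scale, the small channel's values are comparable to a single quantization step, so most of their information is lost; with its own scale the same channel uses the full integer range.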

Dequantization

Dequantization is the approximate inverse of quantization: it converts low-precision integer values back into floating-point numbers using the stored scale and zero-point. It is often performed during or after computation so that operations needing higher numerical fidelity can run in floating point.

Automatic Mixed Precision (AMP)

Automatic mixed precision is a software feature, commonly implemented in frameworks like PyTorch and TensorFlow, that automatically selects appropriate numerical precisions for different operations to accelerate training and inference while managing numerical stability.

Loss Scaling (Gradient Scaling)

Loss scaling is a technique used in mixed precision training where the loss value is multiplied by a scale factor before backpropagation to prevent gradient values in FP16 from underflowing to zero, with gradients being unscaled before the optimizer step.
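
The effect can be imitated on a single number. In this sketch a tiny gradient value vanishes when stored in FP16 but survives once it has been multiplied by a scale factor. The factor of 2**16 and the helper name `to_fp16` are assumptions chosen for illustration; because backpropagation is linear in the loss, scaling the loss scales every gradient by the same factor.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

true_grad = 1e-8        # too small for FP16 to represent
loss_scale = 2.0**16    # assumed scale factor, for illustration only

unscaled = to_fp16(true_grad)             # underflows to 0.0
scaled = to_fp16(true_grad * loss_scale)  # now within FP16's normal range
recovered = scaled / loss_scale           # unscale in higher precision

print(unscaled, recovered)
```

The unscaling step is done in a wider format, which mirrors unscaling gradients before the optimizer step.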

Numerical Stability

Numerical stability in mixed precision computing refers to the avoidance of problematic conditions like underflow, overflow, or excessive rounding error that can degrade or invalidate model outputs when using reduced precision formats.
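
Accumulated rounding error is easy to reproduce. The sketch below adds the same small term 10,000 times, once with an FP16 accumulator and once with a wider one (a Python float, which is FP64). The helper `to_fp16` and the chosen values are illustrative assumptions.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

term = to_fp16(0.01)  # the input itself is stored in FP16
count = 10_000        # the exact sum would be about 100

low_precision_sum = 0.0
high_precision_sum = 0.0
for _ in range(count):
    # FP16 accumulator: the running total is rounded after every addition.
    low_precision_sum = to_fp16(low_precision_sum + term)
    # Wider accumulator: same FP16 inputs, higher-precision running total.
    high_precision_sum += term

print(low_precision_sum, high_precision_sum)
```

Once the FP16 total grows large enough that the term is smaller than half the spacing between representable values, further additions are rounded away and the sum stalls well short of 100. Keeping low-precision inputs but accumulating in a wider format avoids this.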

TensorRT

TensorRT is NVIDIA's high-performance deep learning inference SDK and optimizer, which provides layer fusion, precision calibration, and kernel auto-tuning to deploy models with low latency and high throughput on NVIDIA GPUs.

ONNX Runtime

ONNX Runtime is a cross-platform inference and training accelerator for models in the Open Neural Network Exchange format, offering performance optimizations including graph transformations and quantization for various hardware backends.

TFLite (TensorFlow Lite)

TensorFlow Lite is a lightweight framework for deploying machine learning models on mobile, embedded, and edge devices, featuring tools for model conversion, quantization, and hardware acceleration via delegates.

Quantization Error

Quantization error is the difference between the original full-precision value and its quantized representation, arising from the rounding and clipping inherent in the quantization process, which can accumulate and affect model accuracy.
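
The two sources of error, rounding and clipping, behave differently, as the sketch below suggests. It assumes a symmetric signed 8-bit scheme with a hypothetical scale of 0.05; the function names are illustrative.

```python
def quantize(x, scale, qmin=-128, qmax=127):
    """Map a float to the nearest integer level, clamped to the INT8 range."""
    return max(qmin, min(qmax, round(x / scale)))

def dequantize(q, scale):
    return q * scale

scale = 0.05  # assumed scale; representable range is roughly [-6.4, 6.35]

# Values inside the range: error comes from rounding only.
in_range = [0.0, 0.12, -1.337, 6.3]
rounding_errors = [abs(x - dequantize(quantize(x, scale), scale)) for x in in_range]
rounding_bound = scale / 2  # rounding error never exceeds half a step

# A value outside the range: error comes from clipping and is much larger.
clipping_error = abs(10.0 - dequantize(quantize(10.0, scale), scale))

print(max(rounding_errors), rounding_bound, clipping_error)
```

Rounding error is bounded by half the scale, whereas clipping error grows with how far a value lies outside the representable range.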

Symmetric vs. Asymmetric Quantization

Symmetric quantization fixes the zero-point at zero so that the quantized range is centered on zero, simplifying computation, while asymmetric quantization uses a separate zero-point to align the quantized range with the actual distribution of the tensor values, potentially offering better accuracy for skewed distributions.
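
For a one-sided range the difference is visible in the step size. This sketch assumes an activation range of [0, 6], as a ReLU6-style layer might produce, and 8-bit quantization; both function names are hypothetical.

```python
def symmetric_params(lo, hi, num_bits=8):
    """Signed range centered on zero; the zero-point is always 0."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    scale = max(abs(lo), abs(hi)) / qmax
    return scale, 0

def asymmetric_params(lo, hi, num_bits=8):
    """Unsigned range shifted by a zero-point to fit [lo, hi]."""
    qmin, qmax = 0, 2**num_bits - 1  # 0..255 for 8 bits
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

lo, hi = 0.0, 6.0  # assumed one-sided activation range
sym_scale, sym_zero_point = symmetric_params(lo, hi)
asym_scale, asym_zero_point = asymmetric_params(lo, hi)
print(sym_scale, asym_scale)
```

The symmetric scheme spends half of its integer levels on negative values that never occur, so its step size is about twice as coarse as the asymmetric one for this range.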

Dynamic Quantization

Dynamic quantization is a method where scaling factors for activations are determined at runtime based on the observed data range for each inference, as opposed to being predetermined during a static calibration phase.

Static Quantization

Static quantization pre-computes quantization parameters (scale and zero-point) for both weights and activations using a calibration dataset prior to inference, leading to fixed computational graphs and typically lower runtime overhead.
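
The contrast between static and dynamic activation scales can be sketched as follows, using symmetric INT8 and made-up data. The helper names and values are assumptions; the runtime input is deliberately larger than anything in the calibration set.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in values to qmax."""
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0

def quantize(values, scale, qmax=127):
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

# Static: the scale is fixed ahead of time from calibration data.
calibration_data = [0.5, -1.0, 0.8]
static_scale = symmetric_scale(calibration_data)

# Hypothetical runtime input that exceeds the calibrated range.
runtime_input = [4.0, -2.0, 1.0]

static_q = quantize(runtime_input, static_scale)  # out-of-range values clip
# Dynamic: the scale is recomputed from the input actually observed.
dynamic_q = quantize(runtime_input, symmetric_scale(runtime_input))

print(static_q, dynamic_q)
```

With the static scale every element saturates at the edge of the integer range, so the three inputs become indistinguishable in magnitude. The dynamic scale adapts to the input, at the cost of computing the range on every call.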

Fake Quantization

Fake quantization is a simulation technique that inserts nodes into a computational graph to mimic the effects of quantization (rounding and clipping) during training or calibration, without actually changing the underlying numerical precision of the tensors.
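
A fake quantization node amounts to a quantize-then-dequantize round trip whose output is still a float. The sketch below is a scalar version under assumed signed 8-bit limits; the function name and parameters are hypothetical and not tied to any framework's API.

```python
def fake_quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Simulate INT8 rounding and clipping while staying in floating point."""
    q = round(x / scale) + zero_point  # rounding to the nearest level
    q = max(qmin, min(qmax, q))        # clipping to the integer range
    return (q - zero_point) * scale    # back to a float on the quantized grid

rounded = fake_quantize(0.26, scale=0.1)   # snaps to the nearest multiple of 0.1
clipped = fake_quantize(100.0, scale=0.1)  # saturates at the top of the range
print(rounded, clipped)
```

The returned values remain ordinary floats, but they can only land on the levels that an integer representation would allow, which is what lets a model experience quantization effects during training.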

Hardware Support for Mixed Precision

Hardware support for mixed precision refers to the specialized arithmetic units (e.g., Tensor Cores, Matrix Cores) and instruction sets in modern processors and accelerators designed to execute low-precision operations with high throughput and energy efficiency.

Quantization-Aware Fine-Tuning

Quantization-aware fine-tuning is the process of further training a quantized model, or a model with fake quantization nodes, on a task-specific dataset to recover accuracy lost during the quantization process.

Latency-Accuracy Trade-off

The latency-accuracy trade-off in mixed precision inference describes the engineering balance between achieving lower inference time (through reduced precision) and maintaining acceptable model prediction accuracy.

Model Casting (Precision Casting)

Model casting, or precision casting, is the explicit conversion of tensors from one numerical data type to another (e.g., FP32 to FP16) within a model's computational graph, a fundamental operation in mixed precision workflows.
