Glossary

Post-Training Quantization (PTQ)

Post-Training Quantization (PTQ) is a model compression technique that reduces the numerical precision of a neural network's weights and activations after training to shrink its memory footprint and accelerate inference, without requiring retraining.

MODEL COMPRESSION

What is Post-Training Quantization (PTQ)?

Post-Training Quantization (PTQ) is a critical compression technique for deploying neural networks on resource-constrained hardware, enabling efficient inference without the computational overhead of further training.

Post-Training Quantization (PTQ) is a model compression technique that reduces the numerical precision of a pre-trained neural network's weights and activations—for example, from 32-bit floating-point (FP32) to 8-bit integers (INT8)—without requiring further gradient-based training. The primary goal is to shrink the model's memory footprint and accelerate inference on hardware optimized for integer arithmetic, such as CPUs, mobile processors, and edge AI accelerators. This process is performed after the model has been fully trained and typically involves analyzing a small, representative calibration dataset to determine optimal scaling factors (quantization parameters) that map the float range to the integer range with minimal distortion.

The core challenge PTQ addresses is quantization error: the information lost when precision is reduced. Advanced methods like GPTQ, AWQ, and SmoothQuant employ more sophisticated strategies (e.g., using approximate second-order Hessian information or activation-aware scaling) to protect the most salient weights and minimize accuracy degradation. Unlike Quantization-Aware Training (QAT), PTQ is a faster, data-efficient process that involves no gradient-based training, making it well suited to rapid deployment. It is a foundational step in the on-device inference optimization pipeline and helps make large language models and vision models deployable on edge devices.

Key Characteristics of PTQ

Post-training quantization (PTQ) is a compression technique that reduces the numerical precision of a model's weights and activations after training, enabling efficient deployment without further gradient updates.

Calibration Dataset Requirement

PTQ requires a small, representative calibration dataset (typically 128-512 samples) to analyze the statistical distribution of activations. This dataset is used to calculate scaling factors (quantization parameters) that map floating-point values to integer ranges with minimal information loss. No gradient-based learning occurs; the model's weights remain frozen during this profiling phase.
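As a minimal sketch of this profiling phase, assuming simple min/max range tracking and an unsigned 8-bit target, the scale and zero-point could be derived as follows. The names MinMaxObserver, quantize, and dequantize are illustrative and not taken from any particular library, and the two small arrays stand in for real calibration batches.

```python
import numpy as np

class MinMaxObserver:
    """Tracks the running min/max of activations across calibration batches."""

    def __init__(self):
        self.lo = np.inf
        self.hi = -np.inf

    def observe(self, batch):
        self.lo = min(self.lo, float(batch.min()))
        self.hi = max(self.hi, float(batch.max()))

    def qparams(self, qmin=0, qmax=255):
        # Include 0.0 in the range so that zero maps to an exact integer.
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        scale = max(hi - lo, 1e-12) / (qmax - qmin)
        zero_point = int(round(qmin - lo / scale))
        return scale, zero_point

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Weights stay frozen; only activation statistics are collected.
observer = MinMaxObserver()
for batch in (np.array([-1.0, 0.5]), np.array([0.2, 3.0])):  # stand-in batches
    observer.observe(batch)
scale, zero_point = observer.qparams()
```

Production toolchains often replace plain min/max with percentile or entropy-based range selection to limit the influence of outliers; the min/max rule is used here only because it is the simplest to follow.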

Precision Targets (INT8, INT4, FP8)

PTQ targets specific numerical formats to reduce memory and compute footprint. Common targets include:

INT8 (8-bit integer): The most common target, offering a 4x memory reduction from FP32 with typically <1% accuracy drop for many models.
INT4 (4-bit integer): Aggressive compression (8x reduction) requiring more sophisticated algorithms like GPTQ or AWQ to maintain accuracy.
FP8 (8-bit floating point): An emerging format that offers a wider dynamic range than INT8 at the same bit width, which is beneficial for models with large activation outliers.
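The reduction factors above follow directly from the bit widths. The short sketch below works through the arithmetic for a hypothetical 7-billion-parameter model (the parameter count and the helper name model_size_gb are assumptions for illustration). It ignores the small overhead of storing scales and zero-points.

```python
def model_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 7e9  # hypothetical model size, chosen only for illustration

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(NUM_PARAMS, bits):.1f} GB")
```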

Static vs. Dynamic Quantization

PTQ is categorized by when quantization parameters are determined:

Static Quantization: Scaling factors are computed once during calibration and remain fixed during inference. This is the most common and performant form of PTQ, as it allows for kernel fusion and hardware acceleration.
Dynamic Quantization: Scaling factors are computed on-the-fly for each input during inference. This handles variable input ranges better but introduces runtime overhead. It is often used for quantizing activations in models like LSTMs.

Weight-Only vs. Full-Integer Quantization

The scope of quantization defines the performance trade-off:

Weight-Only Quantization: Only the model's weights are converted to low precision (e.g., INT8). Activations remain in floating-point (FP16/FP32). This reduces model size and memory bandwidth but offers limited compute speed-up.
Full-Integer (Weight & Activation) Quantization: Both weights and activations are converted to integers (e.g., INT8). This enables the use of efficient integer arithmetic units (e.g., NVIDIA Tensor Cores, Intel VNNI) for maximal inference speed-up but is more sensitive to activation outliers.
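The difference between the two scopes can be sketched with a single linear layer. This is an illustrative NumPy model of the arithmetic, assuming symmetric INT8 quantization with per-row weight scales and a per-tensor activation scale; the function names are hypothetical, and real runtimes execute the integer path in dedicated kernels.

```python
import numpy as np

def symmetric_quantize(x, axis=None):
    """Symmetric INT8 quantization; axis=1 gives one scale per row, None one per tensor."""
    absmax = np.abs(x).max(axis=axis, keepdims=True)
    scale = np.maximum(absmax, 1e-12) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def weight_only_linear(x, w_q, w_scale):
    # Weights are stored as INT8 but dequantized before a floating-point matmul.
    return x @ (w_q.astype(np.float32) * w_scale).T

def full_integer_linear(x_q, x_scale, w_q, w_scale):
    # INT8 x INT8 products are accumulated in INT32, then rescaled to float once.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    return acc * (x_scale * w_scale.T)
```

In the weight-only path the matmul itself still runs in floating point, which is why the benefit is mostly memory and bandwidth; in the full-integer path the inner products are pure integer operations.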

Algorithmic Approaches (GPTQ, AWQ, SmoothQuant)

Advanced PTQ algorithms mitigate accuracy loss:

GPTQ: Uses layer-wise Hessian-based optimization to correct quantization errors, enabling accurate 4-bit weight quantization.
AWQ: Identifies salient weights (those multiplied by large activation magnitudes) and protects them through a per-channel scaling transformation applied before quantization, rather than keeping them in higher precision.
SmoothQuant: Migrates the quantization difficulty from hard-to-quantize activations to easier-to-quantize weights via a mathematically equivalent per-channel smoothing factor, enabling performant 8-bit quantization of both.

Hardware Deployment Target

The primary goal of PTQ is to enable efficient execution on edge hardware and dedicated AI accelerators. These targets have specific requirements:

Mobile CPUs/GPUs (ARM, Adreno): Typically rely on INT8 or FP16 execution for power efficiency.
Neural Processing Units (NPUs): Often have dedicated integer compute pipelines (e.g., Google Edge TPU, Qualcomm Hexagon) that require full-integer quantized models.
Server-Grade Inference Runtimes (NVIDIA TensorRT, Intel OpenVINO): Use PTQ to optimize models for their target hardware, maximizing throughput and reducing latency.

QUANTIZATION METHOD COMPARISON

PTQ vs. Quantization-Aware Training (QAT)

A feature and workflow comparison between Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), two primary approaches for reducing model precision.

Feature / Metric	Post-Training Quantization (PTQ)	Quantization-Aware Training (QAT)
Primary Objective	Compress a pre-trained model without further training.	Train or fine-tune a model to be robust to quantization loss.
Required Compute	Low (calibration only).	High (full training cycle).
Typical Workflow Time	Minutes to hours.	Hours to days.
Required Data	Small, unlabeled calibration dataset (~100-1000 samples).	Full, labeled training dataset.
Model Performance	Slight degradation (0.5-2% accuracy drop for INT8).	Near-original FP32 performance (<0.5% drop).
Hardware Target	Broad (general INT8/INT4 accelerators).	Specific (optimized for target hardware).
Integration Complexity	Low (applied after training).	High (integrated into training loop).
Use Case	Production deployment of static models.	Maximizing accuracy for mission-critical, quantized models.

POST-TRAINING QUANTIZATION

Common PTQ Techniques & Algorithms

Post-training quantization (PTQ) algorithms are designed to compress a pre-trained model by reducing the numerical precision of its parameters and activations. These techniques use a small calibration dataset to determine optimal scaling factors without requiring further gradient-based training.

Static Quantization

Static quantization determines the quantization parameters (scale and zero-point) for both weights and activations by analyzing a single, representative calibration dataset. These parameters are then fixed for inference.

Process: The calibration pass records the range of activations, after which the model is converted to use integer operations.
Advantage: Eliminates runtime overhead for calculating quantization parameters, maximizing inference speed.
Use Case: The standard method for quantizing convolutional networks (CNNs) and transformers where activation ranges are stable.

Dynamic Quantization

Dynamic quantization determines the quantization parameters for activations on-the-fly for each input during inference, while weights are quantized statically beforehand.

Process: The scale and zero-point for a layer's output are computed based on the observed range of values for the current input batch.
Advantage: Handles inputs with highly variable value ranges better than static quantization, often improving accuracy for certain layers (e.g., LSTM/GRU outputs).
Trade-off: Introduces minor runtime overhead due to per-batch range calculation.
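The contrast with static quantization can be illustrated with a small sketch. It assumes symmetric INT8 quantization and a static scale that was calibrated for inputs in [-1, 1]; the helper names are hypothetical. When an input falls far outside the calibrated range, the fixed scale clips it, while a scale computed from the input itself does not.

```python
import numpy as np

QMAX = 127  # symmetric INT8 range used here is [-127, 127]

def quantize_with_scale(x, scale):
    return np.clip(np.round(x / scale), -QMAX, QMAX).astype(np.int8)

def dynamic_scale(x):
    # Computed at inference time from the current input.
    return max(float(np.abs(x).max()), 1e-12) / QMAX

static_scale = 1.0 / QMAX            # fixed at calibration time, assuming |x| <= 1.0
x = np.array([2.0, -10.0, 5.0])      # an input far outside the calibrated range

# Static path: values are clipped to the calibrated range.
x_static = quantize_with_scale(x, static_scale) * static_scale

# Dynamic path: the scale adapts to this input, at the cost of an extra reduction.
scale = dynamic_scale(x)
x_dynamic = quantize_with_scale(x, scale) * scale
```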

GPTQ (GPT Quantization)

GPTQ is a layer-wise, approximate second-order quantization method designed for compressing large generative language models to very low precision (e.g., 4-bit).

Mechanism: It uses approximate second-order information (the Hessian of each layer's output reconstruction error, estimated from calibration inputs) to quantize weights column by column in blocks, adjusting the remaining unquantized weights to compensate for the error introduced.
Key Feature: Enables high compression (typically 3-4 bits per weight) with minimal accuracy loss and is performed post-training without fine-tuning.
Result: Produces models that run efficiently on consumer GPUs with libraries like AutoGPTQ.
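The error-compensation idea can be sketched for a single linear layer in NumPy. This is a simplified illustration, not the reference implementation: it assumes one fixed symmetric scale per output row, processes columns in their natural order, and omits the blocking, grouping, and batching used in practice. The function names are hypothetical.

```python
import numpy as np

def round_to_grid(w, scale, qmax):
    """Round-to-nearest onto a symmetric integer grid, returned in float form."""
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq_like_quantize(W, X, bits=4, damp=0.01):
    """Quantize W (out_features, in_features) using calibration inputs X (n_samples, in_features)."""
    W = W.astype(np.float64).copy()
    n_in = W.shape[1]
    qmax = 2 ** (bits - 1) - 1
    # One fixed scale per output row, taken from the original weights (a simplification).
    scale = np.maximum(np.abs(W).max(axis=1, keepdims=True), 1e-12) / qmax

    # Hessian of the layer-wise squared reconstruction error, damped for stability.
    H = 2.0 * X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(n_in)
    # Upper Cholesky factor U of the inverse Hessian, so that inv(H) = U.T @ U.
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    Q = np.zeros_like(W)
    for i in range(n_in):
        col = W[:, i:i + 1]
        q = round_to_grid(col, scale, qmax)
        Q[:, i:i + 1] = q
        # Spread this column's quantization error over the not-yet-quantized columns.
        err = (col - q) / U[i, i]
        W[:, i + 1:] -= err @ U[i:i + 1, i + 1:]
    return Q, scale
```

Comparing np.linalg.norm(X @ Q.T - X @ W.T) against plain round-to-nearest on the same grid is a quick way to check whether the compensation step helped for a given layer and calibration set.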

AWQ (Activation-aware Weight Quantization)

AWQ is a PTQ method that identifies and protects a small subset of salient weights—those multiplied by large activation magnitudes—to preserve model quality at low bit-widths.

Core Insight: Not all weights are equally important; the impact of a weight is scaled by its corresponding activation. Protecting 1% of salient weights can preserve most of the model's performance.
Process: Scales weights and activations per channel to reduce the quantization error of these salient weights, enabling robust 4-bit quantization.
Benefit: Like GPTQ, it requires no retraining and maintains strong zero-shot task performance for language models.
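A rough sketch of the activation-aware scaling idea follows, assuming a single linear layer, per-row symmetric quantization, and a simple grid search over an exponent alpha applied to the mean activation magnitude. The function names are hypothetical, and the published method adds details such as group-wise quantization and scale normalization that are left out here.

```python
import numpy as np

def quantize_per_row(W, bits=4):
    """Symmetric round-to-nearest with one scale per output row, returned in float form."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(W).max(axis=1, keepdims=True), 1e-12) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def awq_like_search(W, X, bits=4, grid=11):
    """W: (out, in), X: (n_samples, in). Returns (best_alpha, best_err, rtn_err)."""
    ref = X @ W.T
    act_mag = np.maximum(np.abs(X).mean(axis=0), 1e-8)  # per input channel
    errors = []
    alphas = np.linspace(0.0, 1.0, grid)
    for alpha in alphas:
        s = act_mag ** alpha
        # Scale weights up per input channel, quantize, and scale activations down
        # by the same factor so the unquantized product is unchanged.
        W_q = quantize_per_row(W * s, bits)
        out = (X / s) @ W_q.T
        errors.append(float(np.mean((out - ref) ** 2)))
    best = int(np.argmin(errors))
    # alpha = 0 gives s = 1, i.e. plain round-to-nearest without any protection.
    return float(alphas[best]), errors[best], errors[0]
```

Because alpha = 0 is part of the search grid, the selected scaling can never do worse than unscaled round-to-nearest on the calibration data used for the search.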

SmoothQuant

SmoothQuant is a PTQ technique that addresses the challenge of quantizing transformer models with large, outlier values in their activations, which are difficult to represent in low-precision integers.

Problem: Outliers in activations (common in models like OPT and BLOOM) force high quantization error if activations are quantized directly.
Solution: Mathematically migrates the quantization difficulty from activations to weights by 'smoothing' the activation scales via a per-channel scaling factor absorbed into the preceding layer's weights.
Outcome: Enables 8-bit quantization of both weights and activations (W8A8) for full transformer inference, which is highly efficient on modern integer hardware.
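The smoothing step can be sketched on a tiny example. The code below assumes the per-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5 and uses made-up values in which one activation channel carries outliers; the function name smooth is illustrative.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """X: (tokens, in_channels), W: (in_channels, out_channels)."""
    act_max = np.maximum(np.abs(X).max(axis=0), 1e-8)
    w_max = np.maximum(np.abs(W).max(axis=1), 1e-8)
    s = act_max ** alpha / w_max ** (1 - alpha)
    # Divide activations and multiply weights by s, so X @ W is unchanged.
    return X / s, W * s[:, None], s

X = np.array([[1.0, -60.0, 0.5],
              [-0.8, 45.0, 0.3]])   # channel 1 carries outliers
W = np.array([[0.2, -0.1],
              [0.05, 0.02],
              [-0.3, 0.4]])

X_s, W_s, s = smooth(X, W)
```

After smoothing, the layer output X_s @ W_s matches X @ W, while the largest activation magnitude drops from 60 to about 1.73 and the weights absorb the difference. In a real model the division of the activations is folded into the preceding layer, so it adds no runtime cost.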

Calibration Dataset & Metrics

The calibration dataset is a small, representative set of unlabeled data (typically 128-512 samples) used by PTQ algorithms to determine optimal quantization parameters.

Purpose: Used to observe the statistical range (min/max) of activations for static quantization or to compute Hessian information for methods like GPTQ.
Key Metrics: PTQ success is measured by:
- Task Accuracy Drop: The change in performance (e.g., perplexity, accuracy) on a benchmark after quantization. A drop of <1% is often considered successful.
- Model Size Reduction: e.g., reducing a 16-bit (FP16) model to 8-bit (INT8) cuts the model size in half.
- Inference Latency/Speedup: The reduction in compute time achieved by using integer arithmetic on supporting hardware (e.g., CPUs, NPUs).

Frequently Asked Questions

Post-training quantization is a critical compression technique for deploying models on resource-constrained hardware. These FAQs address its core mechanisms, trade-offs, and practical implementation.

Post-training quantization is a model compression technique that reduces the numerical precision of a pre-trained neural network's weights and activations (e.g., from 32-bit floating-point to 8-bit integers) without requiring further gradient-based training. The process uses a small, representative calibration dataset to estimate the dynamic range (min/max values) of activations, enabling the calculation of scale and zero-point parameters that map floating-point values to integer representations. This drastically reduces the model's memory footprint and accelerates inference on hardware that natively supports integer arithmetic, such as CPUs, GPUs, and specialized Neural Processing Units.

PARAMETER-EFFICIENT FINE-TUNING

Related Terms

Post-training quantization (PTQ) is a core model compression technique within the broader field of inference optimization. The following terms are essential for understanding its context, alternatives, and complementary methods.

Quantization-Aware Training (QAT)

Quantization-aware training is a process where a neural network is trained or fine-tuned with simulated quantization operations in the forward pass. This allows the model to learn parameters robust to the precision loss incurred during subsequent integer deployment. Unlike PTQ, which is applied after training, QAT bakes quantization error into the training loop, often yielding higher accuracy at ultra-low precisions (e.g., INT4).

Key Mechanism: Uses fake quantization nodes during training to model the rounding and clipping effects of integer arithmetic.
Trade-off: Requires a full training cycle, making it more computationally expensive than PTQ but more accurate.
Typical Use Case: Deploying models to ultra-constrained edge devices where every bit of precision matters.
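A fake quantization node can be sketched as a quantize-then-dequantize function. The example assumes symmetric signed quantization and pairs it with a straight-through estimator for the backward pass, which is a commonly used choice but an assumption here; the function names are hypothetical, and a real QAT setup would rely on a training framework's autograd.

```python
import numpy as np

def fake_quantize(x, scale, bits=8):
    """Forward pass: round and clip to the integer grid, but keep float values."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), qmin, qmax) * scale

def straight_through_grad(x, scale, upstream, bits=8):
    """Backward pass (assumed STE): pass gradients through unless the value was clipped."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    inside = (x / scale >= qmin) & (x / scale <= qmax)
    return upstream * inside

x = np.array([0.26, -0.74, 100.0])
y = fake_quantize(x, scale=0.5)                              # 100.0 is clipped
g = straight_through_grad(x, 0.5, upstream=np.ones_like(x))  # zero gradient where clipped
```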

GPTQ (GPT Quantization)

GPTQ is a state-of-the-art post-training quantization method specifically designed for compressing large transformer-based language models. It uses layer-wise compression based on approximate second-order information (Hessian matrices) to correct the error introduced by quantizing each weight.

Key Mechanism: Applies Optimal Brain Quantization principles, quantizing weights in a sequence while updating remaining unquantized weights to compensate for the loss.
Precision: Enables reliable 4-bit and lower precision quantization of model weights with minimal performance degradation.
Distinction: A leading algorithm for weight-only quantization, often used in conjunction with FP16 or 8-bit activations for inference.

AWQ (Activation-aware Weight Quantization)

AWQ is a post-training quantization method that scales model weights based on activation magnitudes to preserve critical information. It identifies that not all weights are equally important—salient weights (those multiplied by large activation outliers) are more sensitive to quantization.

Key Mechanism: Applies a per-channel scaling to protect these salient weights before quantization, and applies the inverse scaling to the corresponding input activations so the layer's output is mathematically unchanged.
Advantage: Requires no backpropagation or weight reconstruction, only a small calibration set to measure activation magnitudes, making it fast and robust.
Outcome: Enables high-performance 4-bit weight-only quantization (activations typically stay in FP16), crucial for on-device LLM deployment.

SmoothQuant

SmoothQuant is a post-training quantization technique that addresses the challenge of quantizing transformer activations, which often contain extreme outliers. It migrates the quantization difficulty from activations to the more stable weights.

Key Mechanism: Applies a mathematical smoothing by scaling down the activations and scaling up the corresponding weights per channel, equalizing their dynamic ranges.
Primary Benefit: Enables 8-bit quantization of both weights and activations (W8A8), including coarse per-tensor schemes, simplifying hardware deployment.
Hardware Impact: Allows matrix multiplications to run on INT8 compute units (e.g., many NPUs and GPUs with integer tensor cores) instead of FP16.

Pruning

Pruning is a model compression technique that removes redundant or less important parameters (weights, neurons, or layers) from a neural network. It is often used in conjunction with quantization to maximize compression.

Types: Includes unstructured pruning (individual weights) and structured pruning (entire channels or layers), with the latter being more hardware-friendly.
Process: Can be applied post-training or during training (pruning-aware training).
Synergy with PTQ: Creates a sparse model with fewer non-zero values, which is then quantized, leading to multiplicative size and latency reductions.
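The prune-then-quantize sequence can be sketched as follows, assuming unstructured magnitude pruning and symmetric per-tensor INT8 quantization. The function names and the example matrix are illustrative. Pruned weights stay exactly zero after quantization, but the size reduction is only realized if the runtime stores or skips those zeros with a sparse format.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the smallest-magnitude fraction of weights (ties at the threshold are also removed)."""
    k = int(sparsity * W.size)
    if k == 0:
        return W.copy()
    threshold = np.sort(np.abs(W), axis=None)[k - 1]
    return np.where(np.abs(W) > threshold, W, 0.0)

def quantize_int8(W):
    """Symmetric per-tensor INT8 quantization."""
    scale = max(float(np.abs(W).max()), 1e-12) / 127
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

W = np.array([[0.1, -2.0, 0.05],
              [1.5, -0.3, 0.7]])
W_pruned = magnitude_prune(W, sparsity=0.5)   # keeps the three largest magnitudes
q, scale = quantize_int8(W_pruned)
```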

Knowledge Distillation

Knowledge distillation is a compression and transfer learning technique where a small, efficient student model is trained to mimic the behavior of a larger, more accurate teacher model. The student learns from both the teacher's output probabilities (soft labels) and the ground truth.

Relation to PTQ: Provides an alternative path to a small, fast model. A distilled model can then be further compressed via PTQ for edge deployment.
Key Benefit: Can capture the teacher's generalization ability and dark knowledge in a more compact architecture.
Common Use: Creating tiny language models for on-device use, which are prime candidates for subsequent quantization.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
