Inferensys

Post-Training Quantization (PTQ)

Post-Training Quantization (PTQ) is a model compression technique that reduces the numerical precision of a neural network's weights and activations after training to shrink its memory footprint and accelerate inference, without requiring retraining.

MODEL COMPRESSION

What is Post-Training Quantization (PTQ)?

Post-Training Quantization (PTQ) is a critical compression technique for deploying neural networks on resource-constrained hardware, enabling efficient inference without the computational overhead of further training.

Post-Training Quantization (PTQ) is a model compression technique that reduces the numerical precision of a pre-trained neural network's weights and activations—for example, from 32-bit floating-point (FP32) to 8-bit integers (INT8)—without requiring further gradient-based training. The primary goal is to shrink the model's memory footprint and accelerate inference on hardware optimized for integer arithmetic, such as CPUs, mobile processors, and edge AI accelerators. This process is performed after the model has been fully trained and typically involves analyzing a small, representative calibration dataset to determine optimal scaling factors (quantization parameters) that map the float range to the integer range with minimal distortion.

The core challenge PTQ addresses is quantization error: the information lost when precision is reduced. Advanced methods like GPTQ, AWQ, and SmoothQuant use more sophisticated strategies (e.g., approximate second-order Hessian information or activation-aware scaling) to protect the most salient weights and minimize accuracy degradation. Unlike Quantization-Aware Training (QAT), PTQ involves no gradient-based retraining; some methods, such as GPTQ, adjust weights in closed form to compensate for rounding, but none run a training loop. This makes PTQ faster and more data-efficient, and well suited to rapid deployment. It is a foundational step in the on-device inference optimization pipeline and helps make large language models and vision models deployable on edge devices.

Key Characteristics of PTQ

Post-training quantization (PTQ) is a compression technique that reduces the numerical precision of a model's weights and activations after training, enabling efficient deployment without further gradient updates.

01

Calibration Dataset Requirement

PTQ requires a small, representative calibration dataset (typically 128-512 samples) to analyze the statistical distribution of activations. This dataset is used to calculate scaling factors (quantization parameters) that map floating-point values to integer ranges with minimal information loss. No gradient-based learning occurs; the model's weights remain frozen during this profiling phase.
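
As a minimal sketch of how calibration data can turn into quantization parameters, the snippet below derives an asymmetric INT8 scale and zero-point from the min/max of a tiny, made-up calibration set. The function names (`calibrate_affine`, `quantize`, `dequantize`) and the sample values are illustrative assumptions; real toolkits often use more robust range estimators than a plain min/max.

```python
QMIN, QMAX = -128, 127  # signed INT8 range


def calibrate_affine(samples):
    """Derive (scale, zero_point) from the min/max seen in calibration data."""
    lo = min(min(row) for row in samples)
    hi = max(max(row) for row in samples)
    # Keep 0.0 inside the range so it stays exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # guard against an all-zero range
    zero_point = round(QMIN - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point):
    """Map a float to an INT8 code, clamping values outside the range."""
    q = round(x / scale) + zero_point
    return max(QMIN, min(QMAX, q))


def dequantize(q, scale, zero_point):
    """Map an INT8 code back to an approximate float."""
    return (q - zero_point) * scale


# Hypothetical activations recorded on two calibration samples.
calibration = [[-1.0, 0.5, 2.0], [0.1, 3.0, -0.4]]
scale, zero_point = calibrate_affine(calibration)

for row in calibration:
    for x in row:
        q = quantize(x, scale, zero_point)
        print(f"{x:+.3f} -> {q:4d} -> {dequantize(q, scale, zero_point):+.3f}")
```

No weights or gradients are involved here: the calibration pass only observes values and fixes two numbers per tensor.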

02

Precision Targets (INT8, INT4, FP8)

PTQ targets specific numerical formats to reduce memory and compute footprint. Common targets include:

  • INT8 (8-bit integer): The most common target, offering a 4x memory reduction from FP32 with typically <1% accuracy drop for many models.
  • INT4 (4-bit integer): Aggressive compression (8x reduction) requiring more sophisticated algorithms like GPTQ or AWQ to maintain accuracy.
  • FP8 (8-bit floating point): An emerging format that keeps a floating-point exponent, giving it a much wider dynamic range than INT8 at the same bit-width. This is beneficial for models with large activation outliers.
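
The reduction factors above follow directly from the bit-widths. The sketch below works through the arithmetic for a hypothetical 7-billion-parameter model; the parameter count and the helper name `weight_bytes` are assumptions for illustration, and the figures ignore the small overhead of storing scales and zero-points.

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "FP8": 8, "INT4": 4}


def weight_bytes(num_params, fmt):
    """Bytes needed to store the weights alone in the given format."""
    return num_params * BITS_PER_WEIGHT[fmt] // 8


num_params = 7_000_000_000  # hypothetical model size

baseline = weight_bytes(num_params, "FP32")
for fmt in BITS_PER_WEIGHT:
    size = weight_bytes(num_params, fmt)
    print(f"{fmt:>4}: {size / 1e9:5.1f} GB  ({baseline / size:.0f}x smaller than FP32)")
```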

03

Static vs. Dynamic Quantization

PTQ is categorized by when quantization parameters are determined:

  • Static Quantization: Scaling factors are computed once during calibration and remain fixed during inference. This is the most common and performant form of PTQ, as it allows for kernel fusion and hardware acceleration.
  • Dynamic Quantization: Scaling factors are computed on-the-fly for each input during inference. This handles variable input ranges better but introduces runtime overhead. It is often used for quantizing activations in models like LSTMs.
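
The difference between the two modes can be illustrated with a small sketch, assuming symmetric INT8 quantization and made-up values. The static scale comes from calibration data and clips an unexpectedly large runtime value, while the dynamic scale is recomputed from the input itself. `fake_quantize` and `max_abs` are hypothetical helper names.

```python
QMAX = 127  # symmetric INT8: integer codes in [-127, 127]


def max_abs(values):
    return max(abs(v) for v in values)


def fake_quantize(values, scale):
    """Quantize to INT8 and map back to float so the error is visible."""
    out = []
    for v in values:
        q = max(-QMAX, min(QMAX, round(v / scale)))  # clamp out-of-range values
        out.append(q * scale)
    return out


def worst_error(original, approximated):
    return max(abs(a - b) for a, b in zip(original, approximated))


# Static: the scale is fixed once from calibration data (range up to 4.0).
calibration = [0.5, -2.0, 3.5, 4.0]
static_scale = max_abs(calibration) / QMAX

# At inference time an input arrives with a value outside the calibrated range.
runtime_input = [1.0, -3.0, 10.0]
static_out = fake_quantize(runtime_input, static_scale)

# Dynamic: the scale is recomputed from the current input (extra runtime work).
dynamic_scale = max_abs(runtime_input) / QMAX
dynamic_out = fake_quantize(runtime_input, dynamic_scale)

static_err = worst_error(runtime_input, static_out)
dynamic_err = worst_error(runtime_input, dynamic_out)
print("static :", static_out, "worst error", round(static_err, 4))
print("dynamic:", dynamic_out, "worst error", round(dynamic_err, 4))
```

The static path saturates the value 10.0 at the calibrated maximum, which is why a representative calibration set matters. The dynamic path avoids the clipping but pays for a range computation on every input.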

04

Weight-Only vs. Full-Integer Quantization

The scope of quantization defines the performance trade-off:

  • Weight-Only Quantization: Only the model's weights are converted to low precision (e.g., INT8). Activations remain in floating-point (FP16/FP32). This reduces model size and memory bandwidth but offers limited compute speed-up.
  • Full-Integer (Weight & Activation) Quantization: Both weights and activations are converted to integers (e.g., INT8). This enables the use of efficient integer arithmetic units (e.g., NVIDIA Tensor Cores, Intel VNNI) for maximal inference speed-up but is more sensitive to activation outliers.
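
To make the two scopes concrete, the following sketch computes one dot product three ways: in float, with weight-only quantization (weights dequantized, arithmetic in float), and with full-integer quantization (integer multiply-accumulate, one float rescale at the end). It assumes symmetric per-tensor INT8 quantization and made-up values; `quantize_symmetric` is a hypothetical helper.

```python
QMAX = 127  # symmetric INT8


def quantize_symmetric(values):
    """Return integer codes and the per-tensor scale that produced them."""
    scale = max(abs(v) for v in values) / QMAX
    return [round(v / scale) for v in values], scale


x = [0.5, -1.2, 2.0]   # activations (made-up)
w = [0.3, 0.7, -0.4]   # weights (made-up)

reference = sum(xi * wi for xi, wi in zip(x, w))

# Weight-only: weights are stored as INT8 but dequantized before float math.
qw, w_scale = quantize_symmetric(w)
weight_only = sum(xi * (qi * w_scale) for xi, qi in zip(x, qw))

# Full-integer: both operands are INT8, products accumulate in an integer,
# and a single multiplication by the two scales recovers the float result.
qx, x_scale = quantize_symmetric(x)
acc = sum(a * b for a, b in zip(qx, qw))
full_integer = acc * x_scale * w_scale

print(f"float reference: {reference:.4f}")
print(f"weight-only    : {weight_only:.4f}")
print(f"full-integer   : {full_integer:.4f} (integer accumulator = {acc})")
```

The inner loop of the full-integer path contains only integer operations, which is what lets integer hardware units do the heavy lifting.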

05

Algorithmic Approaches (GPTQ, AWQ, SmoothQuant)

Advanced PTQ algorithms mitigate accuracy loss:

  • GPTQ: Uses layer-wise Hessian-based optimization to correct quantization errors, enabling accurate 4-bit weight quantization.
  • AWQ: Identifies salient weight channels (those multiplied by large activation magnitudes) and protects them with a per-channel scaling transformation applied before quantization. All weights are still quantized; none are kept in higher precision.
  • SmoothQuant: Migrates the quantization difficulty from hard-to-quantize activations to easier-to-quantize weights via a per-channel smoothing factor computed offline, enabling performant 8-bit quantization of both.

QUANTIZATION METHOD COMPARISON

PTQ vs. Quantization-Aware Training (QAT)

A feature and workflow comparison between Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), two primary approaches for reducing model precision.

| Feature / Metric | Post-Training Quantization (PTQ) | Quantization-Aware Training (QAT) |
| --- | --- | --- |
| Primary Objective | Compress a pre-trained model without further training. | Train or fine-tune a model to be robust to quantization loss. |
| Required Compute | Low (calibration only). | High (full training cycle). |
| Typical Workflow Time | Minutes to hours. | Hours to days. |
| Required Data | Small, unlabeled calibration dataset (~100-1000 samples). | Full, labeled training dataset. |
| Model Performance | Slight degradation (0.5-2% accuracy drop for INT8). | Near-original FP32 performance (<0.5% drop). |
| Hardware Target | Broad (general INT8/INT4 accelerators). | Specific (optimized for target hardware). |
| Integration Complexity | Low (applied after training). | High (integrated into training loop). |
| Use Case | Production deployment of static models. | Maximizing accuracy for mission-critical, quantized models. |

POST-TRAINING QUANTIZATION

Common PTQ Techniques & Algorithms

Post-training quantization (PTQ) algorithms are designed to compress a pre-trained model by reducing the numerical precision of its parameters and activations. These techniques use a small calibration dataset to determine optimal scaling factors without requiring further gradient-based training.

01

Static Quantization

Static quantization determines the quantization parameters (scale and zero-point) for both weights and activations by analyzing a single, representative calibration dataset. These parameters are then fixed for inference.

  • Process: The calibration pass records the range of activations, after which the model is converted to use integer operations.
  • Advantage: Eliminates runtime overhead for calculating quantization parameters, maximizing inference speed.
  • Use Case: The standard method for quantizing convolutional networks (CNNs) and transformers where activation ranges are stable.

02

Dynamic Quantization

Dynamic quantization determines the quantization parameters for activations on-the-fly for each input during inference, while weights are quantized statically beforehand.

  • Process: The scale and zero-point for a layer's output are computed based on the observed range of values for the current input batch.
  • Advantage: Handles inputs with highly variable value ranges better than static quantization, often improving accuracy for certain layers (e.g., LSTM/GRU outputs).
  • Trade-off: Introduces minor runtime overhead due to per-batch range calculation.

03

GPTQ (GPT Quantization)

GPTQ is a layer-wise, approximate second-order quantization method designed for compressing large generative language models to very low precision (e.g., 4-bit).

  • Mechanism: It uses an approximation of the Hessian (second-order derivatives) of each layer's output reconstruction error with respect to its weights. Weights are quantized a few at a time, and the weights not yet quantized are adjusted to compensate for the rounding error introduced so far.
  • Key Feature: Enables high compression (typically 3-4 bits per weight) with minimal accuracy loss and is performed post-training without fine-tuning.
  • Result: Produces models that run efficiently on consumer GPUs with libraries such as AutoGPTQ.
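
The error-compensation idea behind the mechanism can be shown on a toy layer with just two weights. This is a sketch of a single compensation step under stated assumptions, not the GPTQ algorithm itself, which works on full weight matrices and uses further numerical techniques. The calibration inputs, weights, grid step, and all function names are made up for illustration. The Hessian of the layer's squared output error is proportional to X^T X, where X holds the calibration inputs.

```python
def gram_matrix(inputs):
    """H = X^T X for a 2-feature input matrix (proportional to the Hessian)."""
    return [[sum(row[i] * row[j] for row in inputs) for j in range(2)]
            for i in range(2)]


def quantize(value, step):
    """Round to the nearest multiple of `step` (a uniform grid)."""
    return round(value / step) * step


def layer_error(delta, hessian):
    """Output reconstruction error delta^T H delta for a weight change delta."""
    (a, b), (_, d) = hessian
    return a * delta[0] ** 2 + 2 * b * delta[0] * delta[1] + d * delta[1] ** 2


def quantize_first_with_compensation(weights, hessian, step):
    """Quantize weight 0, then shift weight 1 to absorb part of the error."""
    (a, b), (_, d) = hessian
    det = a * d - b * b
    hinv_00, hinv_10 = d / det, -b / det      # first column of H^-1
    q0 = quantize(weights[0], step)
    err = (weights[0] - q0) / hinv_00
    return [q0, weights[1] - err * hinv_10]


calibration_inputs = [[1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]  # made-up data
hessian = gram_matrix(calibration_inputs)
weights, step = [0.37, -0.52], 0.25

plain = [quantize(weights[0], step), weights[1]]
compensated = quantize_first_with_compensation(weights, hessian, step)

plain_err = layer_error([p - w for p, w in zip(plain, weights)], hessian)
compensated_err = layer_error(
    [c - w for c, w in zip(compensated, weights)], hessian)
print(f"error without compensation: {plain_err:.5f}")
print(f"error with compensation   : {compensated_err:.5f}")
```

Because the two input features are strongly correlated in this toy data, nudging the second weight cancels most of the output error caused by rounding the first. In a full method the second weight would then be quantized in turn.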

04

AWQ (Activation-aware Weight Quantization)

AWQ is a PTQ method that identifies and protects a small subset of salient weights—those multiplied by large activation magnitudes—to preserve model quality at low bit-widths.

  • Core Insight: Not all weights are equally important; the impact of a weight is scaled by its corresponding activation. Protecting 1% of salient weights can preserve most of the model's performance.
  • Process: Scales weights and activations per channel to reduce the quantization error of these salient weights, enabling robust 4-bit quantization.
  • Benefit: Like GPTQ, it requires no retraining and maintains strong zero-shot task performance for language models.
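
A small sketch can show why scaling a salient channel helps. It assumes a symmetric 4-bit grid shared by one group of weights, a made-up group of values, and a hypothetical scaling factor `s`; in a real method the factor would be searched for using calibration activations. Scaling the salient weight up before rounding, and dividing the matching activation by the same factor, leaves the product unchanged in exact arithmetic but shrinks the effective rounding error on that weight.

```python
QMAX = 7  # symmetric INT4 grid: integer codes in [-7, 7]


def fake_quantize_group(weights):
    """Quantize one group with a shared step size, then map back to float."""
    step = max(abs(w) for w in weights) / QMAX
    return [max(-QMAX, min(QMAX, round(w / step))) * step for w in weights]


weights = [0.17, -0.80, 0.35, 0.62]  # one quantization group (made-up values)
salient = 0                          # assumption: channel 0 sees large activations
s = 2.0                              # hypothetical per-channel scaling factor

# Plain round-to-nearest quantization of the group.
plain = fake_quantize_group(weights)
plain_err = abs(plain[salient] - weights[salient])

# Scale the salient weight up before quantization.
scaled = list(weights)
scaled[salient] *= s
scaled_q = fake_quantize_group(scaled)

# Dividing by s stands in for dividing the matching activation by s.
effective = scaled_q[salient] / s
scaled_err = abs(effective - weights[salient])

print(f"salient weight error, plain : {plain_err:.4f}")
print(f"salient weight error, scaled: {scaled_err:.4f}")
```

The benefit in this example depends on the group's largest magnitude, and therefore its step size, staying unchanged after scaling. For any single weight the rounding error can go either way; the improvement holds on average, which is why the choice of factor is tuned rather than fixed.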

05

SmoothQuant

SmoothQuant is a PTQ technique that addresses the challenge of quantizing transformer models with large, outlier values in their activations, which are difficult to represent in low-precision integers.

  • Problem: Outliers in activations (common in models like OPT and BLOOM) force high quantization error if activations are quantized directly.
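  • Solution: Mathematically migrates the quantization difficulty from activations to weights. Each activation channel is divided by a per-channel smoothing factor, which can be folded offline into the preceding layer (e.g., LayerNorm), and the corresponding weights of the following linear layer are multiplied by the same factor, so the layer's output is unchanged.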
  • Outcome: Enables 8-bit quantization of both weights and activations (W8A8) for full transformer inference, which is highly efficient on modern integer hardware.
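
The smoothing step can be sketched in a few lines. The factor for input channel j is taken here as max|X_j|^alpha / max|W_j|^(1 - alpha), with alpha = 0.5 balancing the two ranges. The tiny matrices and the helper names are assumptions for illustration, with channel 1 acting as the activation outlier.

```python
def smoothing_factors(acts, weights, alpha=0.5):
    """One factor per input channel from activation and weight magnitudes."""
    factors = []
    for j in range(len(weights)):
        act_max = max(abs(row[j]) for row in acts)
        w_max = max(abs(v) for v in weights[j])
        factors.append(act_max ** alpha / w_max ** (1 - alpha))
    return factors


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def max_abs(matrix):
    return max(abs(v) for row in matrix for v in row)


# acts: tokens x input channels, weights: input channels x output channels.
acts = [[0.5, 40.0], [-0.3, -60.0]]       # channel 1 holds the outliers
weights = [[0.4, -0.2], [0.05, 0.1]]

s = smoothing_factors(acts, weights)
smooth_acts = [[v / s[j] for j, v in enumerate(row)] for row in acts]
smooth_weights = [[v * s[j] for v in row] for j, row in enumerate(weights)]

original_out = matmul(acts, weights)
smoothed_out = matmul(smooth_acts, smooth_weights)

print("activation range:", max_abs(acts), "->", round(max_abs(smooth_acts), 3))
print("weight range    :", max_abs(weights), "->", round(max_abs(smooth_weights), 3))
print("outputs match   :", all(
    abs(a - b) < 1e-9
    for ra, rb in zip(original_out, smoothed_out) for a, b in zip(ra, rb)))
```

The activation range shrinks sharply while the weight range grows moderately, so both tensors become easier to represent on an 8-bit grid without changing the layer's output.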

06

Calibration Dataset & Metrics

The calibration dataset is a small, representative set of unlabeled data (typically 128-512 samples) used by PTQ algorithms to determine optimal quantization parameters.

  • Purpose: Used to observe the statistical range (min/max) of activations for static quantization or to compute Hessian information for methods like GPTQ.
  • Key Metrics: PTQ success is measured by:
    • Task Accuracy Drop: The change in performance (e.g., perplexity, accuracy) on a benchmark after quantization. A drop of <1% is often considered successful.
    • Model Size Reduction: e.g., reducing a 16-bit (FP16) model to 8-bit (INT8) cuts the model size in half.
    • Inference Latency/Speedup: The reduction in compute time achieved by using integer arithmetic on supporting hardware (e.g., CPUs, NPUs).

Frequently Asked Questions

Post-training quantization is a critical compression technique for deploying models on resource-constrained hardware. These FAQs address its core mechanisms, trade-offs, and practical implementation.

What is post-training quantization (PTQ)?

Post-training quantization is a model compression technique that reduces the numerical precision of a pre-trained neural network's weights and activations (e.g., from 32-bit floating-point to 8-bit integers) without requiring further gradient-based training. The process uses a small, representative calibration dataset to estimate the dynamic range (min/max values) of activations, enabling the calculation of scale and zero-point parameters that map floating-point values to integer representations. This drastically reduces the model's memory footprint and accelerates inference on hardware that natively supports integer arithmetic, such as CPUs, GPUs, and specialized Neural Processing Units.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.