Inferensys

Glossary

GPTQ (GPT Quantization)

GPTQ is a post-training quantization algorithm that uses second-order information to compress transformer model weights to 4-bit or lower precision with minimal performance degradation, enabling efficient deployment on edge hardware.
MODEL COMPRESSION

What is GPTQ (GPT Quantization)?

GPTQ is a state-of-the-art post-training quantization method designed to compress large transformer-based language models for efficient inference.

GPTQ (GPT Quantization) is a post-training quantization algorithm that compresses transformer model weights to ultra-low precision, typically 4-bit or lower, with minimal accuracy loss, enabling efficient deployment on memory-constrained hardware. It operates layer by layer, using second-order information from the Hessian of each layer's reconstruction error to compensate for quantization errors, which makes it significantly more accurate than simple round-to-nearest methods. This technique is a cornerstone of model compression and edge AI deployment strategies.

The method builds on the Optimal Brain Quantization (OBQ) approach, which treats weight compression as a layer-wise reconstruction problem: find quantized weights whose layer output on calibration data stays as close as possible to the original output. Rather than keeping selected weights at higher precision, GPTQ quantizes weights one column at a time and adjusts the remaining unquantized weights to absorb the resulting error. This maintains model performance while drastically reducing model size and memory bandwidth. GPTQ is closely related to other compression techniques such as AWQ and SmoothQuant, and is widely used to enable on-device inference of models like Llama and Mistral.
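
The layer-wise reconstruction problem can be written compactly. The symbols below are assumptions introduced for illustration and follow common usage rather than anything defined on this page: W is the weight matrix of one linear layer, X holds the calibration inputs to that layer (one column per token), and Ŵ is the quantized weight matrix.

```latex
\hat{W} \;=\; \arg\min_{\hat{W}} \; \lVert W X - \hat{W} X \rVert_2^2,
\qquad
H \;=\; 2\, X X^{\top}
```

Here H is the Hessian of this objective with respect to any single row of W, which is why it depends only on the calibration inputs and can be shared across all rows of the layer.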

POST-TRAINING QUANTIZATION

Key Features of GPTQ

GPTQ is a state-of-the-art post-training quantization method that compresses transformer models to 4-bit or lower precision with minimal accuracy loss. Its core innovation lies in using second-order information for highly accurate, layer-wise compression.

01

Layer-Wise Quantization

GPTQ quantizes the model one layer at a time, using the preceding layers' outputs to calibrate the quantization of the current layer. This sequential, greedy approach minimizes the propagation of quantization error through the network, which is critical for maintaining the performance of deep transformer architectures.

  • Process: The algorithm freezes all layers except the one being quantized.
  • Calibration: Uses a small, representative dataset to observe activation patterns.
  • Benefit: Achieves higher accuracy than naive round-to-nearest quantization applied to the whole model without calibration.
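
The sequential calibration loop described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the reference implementation: the "model" is a stack of plain linear layers with a ReLU in between, and `rtn_quantize` and `quantize_sequentially` are hypothetical helper names. The quantizer is a simple round-to-nearest placeholder where a real GPTQ solver would use the calibration inputs.

```python
import numpy as np

def rtn_quantize(w, x, bits=4):
    """Placeholder quantizer: per-row symmetric round-to-nearest (ignores x)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def quantize_sequentially(weights, calib_x, quantize_fn=rtn_quantize):
    """Quantize layers one at a time, feeding each layer the outputs
    of the already-quantized layers before it."""
    x = calib_x
    quantized, rel_errors = [], []
    for w in weights:
        w_q = quantize_fn(w, x)            # only this layer changes
        ref = w @ x                        # full-precision output on the same inputs
        rel_errors.append(np.linalg.norm(ref - w_q @ x) / np.linalg.norm(ref))
        quantized.append(w_q)
        x = np.maximum(w_q @ x, 0.0)       # propagate through the quantized layer
    return quantized, rel_errors

rng = np.random.default_rng(0)
layers = [rng.normal(size=(32, 32)) / np.sqrt(32) for _ in range(3)]
calib = rng.normal(size=(32, 128))         # (features, calibration samples)
q_layers, errs = quantize_sequentially(layers, calib)
print([round(float(e), 3) for e in errs])
```

Passing the activations of the quantized prefix forward means each layer is calibrated on the inputs it will actually see at inference time.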
02

Hessian-Based Error Compensation

The method's accuracy stems from using the Hessian matrix (a matrix of second-order derivatives) to determine how sensitive the layer's output is to changes in each weight. GPTQ computes the Hessian of the layer-wise reconstruction error with respect to the layer's weights from calibration inputs, then uses its inverse to decide how the remaining weights should shift when one weight is rounded.

  • Second-Order Information: More accurate than first-order (gradient) methods for determining sensitivity.
  • Optimal Brain Quantization (OBQ): GPTQ is based on the OBQ framework, adapted for massive models.
  • Outcome: Allows aggressive quantization (e.g., to 4-bit) while compensating for error in the directions the layer output is most sensitive to.
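
As a small numerical check of why the Hessian is the right sensitivity measure, the sketch below builds H = 2XXᵀ from random calibration inputs and confirms that the output error caused by a weight perturbation equals the quadratic form defined by H. The shapes and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 256))   # (in_features, calibration tokens)
W = rng.normal(size=(8, 16))     # (out_features, in_features)

# Hessian of ||W X - W_hat X||^2 with respect to one row of W.
H = 2.0 * X @ X.T

# Hypothetical perturbation standing in for a quantization error.
delta = 0.01 * rng.normal(size=W.shape)

err_direct = np.sum((delta @ X) ** 2)              # measured output error
err_quadratic = 0.5 * np.sum((delta @ H) * delta)  # 1/2 * tr(delta H delta^T)

print(err_direct, err_quadratic)
```

Because the two quantities agree, minimizing the quadratic form in H is the same as minimizing the layer's output error on the calibration data.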
03

Low-Bit Weight Deployment

A primary goal of GPTQ is to shrink weights enough that large models fit in, and run efficiently from, limited memory. GPTQ quantizes weights only (e.g., to INT4); activations typically stay in FP16, and inference kernels dequantize the weights on the fly. The main gain is a smaller memory footprint and lower memory bandwidth, which speeds up memory-bound inference on hardware such as GPUs and NPUs.

  • Memory Footprint: Reduces weight storage by roughly 4x (FP16 to INT4), slightly less once per-group scales and zero points are counted.
  • Inference Speed: Token generation is usually limited by memory bandwidth, so reading 4-bit weights instead of 16-bit weights can be significantly faster on many accelerators.
  • Compatibility: Quantized models are served by runtimes like GPTQ-for-LLaMA, AutoGPTQ, and vLLM.
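
To make the storage saving concrete, here is a minimal sketch of packing two 4-bit values into each byte and unpacking them again. The layout is a hypothetical one chosen for clarity; real runtimes use their own packed formats, and `pack_int4` and `unpack_int4` are illustrative names, not functions from any of the libraries above.

```python
import numpy as np

def pack_int4(q):
    """Pack unsigned 4-bit values (0..15) pairwise into bytes, low nibble first."""
    q = np.asarray(q, dtype=np.uint8)
    if q.size % 2 != 0 or q.max() > 15:
        raise ValueError("need an even number of values in the range 0..15")
    return (q[0::2] | (q[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Inverse of pack_int4: recover the original 4-bit values."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F   # low nibble
    out[1::2] = packed >> 4     # high nibble
    return out

q = np.array([0, 15, 7, 8, 3, 12], dtype=np.uint8)
packed = pack_int4(q)
restored = unpack_int4(packed)
print(packed.nbytes, restored)
```

Six 4-bit values occupy three bytes here, compared with twelve bytes for the same number of FP16 weights.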
04

Trade-Offs: Group Size & Accuracy

GPTQ introduces a group size hyperparameter that controls the granularity of quantization. Weights within a layer are partitioned into blocks (groups), and each group has its own quantization scale factor. This creates a key trade-off:

  • Smaller Group Size (e.g., 32 or 64): Higher accuracy, more scale factors, slightly increased storage and compute overhead.
  • Larger Group Size (e.g., 1024): Lower accuracy, fewer scale factors, maximized compression.
  • Practical Use: A group size of 128 is a common default, providing near-lossless 4-bit compression for many models.
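
The trade-off can be illustrated with a short sketch that applies plain round-to-nearest quantization with one scale and zero point per group, then reports the effective bits per weight and the reconstruction error for several group sizes. It assumes a 16-bit scale and a 4-bit zero point per group, which is one plausible storage layout rather than a fixed standard, and it uses random weights rather than a real model.

```python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=128):
    """Asymmetric round-to-nearest with one scale and zero point per group."""
    out_f, in_f = w.shape
    if in_f % group_size != 0:
        raise ValueError("in_features must be divisible by group_size")
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.min(axis=2, keepdims=True)
    w_max = g.max(axis=2, keepdims=True)
    qmax = 2 ** bits - 1
    scale = (w_max - w_min) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round((g - w_min) / scale), 0, qmax)
    return (q * scale + w_min).reshape(out_f, in_f)   # dequantized weights

def bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Effective storage cost, including per-group metadata (assumed sizes)."""
    return bits + (scale_bits + zero_bits) / group_size

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 1024))
mse = {}
for gs in (32, 128, 1024):
    mse[gs] = float(np.mean((w - quantize_groupwise(w, 4, gs)) ** 2))
    print(gs, bits_per_weight(4, gs), mse[gs])
```

Under these assumptions a group size of 128 costs 4.15625 bits per weight, so the metadata overhead is small compared with the accuracy gained from finer-grained scales.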
05

Comparison to AWQ & SmoothQuant

GPTQ is one of several leading PTQ methods, each with distinct strategies:

  • vs. AWQ (Activation-aware Weight Quantization): AWQ protects weights that are multiplied by large activation magnitudes. GPTQ uses Hessian information. AWQ is often faster to apply; GPTQ can be more accurate but is computationally heavier.
  • vs. SmoothQuant: SmoothQuant mathematically "smooths" outlier activations by shifting part of their scale into the weights, which makes 8-bit quantization of both weights and activations practical. GPTQ primarily targets extreme weight quantization (to 4-bit).
  • Use Case: Choose GPTQ for maximal weight compression where calibration compute is available.
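
For contrast with GPTQ's Hessian-based approach, the smoothing step used by SmoothQuant can be sketched in a few lines. This is a simplified illustration with a hypothetical `smooth` helper and synthetic data: it moves scale from an outlier activation channel into the matching weight row and leaves the layer output unchanged.

```python
import numpy as np

def smooth(x, w, alpha=0.5):
    """x: (tokens, in_features), w: (in_features, out_features).
    Returns rescaled activations and weights whose product is unchanged."""
    act_max = np.abs(x).max(axis=0)            # per input channel
    w_max = np.abs(w).max(axis=1)              # per input channel
    s = act_max ** alpha / w_max ** (1 - alpha)
    return x / s, w * s[:, None], s

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))
x[:, 3] *= 50.0                                # synthetic outlier channel
w = rng.normal(size=(8, 4))

x_s, w_s, s = smooth(x, w)
print(np.abs(x).max(), np.abs(x_s).max())
```

After smoothing, the largest activation magnitude is much smaller, so an 8-bit activation grid wastes less of its range on a single channel.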
COMPARISON

GPTQ vs. Other Quantization Methods

A feature comparison of GPTQ against other prominent post-training and training-time quantization techniques used for compressing large language models.

| Feature / Metric | GPTQ | AWQ (Activation-aware) | SmoothQuant | Quantization-Aware Training (QAT) |
| --- | --- | --- | --- | --- |
| Core Methodology | Layer-wise Hessian-based weight rounding | Activation-guided weight scaling | Mathematical smoothing of activation outliers | End-to-end training with simulated quantization |
| Primary Use Case | Post-training weight quantization (4-bit and below) | Post-training weight quantization (4-bit) | Post-training quantization of weights and activations (8-bit, W8A8) | Training or fine-tuning models for subsequent integer deployment |
| Calibration Data Required | Small, unlabeled sample (128-512 examples) | Small, unlabeled sample | Small, unlabeled sample | Full training dataset for the target task |
| Typical Weight Precision | 2-bit, 3-bit, 4-bit, 8-bit | 4-bit | 8-bit (weights and activations) | 4-bit, 8-bit (at deployment) |
| Activation Quantization | No (weight-only) | No (weight-only) | Yes (8-bit) | Yes (typically) |
| Computational Overhead | Moderate (layer-wise optimization) | Low (per-channel scaling) | Low (offline scaling factors) | High (full training loop) |
| Performance Preservation (vs. FP16) | Excellent for 4-bit, good for lower bits | Excellent for 4-bit | Near-lossless for 8-bit | Best possible, learned for target precision |
| Hardware Support | Widely supported via kernels (e.g., ExLlama, AutoGPTQ) | Growing kernel support | Native support in many inference engines | Dependent on framework (e.g., TensorRT, TFLite) |

IMPLEMENTATION ECOSYSTEM

Frameworks and Tools Supporting GPTQ

GPTQ is implemented through a specialized ecosystem of libraries and compilers designed to integrate quantized models into production workflows. These tools handle the quantization process, runtime execution, and hardware acceleration.

04

TensorRT-LLM & vLLM

High-performance inference engines like TensorRT-LLM (NVIDIA) and vLLM have added support for GPTQ-quantized models to maximize throughput and reduce latency in production serving.

  • TensorRT-LLM: NVIDIA's toolkit compiles models for optimized execution on Tensor Cores. It supports GPTQ-quantized 4-bit weights through optimized weight-only kernels, targeting high performance on NVIDIA GPUs.
  • vLLM: Known for its innovative PagedAttention algorithm, vLLM supports GPTQ to increase the number of models or concurrent requests that can be served per GPU. It focuses on efficient attention and memory management for quantized weights.
  • Use Case: Essential for deploying quantized models in high-demand API endpoints or batch inference services.
23x
TensorRT-LLM GPTQ vs. FP16 speedup (A100)
GPTQ

Frequently Asked Questions

GPTQ is a leading post-training quantization method for compressing large language models. These questions address its core mechanics, applications, and how it compares to other techniques.

What is GPTQ and how does it work?

GPTQ (GPT Quantization) is a post-training quantization method that compresses transformer model weights to 4-bit or lower precision with minimal accuracy loss. It works by applying layer-wise quantization using second-order information from the Hessian matrix to correct the error introduced when rounding weights to lower precision.

The algorithm processes the model one layer at a time. Within each layer, it quantizes the weight matrix column by column, working through blocks of columns (e.g., 128 at a time), to INT4 or INT3. It uses the Hessian, which captures the curvature of the layer's reconstruction error and therefore its sensitivity to weight changes, to update the remaining unquantized weights so that they compensate for the error caused by the columns just quantized. This Hessian-informed update keeps the layer's output as close as possible to the original, making GPTQ exceptionally effective for compressing models like LLaMA and GPT-2 without additional training.
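
The quantize-then-compensate loop can be sketched for a single weight matrix as follows. This is a simplified illustration, not the reference implementation: it processes one column at a time without the block-wise batching or grouping described above, uses a fixed symmetric 4-bit grid per output row, and relies on synthetic correlated calibration data. The function names and the `percdamp` damping parameter are assumptions made for this example.

```python
import numpy as np

def rtn(w, scale):
    """Round to the nearest point on a symmetric 4-bit grid."""
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq_quantize(W, X, percdamp=0.01):
    """Column-by-column quantization with Hessian-based error compensation."""
    W = W.astype(np.float64).copy()
    H = 2.0 * X @ X.T
    H += percdamp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # damping for stability
    # Upper Cholesky factor U of the inverse Hessian: H^-1 = U^T U.
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    scale = np.abs(W).max(axis=1) / 7.0          # per-row scale, fixed up front
    scale = np.where(scale == 0, 1.0, scale)
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        q = rtn(W[:, i], scale)
        Q[:, i] = q
        err = (W[:, i] - q) / U[i, i]
        # Shift the not-yet-quantized columns to absorb this column's error.
        W[:, i:] -= np.outer(err, U[i, i:])
    return Q, scale

rng = np.random.default_rng(0)
# Correlated calibration inputs: 8 latent factors plus noise, 32 features.
X = rng.normal(size=(32, 8)) @ rng.normal(size=(8, 512)) + rng.normal(size=(32, 512))
W = rng.normal(size=(16, 32))

Q_gptq, scale = gptq_quantize(W, X)
Q_rtn = rtn(W, scale[:, None])                   # same grid, no compensation

def output_error(Q):
    return float(np.sum(((W - Q) @ X) ** 2))

print(output_error(Q_rtn), output_error(Q_gptq))
```

Both results lie on the same 4-bit grid; the difference is that the compensated version chooses grid points that jointly preserve the layer output, which matters most when input features are correlated.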

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.