Quantization is a model compression technique that reduces the numerical precision of a model's weights and activations—for example, from 32-bit floating-point (FP32) to 8-bit integers (INT8)—to decrease memory usage and increase computational speed. The process maps a large range of continuous values onto a small set of discrete levels, trading a minimal amount of model accuracy for substantial gains in efficiency. This makes deployment feasible on resource-constrained edge devices and in high-throughput server environments.
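The mapping from continuous values to discrete levels can be sketched with a minimal affine (min/max) quantization scheme in plain Python. This is an illustrative sketch, not the API of any particular framework; the function names `quantize` and `dequantize` are placeholders.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers in [0, 2**num_bits - 1] via an
    affine transform: q = round(v / scale) + zero_point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    # scale converts one integer step back into a float increment;
    # fall back to 1.0 for a constant tensor to avoid division by zero
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    # clamp to the representable integer range
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.2, 0.0, 0.5, 2.3]
q, scale, zp = quantize(weights)
restored = dequantize(q, scale, zp)
```

Each recovered value differs from the original by at most half a quantization step (`scale / 2` plus rounding of the zero point), which is the accuracy loss the paragraph above refers to.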
