Quantization is the process of mapping a continuous range of high-precision values (e.g., 32-bit floating-point numbers) to a discrete set of lower-precision values (e.g., 8-bit integers). Reducing precision shrinks model size (FP32 to INT8 is a 4x reduction), accelerates inference by enabling faster integer arithmetic on hardware such as CPUs and NPUs, and reduces power consumption. It is a critical technique for deploying large models in resource-constrained environments such as mobile devices and edge platforms. Common targets include converting from FP32 to INT8 or even INT4 precision.
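As a concrete sketch of the FP32-to-INT8 mapping described above, one common scheme is affine quantization, which maps the observed value range onto the integer range via a scale and zero-point. The following minimal NumPy illustration uses hypothetical function names; real frameworks add per-channel scales, calibration, and fused kernels on top of this idea.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine quantization of an FP32 tensor to INT8 (illustrative sketch).

    Maps the range [min(x), max(x)] onto the integer range [-128, 127]
    using a real-valued scale and an integer zero-point.
    """
    qmin, qmax = -128, 127
    x_min, x_max = float(x.min()), float(x.max())
    # Guard against a zero scale when the tensor is constant.
    scale = max(x_max - x_min, 1e-8) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(1000).astype(np.float32)
q, scale, zp = quantize_int8(x)
x_hat = dequantize(q, scale, zp)
# Round-trip error is bounded by roughly one quantization step.
print("max abs error:", np.abs(x - x_hat).max())
```

The round-trip error per element is bounded by about one quantization step (the scale), which is why accuracy degrades gracefully for INT8 but becomes much harder to preserve at INT4, where only 16 levels must cover the same range.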
