Post-Training Quantization (PTQ) is a model compression technique that reduces the numerical precision of a trained neural network's weights and activations (e.g., from 32-bit floating-point to 8-bit integers) to decrease memory footprint and accelerate inference, without requiring retraining. PTQ applies a calibration step using a small, representative dataset to determine appropriate scaling factors that map the observed floating-point range to the integer range, minimizing the accuracy loss caused by the precision reduction. This makes PTQ a fast, low-cost method for model deployment, especially on resource-constrained edge devices and hardware accelerators like Neural Processing Units (NPUs).
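The core of this process can be illustrated with a minimal sketch of symmetric int8 quantization. This is a simplified, hypothetical example (not any particular framework's API): a scale factor is calibrated from a representative tensor's absolute maximum, values are quantized by dividing by the scale, rounding, and clipping to the int8 range, and dequantized by multiplying back.

```python
import numpy as np

def calibrate_scale(calibration_data, num_bits=8):
    """Symmetric calibration: map the observed absolute max onto the int range."""
    max_abs = np.max(np.abs(calibration_data))
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    return max_abs / qmax

def quantize(x, scale):
    # Scale down, round to nearest integer, and clip to the int8 range.
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    # Recover an approximate float tensor from the integers and the scale.
    return q.astype(np.float32) * scale

# Toy "weight" tensor standing in for a trained layer's parameters.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=(4, 4)).astype(np.float32)

scale = calibrate_scale(weights)      # calibration pass
q = quantize(weights, scale)          # int8 representation (4x less memory)
recon = dequantize(q, scale)          # float approximation used at inference
err = np.max(np.abs(weights - recon)) # worst-case per-element rounding error
```

With symmetric rounding, the per-element error is bounded by half the scale, which is why a well-chosen calibration set (one that reflects the real value range) keeps accuracy loss small. Real PTQ pipelines extend this idea with per-channel scales, asymmetric zero-points, and activation calibration over many batches.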
