Deep Compression is a three-stage model compression pipeline introduced by Han et al. (ICLR 2016) that sequentially applies pruning, quantization, and entropy coding to achieve large reductions in neural network size without reported loss of accuracy; the original paper reports roughly 35x compression of AlexNet and 49x of VGG-16. The process first removes redundant connections via magnitude-based pruning, then reduces the number of distinct weight values through quantization with weight sharing, and finally applies a lossless entropy code, Huffman coding, to the quantized weight indices. This pipeline is designed for deployment on memory- and power-constrained edge devices and mobile hardware.
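The three stages can be illustrated with a minimal sketch. This is not the paper's implementation: the function names are hypothetical, the codebook here is a uniform grid standing in for the paper's k-means clusters, and retraining between stages (which Deep Compression relies on to recover accuracy) is omitted.

```python
import heapq
from collections import Counter

import numpy as np


def prune(weights, sparsity=0.9):
    """Stage 1: magnitude pruning -- zero out the smallest-magnitude weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return weights * (np.abs(weights) > threshold)


def quantize(weights, n_clusters=16):
    """Stage 2: weight sharing -- map each surviving weight to a small codebook.

    A uniform codebook is used here as a simple stand-in for the k-means
    clustering used in the paper.
    """
    nonzero = weights[weights != 0]
    codebook = np.linspace(nonzero.min(), nonzero.max(), n_clusters)
    indices = np.abs(weights[..., None] - codebook).argmin(axis=-1)
    quantized = np.where(weights != 0, codebook[indices], 0.0)
    return quantized, indices, codebook


def huffman_code_lengths(symbols):
    """Stage 3: build a Huffman tree over symbol counts; return bits per symbol."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (count, tiebreaker id, {symbol: depth-so-far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, uid, merged))
        uid += 1
    return heap[0][2]


# Demo on a random weight matrix standing in for one layer.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)

pruned = prune(w, sparsity=0.9)                  # keep ~10% of connections
quantized, indices, codebook = quantize(pruned)  # 16 shared values -> 4-bit ids
nonzero_ids = indices[pruned != 0].tolist()
lengths = huffman_code_lengths(nonzero_ids)
huffman_bits = sum(lengths[s] for s in nonzero_ids)
fixed_bits = 4 * len(nonzero_ids)                # fixed 4-bit index baseline
```

Because Huffman coding is optimal among prefix codes, `huffman_bits` never exceeds the fixed-length baseline; the gain grows as the index distribution becomes more skewed, which pruning and clustering tend to produce in practice.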
