Neural network pruning is a model compression technique that removes less important parameters—individual weights, entire neurons, or full layers—from a trained network to reduce its size and computational footprint while aiming to preserve its original accuracy. The process typically involves training a large, dense model, scoring the salience of each parameter (e.g., via magnitude- or gradient-based metrics), and then iteratively removing the least salient ones, often followed by fine-tuning to recover lost accuracy. The result is a sparse network that requires less memory and can enable faster inference, provided the hardware or runtime can exploit the sparsity.
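The salience-scoring and removal step described above can be sketched with the simplest common criterion, weight magnitude. The following is a minimal illustrative example (not a production implementation); the function name `magnitude_prune` and the use of a plain NumPy array in place of a real network layer are assumptions for clarity:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured magnitude pruning: zero out the fraction `sparsity`
    of weights with the smallest absolute values."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(weights.size * sparsity)  # number of weights to remove
    if k == 0:
        return weights.copy()
    magnitudes = np.abs(weights).ravel()
    # k-th smallest magnitude serves as the pruning threshold
    threshold = np.partition(magnitudes, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

# Example: prune 50% of a tiny weight matrix
w = np.array([[0.1, -0.8],
              [0.05, 1.2]])
pruned = magnitude_prune(w, 0.5)
# The two smallest-magnitude entries (0.1 and 0.05) are zeroed;
# -0.8 and 1.2 survive.
```

In practice this masking step would be applied per layer (or globally across layers) inside a prune–fine-tune loop, and frameworks such as PyTorch provide built-in utilities for it.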
