Structured sparsity is a model compression paradigm where neural network weights are pruned according to predefined, hardware-friendly patterns—such as entire channels, blocks, or a 2:4 pattern in which at most two of every four consecutive values are non-zero—instead of removing individual, scattered weights. This structured removal creates contiguous, predictable blocks of zeros that specialized hardware and libraries can exploit for sparse matrix multiplication, delivering significant speedups and memory savings while avoiding the irregular memory access patterns of unstructured sparsity.
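To make the 2:4 pattern concrete, here is a minimal sketch (assuming NumPy, with a hypothetical helper name `prune_2_4`) of magnitude-based 2:4 pruning: in every group of four consecutive weights along the last axis, the two smallest-magnitude entries are zeroed and the two largest are kept.

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Apply 2:4 structured sparsity: within each group of 4 consecutive
    values along the last axis, keep the 2 largest-magnitude entries and
    zero the other 2. Assumes the last dimension is divisible by 4."""
    original_shape = weights.shape
    groups = weights.reshape(-1, 4)
    # Indices of the 2 smallest-magnitude entries in each group of 4.
    drop_idx = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop_idx, 0.0, axis=1)
    return pruned.reshape(original_shape)

w = np.array([[0.9, -0.1, 0.05, -0.7],
              [0.2,  0.3, -0.4,  0.1]])
print(prune_2_4(w))
# → [[ 0.9  0.   0.  -0.7]
#    [ 0.   0.3 -0.4  0. ]]
```

Because every group of four contains exactly two zeros, the kept values plus a small per-group index can be stored compactly, which is what sparse tensor cores exploit for the advertised speedups.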
