N:M sparsity is a structured pruning pattern in which at most N weights in every contiguous block of M may be non-zero; in the common 2:4 case, for example, at least two of every four consecutive weights are zero. The constraint produces a regular, predictable structure that specialized hardware can exploit, such as the Sparse Tensor Cores introduced with NVIDIA's Ampere architecture, which support the 2:4 pattern. Unlike unstructured sparsity, whose irregular zero patterns require expensive index tracking, N:M sparsity lets the non-zero values be stored densely alongside a small, fixed-size index per block. The hardware can then skip the zeros and run at a throughput close to that of a dense computation over the smaller, compressed operand, while performing only a fraction of the original operations.
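
To make the block constraint concrete, here is a minimal sketch in plain Python. It assumes the weights to keep are chosen by magnitude (the N largest absolute values in each block), which is a common heuristic but is not prescribed by the definition above. The names `nm_prune` and `satisfies_nm` are hypothetical helpers introduced for illustration, not part of any library.

```python
def nm_prune(weights, n=2, m=4):
    """Zero all but the n largest-magnitude values in each contiguous block of m.

    Assumption: magnitude-based selection. The N:M constraint itself only
    limits how many non-zeros a block may hold, not which ones are kept.
    """
    if not 0 <= n <= m:
        raise ValueError("need 0 <= n <= m")
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")

    pruned = []
    for start in range(0, len(weights), m):
        block = weights[start:start + m]
        # Positions of the n largest magnitudes; sorted() is stable,
        # so ties are resolved in favor of the earlier position.
        keep = set(sorted(range(m), key=lambda i: -abs(block[i]))[:n])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(block))
    return pruned


def satisfies_nm(weights, n=2, m=4):
    """Return True if every contiguous block of m holds at most n non-zeros."""
    return all(
        sum(1 for v in weights[start:start + m] if v != 0) <= n
        for start in range(0, len(weights), m)
    )


row = [0.1, -0.9, 0.3, 0.05, 2.0, -0.2, 0.0, 1.5]
print(nm_prune(row))  # [0.0, -0.9, 0.3, 0.0, 2.0, 0.0, 0.0, 1.5]
```

Each block of four ends up with exactly two surviving values, so the pruned row passes `satisfies_nm(..., n=2, m=4)`. A real pipeline would typically apply the same idea along one dimension of a weight matrix and then fine-tune the model to recover accuracy; that step is outside the scope of this sketch.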




