A Sparse Transformer is a neural network architecture designed to handle extremely long sequences by replacing the standard, computationally prohibitive self-attention mechanism with a sparse alternative. Instead of every token attending to all previous tokens—an O(n²) operation—it employs a fixed or learned pattern in which each token attends only to a subset of positions, reducing the cost to roughly O(n√n) in the original formulation and dramatically cutting memory and compute requirements. This enables context windows orders of magnitude larger than those of standard Transformers, which is critical for tasks like long-form document analysis, agentic memory, and high-resolution image generation.
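The idea can be sketched in a few lines of NumPy: build a boolean mask that combines a local sliding window with a strided pattern (one common family of fixed sparse patterns; exact patterns vary by implementation), then apply it inside ordinary scaled dot-product attention by setting disallowed scores to negative infinity before the softmax. The `window` and `stride` values below are illustrative choices, not canonical ones.

```python
import numpy as np

def sparse_attention_mask(n, window=4, stride=4):
    """mask[i, j] = True means token i may attend to token j.
    Combines a local window with a strided pattern; both only
    look backward, so the mask stays causal."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        # Local window: the previous `window` tokens, including self.
        lo = max(0, i - window + 1)
        mask[i, lo:i + 1] = True
        # Strided: every `stride`-th earlier token.
        mask[i, 0:i + 1:stride] = True
    return mask

def sparse_attention(q, k, v, mask):
    """Scaled dot-product attention where disallowed positions
    receive -inf scores, so they get zero weight after softmax."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 16, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
mask = sparse_attention_mask(n)
out = sparse_attention(q, k, v, mask)
print(mask.sum(), "of", n * n, "positions attended")
```

Dense causal attention over 16 tokens would touch 136 position pairs; this mask keeps far fewer, and the gap widens rapidly as n grows, which is where the practical savings come from. Production implementations realize the savings by never materializing the masked-out entries at all (e.g., with block-sparse kernels), rather than computing and discarding them as this sketch does.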
