Sliding window attention is an efficient transformer attention mechanism in which each token attends only to a fixed-size window of the most recent tokens preceding it, rather than to the entire sequence. This design enforces a locality bias, assuming that the most relevant context for predicting the next token is found nearby. By restricting the attention span to a window of size w, the memory and computational cost grows linearly as O(n·w) in the sequence length n, rather than quadratically as O(n²) for full attention; per token, the cost is constant. This makes it scalable for long-context tasks such as document processing or continuous dialogue.
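The masking described above can be sketched in a few lines of NumPy. This is a minimal, unbatched single-head illustration (the function name, shapes, and window convention are illustrative, not from any particular library): each query position i is allowed to see only key positions i-w+1 through i, implemented as a causal band mask applied before the softmax.

```python
import numpy as np

def sliding_window_attention(q, k, v, window):
    """Naive sliding window attention for one head.

    q, k, v: arrays of shape (seq_len, d).
    window:  number of most recent positions (including itself)
             each token may attend to.
    """
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)  # (seq_len, seq_len) similarity scores

    # Causal band mask: query i may attend to keys j with
    # i - window < j <= i (the `window` most recent tokens).
    idx = np.arange(seq_len)
    allowed = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(allowed, scores, -np.inf)

    # Row-wise softmax; masked positions contribute zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Note that this toy version still materializes the full (seq_len, seq_len) score matrix for clarity; an efficient implementation only ever computes the w scores inside each token's band, which is where the O(n·w) cost comes from.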
