An attention sink is a phenomenon, identified in the StreamingLLM framework, in which the initial tokens of an input sequence (often the first four) receive a disproportionately large share of a model's attention scores regardless of their semantic relevance. It arises because the softmax in the attention mechanism must distribute a total probability mass of one across all visible positions; since the initial tokens are visible to every subsequent token, they accumulate excess attention and act as a stable "sink" for residual attention probability. StreamingLLM exploits this by always retaining these initial tokens in the KV cache alongside a sliding window of the most recent tokens, which allows models trained on finite contexts to process effectively unbounded text streams without fine-tuning.
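The eviction policy this implies can be sketched in a few lines. The following is a minimal illustrative simulation, not the official StreamingLLM implementation; the class name, parameters, and defaults (`num_sinks=4`, `window=6`) are assumptions chosen for demonstration:

```python
class SinkCache:
    """Toy model of StreamingLLM-style KV-cache eviction: keep the first
    `num_sinks` token entries (the attention sinks) plus a sliding window
    of the `window` most recent entries; evict everything in between."""

    def __init__(self, num_sinks=4, window=6):
        self.num_sinks = num_sinks
        self.window = window
        self.entries = []  # stands in for per-token key/value pairs

    def append(self, entry):
        self.entries.append(entry)
        if len(self.entries) > self.num_sinks + self.window:
            # Drop the oldest non-sink entry, preserving the sink tokens.
            del self.entries[self.num_sinks]


cache = SinkCache(num_sinks=4, window=6)
for t in range(20):
    cache.append(t)
print(cache.entries)  # sink positions 0-3 retained, plus the 6 most recent tokens
```

In a real model the evicted entries are key/value tensors per layer and head, but the retention logic is the same: the cache size stays bounded while the sink tokens remain permanently available for attention.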
