StreamingLLM is a framework that enables language models trained with a finite context window to process infinite-length text streams without fine-tuning. It achieves this by identifying and leveraging attention sinks—the initial tokens of a sequence, which receive disproportionately high attention scores—to stabilize the attention mechanism. The framework maintains a fixed-size KV cache that combines a sliding window of recent tokens with these critical initial tokens, yielding constant memory usage and per-token computational cost regardless of stream length.
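The cache policy described above can be sketched in a few lines. This is a minimal illustration, not StreamingLLM's actual implementation: the class name `SinkCache`, the parameter names `num_sinks` and `window`, and the default sizes are all assumptions chosen for clarity, and real implementations operate on per-layer key/value tensors rather than raw token IDs.

```python
class SinkCache:
    """Fixed-size cache: permanently keeps the first `num_sinks` tokens
    (the attention sinks) plus a sliding window of the most recent tokens.
    Illustrative sketch only -- a real KV cache stores key/value tensors
    per layer, not token IDs."""

    def __init__(self, num_sinks: int = 4, window: int = 8):
        self.num_sinks = num_sinks
        self.window = window
        self.sinks: list = []   # initial tokens, never evicted
        self.recent: list = []  # sliding window of recent tokens

    def append(self, token) -> None:
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token)        # first tokens become sinks
        else:
            self.recent.append(token)
            if len(self.recent) > self.window:
                self.recent.pop(0)          # evict oldest non-sink token

    def tokens(self) -> list:
        # The span attention is computed over: sinks + recent window.
        # Its length is bounded by num_sinks + window, so memory stays
        # constant no matter how long the stream runs.
        return self.sinks + self.recent


# Feed a stream of 20 token IDs through the cache.
cache = SinkCache(num_sinks=4, window=8)
for t in range(20):
    cache.append(t)

print(cache.tokens())  # → [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

Note how tokens 4 through 11 have been evicted: only the four sinks and the eight most recent tokens remain, so the cache size stays fixed at twelve entries however long the stream grows.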
