Cross-attention is a neural network mechanism in which a sequence of queries from one data source attends to and aggregates information from keys and values derived from a separate source. This enables a model to fuse information across modalities (such as text and images) or contexts, conditioning its processing on relevant external data. It is a fundamental component of architectures such as the Perceiver, Flamingo, and latent diffusion models like Stable Diffusion.
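The mechanism above can be sketched in a few lines of numpy. This is a minimal single-head illustration, not any particular library's implementation: the function name, projection matrices, and dimensions are all chosen here for clarity. The key point is that the queries are projected from one source while the keys and values are projected from a different one, so each query position ends up with a weighted mixture of the context's value vectors.

```python
import numpy as np

def cross_attention(x, context, Wq, Wk, Wv):
    """Single-head cross-attention sketch (hypothetical names).

    x       : (n_q, d)  embeddings of the querying sequence (e.g. text tokens)
    context : (n_kv, d) embeddings of the attended-to source (e.g. image patches)
    Wq/Wk/Wv: (d, d)    learned projection matrices (random here for illustration)
    """
    Q = x @ Wq                 # queries come from the first source
    K = context @ Wk           # keys and values come from the second source
    V = context @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (n_q, n_kv) similarities
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the context
    return weights @ V          # (n_q, d): context aggregated per query

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))         # e.g. 4 text-token embeddings
context = rng.normal(size=(6, 8))   # e.g. 6 image-patch embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = cross_attention(x, context, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Note the output has one row per query, regardless of the context length: the context sequence can be any size, which is exactly what lets a model condition a fixed query stream on variable amounts of external data.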
