The Perceiver architecture is a transformer-based neural network designed to process data from any modality—such as text, images, audio, or point clouds—by first projecting high-dimensional inputs into a fixed-size latent bottleneck. This bottleneck is then processed by a deep stack of transformer blocks that alternate between cross-attention layers, which attend to the input array, and self-attention layers, which reason within the latent space. Because only the cross-attention layers touch the input, the cost of attending to it scales linearly with input length, while the latent self-attention cost is independent of input size entirely. This design decouples computational complexity from input size, enabling efficient handling of very long sequences or high-resolution data.
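The shape arithmetic behind this decoupling can be illustrated with a minimal NumPy sketch. This is not the reference implementation—it omits multi-head attention, layer normalization, MLP blocks, and learned weights (the sizes and random projections below are illustrative assumptions)—but it shows the core mechanism: the latent array queries the input via cross-attention, then refines itself via self-attention, and its size never depends on the input length.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, Wq, Wk, Wv):
    """Single-head attention: queries from q_in, keys/values from kv_in."""
    q, k, v = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
d, n_latents, n_inputs = 64, 128, 10_000  # hypothetical sizes
W = lambda: rng.normal(0, 0.02, (d, d))   # stand-in for learned projections

latents = rng.normal(size=(n_latents, d))  # fixed-size latent bottleneck
inputs = rng.normal(size=(n_inputs, d))    # large flattened input array

# Cross-attention: latents attend to the input -> cost O(n_latents * n_inputs)
latents = latents + attention(latents, inputs, W(), W(), W())
# Latent self-attention: cost O(n_latents^2), independent of input size
latents = latents + attention(latents, latents, W(), W(), W())

print(latents.shape)  # (128, 64) -- unchanged regardless of input length
```

Doubling `n_inputs` doubles only the cross-attention cost; the self-attention stack, where most of the depth lives, is unaffected, which is the source of the decoupling described above.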
