Flamingo is a visual language model (VLM) architecture that enables few-shot learning on multimodal tasks by bridging a pre-trained, frozen vision encoder and a frozen large language model (LLM) with newly trained gated cross-attention layers.
The architecture works through a multi-stage process:
- Visual Feature Extraction: A frozen vision encoder (like a Vision Transformer or ResNet) processes input images or video frames into a sequence of visual tokens.
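The first stage can be illustrated with a toy patch-embedding sketch. This is not the actual frozen encoder (Flamingo uses a pre-trained contrastively trained vision model); the patch size, projection matrix `W`, and dimensions here are arbitrary stand-ins chosen only to show how an image becomes a sequence of visual tokens.

```python
import numpy as np

# Hypothetical sizes: a 224x224 RGB image split into 16x16 patches,
# each patch flattened and linearly projected to a d-dimensional token.
image = np.random.rand(224, 224, 3)
patch, d = 16, 64
n_side = 224 // patch                       # 14 patches per side

# Rearrange the image into one flattened vector per patch.
patches = image.reshape(n_side, patch, n_side, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_side * n_side, -1)

# A random projection stands in for the frozen encoder's learned weights.
W = np.random.rand(patches.shape[1], d) * 0.01
visual_tokens = patches @ W                 # shape (196, 64)
print(visual_tokens.shape)
```

The key point is the output shape: one token per patch, so the sequence length varies with image resolution (and, for video, with the number of frames).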
- Perceiver Resampler: This component acts as a learned bottleneck, using a fixed number of latent queries and cross-attention to condense the variable-length visual token sequence into a fixed, manageable number of visual tokens.
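The resampling step can be sketched as a single cross-attention in which a fixed set of learned latent queries attends to the visual tokens. The sizes (8 latents, dimension 64) and the bare-bones attention are illustrative assumptions, not the full multi-layer Perceiver Resampler, but they show the essential property: the output length is fixed regardless of the input length.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d, n_latents = 64, 8                        # hypothetical sizes
latents = np.random.rand(n_latents, d)      # learned queries, fixed count

def resample(visual_tokens):
    # Cross-attention: the latents query the variable-length visual tokens.
    scores = latents @ visual_tokens.T / np.sqrt(d)
    return softmax(scores) @ visual_tokens  # always (n_latents, d)

# Whatever the input length, the output is a fixed 8 tokens.
print(resample(np.random.rand(196, d)).shape)  # (8, 64)
print(resample(np.random.rand(500, d)).shape)  # (8, 64)
```

This fixed-size bottleneck is what keeps the cost of the later cross-attention layers independent of image resolution or video length.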
- Gated Cross-Attention Integration: This is the core innovation. New gated cross-attention layers (GATED XATTN-DENSE) are inserted between the frozen layers of the LLM. At these layers, the text representations attend to the resampled visual tokens, letting the model condition on images that are interleaved with text in the prompt. A learned tanh gate, initialised to zero, controls how much visual information flows into the text stream, so the frozen LLM's original behaviour is preserved at the start of training.
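A minimal sketch of the gating idea, with illustrative shapes and a single-head attention (the real layers are multi-head and include a gated feed-forward block as well): text hidden states attend to the visual tokens, and a tanh gate scales the result before the residual addition. Because the gate parameter starts at zero, the layer is initially an identity on the text stream.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 64
alpha = 0.0                                 # gate parameter, initialised to 0

def gated_xattn(text_h, visual_tokens):
    # Text hidden states attend to the resampled visual tokens...
    scores = text_h @ visual_tokens.T / np.sqrt(d)
    attended = softmax(scores) @ visual_tokens
    # ...and a learned tanh gate scales the result before the residual add.
    return text_h + np.tanh(alpha) * attended

text_h = np.random.rand(10, d)              # 10 text positions
vis = np.random.rand(8, d)                  # 8 resampled visual tokens

# With alpha = 0, tanh(alpha) = 0: the frozen LLM's computation is
# completely untouched at the start of training.
assert np.allclose(gated_xattn(text_h, vis), text_h)
```

As `alpha` is learned during training, the gate opens and visual information gradually flows into the frozen language model.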
- Conditional Text Generation: The LLM, now conditioned on the visual context via the cross-attention gates, generates the textual output (e.g., an answer to a visual question).
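The effect of conditioning on next-token prediction can be shown with a toy sketch. All names, weights, and sizes here are hypothetical: `visual_ctx` stands in for the attended visual output, and `gate` for the learned gate value. With the gate closed the prediction is the text-only one; with it open, the visual context can change the predicted token.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 5                            # toy sizes, random weights

W_out = rng.standard_normal((d, vocab))     # stand-in for the LM head
visual_ctx = rng.standard_normal(d)         # stand-in for attended visual info

def next_token(text_h, gate):
    # The gated cross-attention output is mixed into the hidden state
    # before the LM head, conditioning the prediction on the image.
    h = text_h + gate * visual_ctx
    return int(np.argmax(h @ W_out))

h = rng.standard_normal(d)
print(next_token(h, gate=0.0), next_token(h, gate=1.0))
```

With `gate=0.0` the output reduces exactly to the frozen LLM's own prediction, which is why generation degrades gracefully when no image is relevant.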
By keeping the core vision and language models frozen, Flamingo trains only the newly added components (the Perceiver Resampler and the gated cross-attention layers). This yields strong performance while training only a fraction of the total parameters, enabling efficient adaptation to new multimodal tasks from just a few in-context examples.