Modality-agnostic encoding is a method for processing and representing data from various input types using a single, shared model architecture, abstracting away the specifics of the original modality. The core mechanism involves projecting raw inputs from different sources into a unified embedding space—a common vector representation—where semantically similar concepts are close together regardless of their format. This is often achieved through an initial projection layer that maps modality-specific features into a shared dimensionality, followed by a transformer-based backbone (e.g., a Perceiver or a model with cross-attention) that processes these aligned representations. The goal is to create a shared latent space where a query in one modality can retrieve relevant information from another, enabling tasks like cross-modal retrieval and reasoning without modality-specific model branches.
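The projection-and-retrieval idea above can be sketched in a few lines. This is a minimal illustration, not any particular system's implementation: it assumes hypothetical feature sizes (2048-d image features, 768-d text features), uses plain random linear maps in place of learned projection layers, and omits the transformer backbone entirely, keeping only the shared-space alignment and cosine-similarity retrieval step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes: image features (e.g., from a vision
# encoder) and text features (e.g., from a language model) differ.
IMG_DIM, TXT_DIM, SHARED_DIM = 2048, 768, 512

# Modality-specific projection layers, stubbed here as random linear
# maps; in a real model these would be learned.
W_img = rng.standard_normal((IMG_DIM, SHARED_DIM)) / np.sqrt(IMG_DIM)
W_txt = rng.standard_normal((TXT_DIM, SHARED_DIM)) / np.sqrt(TXT_DIM)

def embed(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project modality-specific features into the shared space and
    L2-normalize so cosine similarity reduces to a dot product."""
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Toy batch: 3 image feature vectors and 3 text feature vectors.
img_feats = rng.standard_normal((3, IMG_DIM))
txt_feats = rng.standard_normal((3, TXT_DIM))

img_emb = embed(img_feats, W_img)   # shape (3, SHARED_DIM)
txt_emb = embed(txt_feats, W_txt)   # shape (3, SHARED_DIM)

# Cross-modal retrieval: cosine similarity of every text query
# against every image embedding, then pick the best match per query.
sim = txt_emb @ img_emb.T           # shape (3, 3)
best_match = sim.argmax(axis=1)     # best image index per text query
print(sim.shape, best_match)
```

With learned projections (trained, for instance, with a contrastive objective that pulls matched image-text pairs together), `best_match` would recover the semantically corresponding item; with the random weights used here, the mechanics are identical but the matches are arbitrary.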
