A unified embedding space is a single, shared vector representation into which data from multiple modalities, such as text, images, and audio, is encoded, enabling direct semantic comparison and retrieval across data types. It is typically learned by training per-modality encoders with a contrastive objective such as the InfoNCE loss, so that diverse inputs are projected into a common latent space where semantically similar concepts land close together regardless of their original format. CLIP is a well-known example: it aligns an image encoder and a text encoder with exactly this kind of objective. The resulting modality-agnostic representation underpins tasks such as cross-modal retrieval and visual question answering.
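To make the contrastive objective concrete, here is a minimal NumPy sketch of a symmetric InfoNCE loss over a batch of paired embeddings. The function name `info_nce`, the toy random "encoders", and the temperature value are illustrative assumptions, not any specific library's API; real systems would produce the two embedding matrices with learned text and image encoders.

```python
import numpy as np

def info_nce(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text/image embeddings.

    Matched pairs (row i of each matrix) are positives; every other
    pairing in the batch serves as an in-batch negative.
    """
    # Normalize onto the unit sphere so dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)

    logits = t @ v.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))  # positives lie on the diagonal

    def cross_entropy(lg, lb):
        # Numerically stable log-softmax, then pick out the target column.
        shifted = lg - lg.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the text->image and image->text directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
# Toy stand-ins for encoder outputs: 8 pairs of 4-d embeddings.
text = rng.normal(size=(8, 4))
aligned = text + 0.01 * rng.normal(size=(8, 4))  # near-identical positive pairs
loss_aligned = info_nce(text, aligned)
loss_random = info_nce(text, rng.normal(size=(8, 4)))
print(loss_aligned < loss_random)  # aligned pairs yield the lower loss
```

Minimizing this loss pulls matched pairs together and pushes mismatched pairs apart, which is what shapes the shared space; the temperature controls how sharply the softmax concentrates on the hardest negatives.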
