Cross-modal embedding is a machine learning technique that maps data from different modalities—such as text, images, audio, and video—into a shared vector space. In this unified space, semantically similar concepts are positioned close together regardless of their original format, enabling direct comparison and retrieval across data types. This is the core mechanism behind systems like CLIP and is essential for agentic memory that can store and recall multimodal experiences.
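The idea can be sketched in a few lines: once a text encoder and an image encoder project into the same vector space, cross-modal retrieval reduces to nearest-neighbor search by cosine similarity. The vectors below are hand-made toy values, not outputs of a real model like CLIP, and the captions are purely illustrative.

```python
import numpy as np

# Toy shared 4-dimensional space. In a real system (e.g. CLIP), these
# vectors would come from separate text and image encoders trained so
# that matching pairs land close together.
text_embeddings = {
    "a photo of a dog": np.array([0.9, 0.1, 0.0, 0.1]),
    "a photo of a cat": np.array([0.1, 0.9, 0.1, 0.0]),
    "a city skyline":   np.array([0.0, 0.1, 0.9, 0.2]),
}

# Pretend embedding of a dog image, produced by the image encoder.
image_embedding = np.array([0.85, 0.15, 0.05, 0.1])

def cosine_similarity(a, b):
    """Similarity of two vectors, independent of their magnitudes."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Cross-modal retrieval: rank every caption against the image and
# return the closest one in the shared space.
best_caption = max(
    text_embeddings,
    key=lambda t: cosine_similarity(text_embeddings[t], image_embedding),
)
print(best_caption)  # the dog caption ranks highest
```

Because both modalities live in one space, the same comparison works in either direction: an image can retrieve captions, and a text query can retrieve images, with no modality-specific matching logic.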
