Multi-modal retrieval is the process of searching a unified index to find relevant information across diverse data types—such as text, images, audio, and video—based on a query that can itself be in any of those modalities. This is achieved by mapping all data into a shared, high-dimensional embedding space where semantic similarity can be measured using metrics like cosine similarity. The core challenge is creating cross-modal embeddings that allow a text query, for example, to find semantically related images or audio clips.
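The idea of ranking heterogeneous items against a query in a shared embedding space can be sketched in a few lines. The following is a minimal illustration using hand-made 4-dimensional vectors; real systems would obtain embeddings from a cross-modal encoder (e.g. a CLIP-style model) with hundreds of dimensions, and the item names here are purely hypothetical.

```python
import numpy as np

def cosine_similarities(query, index):
    # Cosine similarity between one query vector and each row of an index matrix.
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    return m @ q

# Toy "unified index": items of different modalities, all embedded
# into the same (here: 4-D, illustrative) vector space.
items = ["dog photo (image)", "barking clip (audio)", "quarterly report (text)"]
index = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.8, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.2],
])

# A text query, e.g. "a dog", embedded into the same space.
query = np.array([0.85, 0.15, 0.05, 0.05])

scores = cosine_similarities(query, index)
ranked = [items[i] for i in np.argsort(scores)[::-1]]
print(ranked[0])  # the semantically closest item, regardless of its modality
```

Because similarity is computed in the shared space, the text query retrieves the image and audio items ahead of the unrelated text document; the modality of each item never enters the scoring.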
