Modality alignment is the process of ensuring that representations from different data types—such as text, images, audio, and video—correspond to the same semantic concepts within a shared latent space. This is typically achieved through training objectives like contrastive learning (e.g., InfoNCE loss) or supervised learning on paired data, forcing embeddings of related concepts (like "dog" in text and a picture of a dog) to be close together in the vector space while pushing unrelated ones apart. The resulting unified embedding space enables cross-modal retrieval, translation, and reasoning.
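The contrastive objective described above can be sketched as a symmetric InfoNCE loss over a batch of paired embeddings. This is a minimal NumPy illustration, not any particular model's implementation: the function names, the temperature value, and the toy data are all assumptions made for the example. Matched text–image pairs sit on the diagonal of the similarity matrix and act as positives; all other entries in the same row or column act as negatives.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    Row i of `text_emb` and row i of `image_emb` are assumed to describe
    the same concept (a positive pair); all other pairings are negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    # Pairwise similarities, scaled by temperature; diagonal = positives.
    logits = (t @ v.T) / temperature
    idx = np.arange(len(t))
    # Cross-entropy in both directions: text->image and image->text.
    loss_t2v = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_v2t = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_t2v + loss_v2t) / 2

# Toy demonstration: when paired embeddings already point the same way,
# the loss is lower than when the pairing is scrambled.
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 32))
aligned_loss = info_nce_loss(text, text + 0.01 * rng.normal(size=(8, 32)))
shuffled_loss = info_nce_loss(text, np.roll(text, 1, axis=0))
```

Minimizing this loss pulls each matched pair together (the diagonal term) while pushing apart the mismatched pairs in the batch, which is what shapes the shared embedding space the paragraph describes.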
