The CLIP model is a neural network developed by OpenAI that learns visual concepts from natural language supervision. It is trained on roughly 400 million image-text pairs collected from the internet, using a contrastive learning objective, specifically a symmetric InfoNCE loss. This objective teaches the model to pull together the vector representations of matching images and text while pushing apart non-matching pairs, creating a unified embedding space where semantically similar concepts are close regardless of modality.
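The contrastive objective described above can be sketched as follows. This is a minimal NumPy illustration, not OpenAI's implementation: it assumes a batch where row `i` of the image embeddings matches row `i` of the text embeddings, and it computes the symmetric cross-entropy over cosine-similarity logits (the InfoNCE form CLIP uses). The function name and the temperature default are illustrative choices.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (n, d); row i of each is a matching pair.
    """
    # Normalize to unit length so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity logits, scaled by temperature;
    # matching pairs lie on the diagonal.
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]

    def cross_entropy_diagonal(l):
        # Log-softmax over each row, with max-subtraction for stability.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits.T))
```

Pulling matched pairs together while pushing mismatched ones apart falls out of the softmax: maximizing the diagonal log-probability in each row simultaneously suppresses every off-diagonal similarity in that row.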
