CLIP (Contrastive Language-Image Pre-training) is a multimodal neural network developed by OpenAI that learns a joint embedding space for images and their corresponding text captions. It is trained with a contrastive learning objective on roughly 400 million internet-sourced image-text pairs: the model learns to pull matching image-text pairs together in the shared vector space while pushing non-matching pairs apart. The result is aligned representations in which semantically similar concepts from either modality lie close together.
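The symmetric contrastive objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not OpenAI's implementation: the function name, the fixed temperature value, and the use of plain arrays (rather than learned encoders and a trainable temperature) are all simplifications for clarity.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) arrays where row i of each is a
    matching image-text pair; every other row in the batch is a negative.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix: logits[i, j] = sim(image i, text j).
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy(l):
        # Softmax cross-entropy where the correct class for row i is column i.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # Average the two directions: match each image to its text, and vice versa.
    loss_img_to_txt = cross_entropy(logits)
    loss_txt_to_img = cross_entropy(logits.T)
    return (loss_img_to_txt + loss_txt_to_img) / 2
```

Minimizing this loss is what "pulls matching pairs together and pushes non-matching pairs apart": the diagonal of the similarity matrix (true pairs) is driven up relative to the off-diagonal entries (mismatched pairs within the batch).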
