by Michele Laurelli
A multimodal model trained to understand relationships between images and text.
CLIP (Contrastive Language-Image Pre-training) learns a joint embedding space for images and text via contrastive learning on 400 million image-text pairs: matched image-text pairs are pulled together in the shared space while mismatched pairs are pushed apart. This enables zero-shot image classification and powers many vision-language applications.
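The core of the training objective is a symmetric contrastive loss over a batch of image-text pairs. Below is a minimal sketch of that objective; the encoder outputs are hypothetical stand-ins (random tensors), and only the loss structure is the point: matched pairs sit on the diagonal of the similarity matrix, and cross-entropy is applied along both rows and columns.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product is cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by temperature
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th text: targets are the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Random embeddings stand in for real encoder outputs here
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```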
Typical applications:

- Zero-shot image classification (sketched in the example after this list)
- Image search by text
- Text-to-image generation conditioning
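For zero-shot classification, candidate labels are written as short captions, embedded alongside the image, and ranked by similarity. The sketch below uses the Hugging Face `transformers` implementation with the public `openai/clip-vit-base-patch32` checkpoint; the image path and candidate labels are placeholders.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any local image; path is a placeholder
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Embed the image and all candidate captions in one forward pass
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarities; softmax ranks the labels
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Image search by text follows the same pattern in reverse: embed a corpus of images once, embed the query caption, and rank images by cosine similarity to the query.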