AI Blog

by Michele Laurelli

CLIP (Contrastive Language-Image Pre-training)

/klɪp/
Model
Definition

A multimodal model trained to understand relationships between images and text.

CLIP learns a joint embedding space for images and text through contrastive learning on 400 million image-text pairs. Because any class label can be phrased as a text prompt, it enables zero-shot image classification and powers many vision-language applications.
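The contrastive objective can be sketched in a few lines. This is a minimal NumPy illustration, not CLIP's actual training code: the embeddings here stand in for the outputs of learned image and text encoders, and real CLIP uses a learnable temperature rather than the fixed value assumed below.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) arrays where row i of each is a matched
    image-text pair.
    """
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by temperature
    logits = image_emb @ text_emb.T / temperature

    # The matching text for image i sits at column i
    labels = np.arange(len(logits))

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
```

Minimizing this loss pulls each matched image-text pair together in the shared space while pushing the other pairs in the batch apart.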

Examples

1. Zero-shot image classification
2. Image search by text
3. Text-to-image generation conditioning
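The first example above reduces to a nearest-neighbor lookup in the shared embedding space. A minimal sketch, assuming the image and per-class prompt embeddings (e.g. for "a photo of a {class}") have already been produced by the encoders:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Pick the class whose prompt embedding is most similar to the image.

    image_emb: (dim,) vector from the image encoder.
    class_text_embs: (num_classes, dim) embeddings of text prompts,
    one per candidate class.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True
    )
    sims = class_text_embs @ image_emb           # cosine similarities
    probs = np.exp(sims) / np.exp(sims).sum()    # softmax over classes
    return class_names[int(np.argmax(probs))], probs
```

No classifier is trained: swapping in a new set of class prompts is enough to classify against new labels, which is what makes the approach zero-shot.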

Michele Laurelli - AI Research & Engineering