by Michele Laurelli
A multimodal model trained to understand relationships between images and text.
CLIP (Contrastive Language-Image Pre-training) learns a joint embedding space for images and text via contrastive learning on 400 million image-text pairs: matched image-text pairs are pulled together in the shared space while mismatched pairs are pushed apart. This enables zero-shot image classification and powers many vision-language applications.
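The core of the training objective is a symmetric contrastive loss over a batch of image-text pairs. Below is a minimal sketch of that objective; the encoder outputs are hypothetical stand-ins (random tensors), and only the loss structure is the point: matched pairs sit on the diagonal of the similarity matrix, and cross-entropy is applied along both rows and columns.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product is cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by temperature
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th text: targets are the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Random embeddings stand in for real encoder outputs here
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```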
Typical applications:

- Zero-shot image classification (sketched in the example after this list)
- Image search by text
- Text-to-image generation conditioning
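For zero-shot classification, candidate labels are written as short captions, embedded alongside the image, and ranked by similarity. The sketch below uses the Hugging Face `transformers` implementation with the public `openai/clip-vit-base-patch32` checkpoint; the image path and candidate labels are placeholders.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any local image; path is a placeholder
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Embed the image and all candidate captions in one forward pass
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarities; softmax ranks the labels
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Image search by text follows the same pattern in reverse: embed a corpus of images once, embed the query caption, and rank images by cosine similarity to the query.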