by Michele Laurelli
AI systems that can process and relate information across multiple modalities, such as text, images, audio, and video.
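Multimodal models learn joint representations across modalities, so that content in one modality (an image, say) can be related to content in another (such as a caption). Examples include CLIP (contrastive image-text matching), Flamingo (a few-shot visual language model used for tasks such as visual question answering), and GPT-4V (GPT-4 with image input).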
CLIP matching images to text
GPT-4V describing images
Image captioning systems
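
To make the idea of matching images to text concrete, here is a minimal sketch of CLIP-style matching. It assumes the image and the candidate captions have already been encoded into vectors in a shared space. The vectors below are hand-made toy values, not outputs of any real model, and the function names (`cosine_similarity`, `match_image_to_text`) are hypothetical helpers written for this illustration. The temperature of 0.07 is borrowed from the value CLIP uses to initialize its learned temperature; treat it as an assumption here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores, temperature=1.0):
    """Turn similarity scores into probabilities (lower temperature = sharper)."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def match_image_to_text(image_embedding, text_embeddings, temperature=0.07):
    """Return the index of the best-matching caption and all match probabilities."""
    similarities = [cosine_similarity(image_embedding, t) for t in text_embeddings]
    probabilities = softmax(similarities, temperature)
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best_index, probabilities


# Toy embeddings (assumed, for illustration only). A real system would
# obtain these from trained image and text encoders.
captions = ["a photo of a dog", "a photo of a cat", "a diagram of a circuit"]
text_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
]
image_embedding = [0.8, 0.2, 0.1]

best_index, probabilities = match_image_to_text(image_embedding, text_embeddings)
print(captions[best_index])
```

The design choice worth noting is that the image never needs a fixed label set: any list of captions can be scored against it, which is what lets this style of model be used for zero-shot classification.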