AI Blog

by Michele Laurelli

Multimodal AI

/ˌmʌltɪˈmoʊdəl/
Concept
Definition

AI systems that can process and relate information from multiple modalities, such as text, images, audio, and video.

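Multimodal models learn joint representations across modalities, so that related content from different modalities (for example, a photo and its caption) can be compared or combined. Examples include CLIP (a vision-language model trained to match images with text), Flamingo (a visual language model used for tasks such as visual question answering), and GPT-4V (GPT-4 with image input, used for vision understanding).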
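
To make the idea of a joint representation concrete, here is a minimal sketch of CLIP-style image-to-text matching. It assumes an image and several captions have already been embedded into the same vector space. The embedding values, the captions, the rank_captions helper, and the temperature of 0.1 are all hypothetical, chosen only for illustration; a real system would get its vectors from trained image and text encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical, hand-picked 3-dimensional embeddings standing in for
# the outputs of an image encoder and a text encoder.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.2],
    "a diagram of a circuit": [0.1, 0.2, 0.9],
}

def rank_captions(image_vec, captions, temperature=0.1):
    """Rank captions by similarity to the image in the shared space.

    The temperature is an illustrative value, not one taken from any
    published model.
    """
    texts = list(captions)
    sims = [cosine_similarity(image_vec, captions[t]) for t in texts]
    probs = softmax([s / temperature for s in sims])
    return sorted(zip(texts, probs), key=lambda pair: pair[1], reverse=True)

ranking = rank_captions(image_embedding, caption_embeddings)
best_caption = ranking[0][0]

for text, prob in ranking:
    print(f"{prob:.3f}  {text}")
```

Because both modalities live in one space, matching reduces to a nearest-neighbor comparison: the caption whose vector points in nearly the same direction as the image vector ranks first.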

Examples

1. CLIP matching images to text

2. GPT-4V describing images

3. Image captioning systems

Michele Laurelli - AI Research & Engineering