AI Blog

by Michele Laurelli

ViT (Vision Transformer)

Architecture
Definition

Transformer architecture adapted for computer vision by treating image patches as tokens.

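A ViT splits the image into fixed-size patches (commonly 16x16 pixels), flattens each patch, and linearly projects it to an embedding vector. Positional encodings are added so the model knows where each patch came from, and the resulting token sequence is processed by a pure transformer with no convolutions. Because it lacks the built-in locality assumptions of convolutions, it generally requires large training datasets to perform well.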
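The patch-to-token step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names (patchify, embed_patches), the 224x224 input size, the 768-dimensional embedding, and the random weights are all assumptions chosen for the example. In a real model the projection and positional encodings are learned, and any extra tokens the model uses are left out here.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(image, proj, pos_embed, patch_size=16):
    """Project flattened patches to tokens and add positional encodings."""
    tokens = patchify(image, patch_size) @ proj   # (num_patches, embed_dim)
    return tokens + pos_embed                     # same shape

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
embed_dim = 768
proj = rng.normal(scale=0.02, size=(16 * 16 * 3, embed_dim))
pos_embed = rng.normal(scale=0.02, size=((224 // 16) ** 2, embed_dim))

tokens = embed_patches(image, proj, pos_embed)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, one token each
```

With these sizes, a 224x224 image becomes a sequence of 196 tokens, which is what the transformer layers then operate on.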

Examples

1. Image classification
2. Alternative to CNNs
3. ViT-Base, ViT-Large

Michele Laurelli - AI Research & Engineering