by Michele Laurelli
Transformer architecture adapted for computer vision by treating image patches as tokens.
Divides an image into fixed-size patches (e.g., 16×16 pixels), flattens and linearly projects each patch, and adds positional encodings before feeding the resulting token sequence to a standard transformer encoder. Pure transformer, with no convolutional layers. Requires large training datasets to perform well.
Applications: Image classification
Positioning: Alternative to CNNs
Variants: ViT-Base, ViT-Large
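For illustration, a minimal PyTorch sketch of that pipeline: split the image into 16×16 patches, project each patch to an embedding, prepend a learnable class token, add positional embeddings, and run a standard transformer encoder. The hyperparameter values roughly follow ViT-Base (768-dim embeddings, 12 layers, 12 heads); this is an illustrative sketch under those assumptions, not the reference implementation, and the strided convolution is simply a compact equivalent of flattening each patch and applying a shared linear projection.

```python
import torch
import torch.nn as nn


class ViT(nn.Module):
    """Minimal ViT-style classifier sketch (hyperparameters roughly ViT-Base)."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 num_classes=1000, embed_dim=768, depth=12, num_heads=12,
                 mlp_dim=3072):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2

        # Patch embedding: a conv with stride == kernel == patch size is
        # equivalent to flattening each patch and applying a linear projection.
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)

        # Learnable [CLS] token and positional embeddings (one per patch + CLS).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

        # Standard transformer encoder (pre-norm, GELU, as in the ViT paper).
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images):                  # images: (B, C, H, W)
        x = self.patch_embed(images)            # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D): sequence of patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend the CLS token
        x = x + self.pos_embed                  # add positional encoding
        x = self.encoder(x)                     # transformer over the token sequence
        return self.head(self.norm(x[:, 0]))    # classify from the CLS token


# Usage: 224x224 RGB images -> class logits.
model = ViT()
logits = model(torch.randn(2, 3, 224, 224))     # shape (2, 1000)
```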