by Michele Laurelli
Transformer architecture adapted for computer vision by treating image patches as tokens.
Divides an image into fixed-size patches (e.g., 16×16 pixels), flattens and linearly projects each patch, and adds positional encodings before feeding the resulting token sequence to a standard transformer encoder. Pure transformer, with no convolutional layers. Requires large training datasets to perform well.
Applications: Image classification
Positioning: Alternative to CNNs
Variants: ViT-Base, ViT-Large
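For illustration, a minimal PyTorch sketch of that pipeline: split the image into 16×16 patches, project each patch to an embedding, prepend a learnable class token, add positional embeddings, and run a standard transformer encoder. The hyperparameter values roughly follow ViT-Base (768-dim embeddings, 12 layers, 12 heads); this is an illustrative sketch under those assumptions, not the reference implementation, and the strided convolution is simply a compact equivalent of flattening each patch and applying a shared linear projection.

```python
import torch
import torch.nn as nn


class ViT(nn.Module):
    """Minimal ViT-style classifier sketch (hyperparameters roughly ViT-Base)."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 num_classes=1000, embed_dim=768, depth=12, num_heads=12,
                 mlp_dim=3072):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2

        # Patch embedding: a conv with stride == kernel == patch size is
        # equivalent to flattening each patch and applying a linear projection.
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)

        # Learnable [CLS] token and positional embeddings (one per patch + CLS).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

        # Standard transformer encoder (pre-norm, GELU, as in the ViT paper).
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images):                  # images: (B, C, H, W)
        x = self.patch_embed(images)            # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D): sequence of patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend the CLS token
        x = x + self.pos_embed                  # add positional encoding
        x = self.encoder(x)                     # transformer over the token sequence
        return self.head(self.norm(x[:, 0]))    # classify from the CLS token


# Usage: 224x224 RGB images -> class logits.
model = ViT()
logits = model(torch.randn(2, 3, 224, 224))     # shape (2, 1000)
```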