by Michele Laurelli
The similarity computed between a query vector and a key vector before softmax normalization in attention.
Score = dot_product(Query, Key) / sqrt(d_k). Scaling by sqrt(d_k) keeps the dot products from growing with the vector dimension and pushing the softmax into regions with vanishing gradients. Higher scores indicate greater relevance.
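The formula above can be sketched in a few lines of NumPy; the `query` and `keys` values here are purely illustrative:

```python
import numpy as np

d_k = 4  # dimensionality of the key vectors
query = np.array([1.0, 0.5, -0.5, 2.0])
keys = np.array([[1.0, 0.0, 0.0, 1.0],
                 [0.5, 0.5, 0.5, 0.5]])

# Score = dot_product(Query, Key) / sqrt(d_k), one score per key
scores = keys @ query / np.sqrt(d_k)
```

Each entry of `scores` is the scaled similarity between the query and one key; these raw scores are what the softmax later turns into attention weights.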
Scaled dot-product attention
Query-key similarity
Attention computation
A technique that allows a model to focus on the most relevant parts of the input when producing each element of the output.
The three vectors (Query, Key, Value) used in attention mechanisms: queries are compared against keys to produce weights, which then form weighted combinations of the value vectors.
An activation function that converts a vector of values into a probability distribution summing to 1.
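A minimal softmax sketch in NumPy; the input values are illustrative, and the max is subtracted purely for numerical stability (it does not change the result):

```python
import numpy as np

def softmax(x):
    # Subtracting the max avoids overflow in exp() without changing the output.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Turn raw attention scores into a probability distribution over keys.
probs = softmax(np.array([1.5, 0.75]))
```

The outputs are non-negative and sum to 1, so they can be used directly as attention weights.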