by Michele Laurelli
A masking technique that prevents attention to future positions in autoregressive models.
It ensures that the token at position i can attend only to positions ≤ i, which is essential for GPT-style decoders. It is typically implemented as a lower-triangular mask that sets attention scores for future positions to −∞ before the softmax.
Typical uses: GPT autoregressive generation, decoder self-attention, preventing future leakage.
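A minimal sketch of the technique, assuming PyTorch; the function name causal_self_attention, the single-head setup, and the toy dimensions are illustrative assumptions, not part of the original entry:

```python
import math
import torch

def causal_self_attention(x: torch.Tensor) -> torch.Tensor:
    """Toy single-head self-attention: queries, keys, and values are
    the input itself, and position i attends only to positions <= i."""
    seq_len, d = x.shape
    # Raw attention scores, scaled as in standard dot-product attention.
    scores = (x @ x.transpose(0, 1)) / math.sqrt(d)  # (seq_len, seq_len)
    # Lower-triangular boolean mask: True where attention is allowed (j <= i).
    allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Future positions get -inf so softmax assigns them zero weight.
    scores = scores.masked_fill(~allowed, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ x

x = torch.randn(5, 8)           # 5 tokens, embedding size 8
out = causal_self_attention(x)  # out[i] depends only on x[:i + 1]
```

Setting masked scores to −∞ rather than zero is the key detail: after the softmax, disallowed positions receive exactly zero weight, so no information from future tokens leaks into the output.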