AI Blog

by Michele Laurelli

Self-Attention: why the idea of "stochastic parrots" no longer holds up

LLM · Architectures · Transformer

"A technical analysis of why the self-attention architecture makes modern LLMs much more than mere "stochastic parrots.""

5 min read

The expression “stochastic parrots” has had its communicative utility: it gave a non-technical audience a quick way to understand that language models generate text through probabilistic correlations learned from vast datasets. But today the metaphor is no longer adequate: not from an architectural perspective, not from a phenomenological one, and above all not from an epistemic one. It does not describe what actually happens inside a self-attention-based system.

The Transformer family has surpassed the level of implicit complexity compatible with the idea of a model that merely reproduces statistical patterns. Continuing to refer to it as “parrots” means describing a 2025 technology with a vocabulary from 2018. The emergent properties of the most advanced LLMs require us to abandon misleading analogies and accurately address what is truly happening.

Self-attention as a structural operator

The self-attention mechanism is not designed to imitate distributions, but to construct structures. Each token, at every layer, re-reads itself in the context of the entire sequence, reevaluates the syntactic and semantic ties that connect it to others, and updates its internal representation based on a dynamic informational topology.
This dynamic allows for the creation of global interpretations that are not contained in the original sequence and cannot be obtained through a simple autoregressive statistical process.
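To make the mechanism concrete, here is a minimal single-head self-attention sketch in NumPy. The shapes, names, and random weights are illustrative assumptions, not the internals of any particular model: the point is only that every token's new representation is computed against the entire sequence.

```python
# Minimal single-head self-attention sketch (illustrative assumptions, not a real model).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token representations; Wq/Wk/Wv: (d_model, d_head) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each token emits a query, a key, and a value
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: a dynamic weighting over the whole sequence
    return weights @ V                          # each token's updated representation mixes the full context

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (5, 8): every token re-read against all the others
```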

A purely statistical model operates through interpolation: it combines what it has seen according to its probability distribution.
The Transformer operates through transformation: it restructures the representation of the sequence into new conceptual objects.
This is where emergence is generated: in the iterated process of internal semantic rewriting, not in the probabilistic dimension of decoding.

Emergence as a property of the dynamic graph

The unexpected capabilities of LLMs are neither magic nor linguistic epiphenomena. They are the direct result of the deep composition of attention layers, the specialization of different heads, and the optimization of non-linear functions that organize representations on high-dimensional latent manifolds.
One layer isolates local relationships, the next recontextualizes them, and another abstracts them into functional patterns. Depth does not simply add “more patterns”: it introduces new internal symmetries and new functions that no dataset explicitly contains.
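A minimal sketch of this composition in depth, using PyTorch's standard modules; the layer count and sizes are arbitrary assumptions, chosen only to show that each layer re-reads and rewrites the output of the previous one.

```python
# Illustrative stacking of attention layers; hyperparameters are arbitrary assumptions.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=128, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # depth = repeated re-representation

x = torch.randn(1, 10, 64)    # one sequence of 10 token vectors
y = encoder(x)                # every layer recontextualizes the whole sequence produced by the layer below
print(y.shape)                # torch.Size([1, 10, 64])
```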

This is why models solve unseen problems, generate new concepts, or perform reasoning in procedural forms. There is no need to hypothesize “strong intelligence”; it is sufficient to recognize that a sufficiently complex computational graph can implement operations akin to forms of inference.

What the “stochastic parrot” metaphor cannot explain

There are observable, empirical, verifiable phenomena that the parrot metaphor fails to justify.
A model that repeats statistical patterns cannot produce valid solutions to problems that do not exist in the dataset. It cannot reformulate concepts into new configurations without having seen them before. It cannot develop mechanisms for internal decomposition of the problem. It cannot establish conceptual invariants over multimodal inputs.
The metaphor fails the moment the model's behavior exceeds what its linguistic distribution could realistically justify.
The very existence of zero-shot capabilities, layered reasoning, and out-of-distribution generalization renders the definition obsolete.

Transformers as systems of differentiable computation

Transformers are not statistical models disguised as language models. They are systems of differentiable computation, capable of implementing complex functions that emerge from the composition of layers. With sufficient depth, precision, and parameterization, the architecture is Turing-complete and thus capable of representing transformations of an algorithmic nature.
Self-attention acts as an abstract parsing operator, constructing functional structures that have no direct counterparts in the dataset.
The model does not merely predict “the next most probable token”: it interprets, transforms, reconstructs, and updates the internal semantic space to converge towards an output consistent with a learned logical structure.
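A toy sketch helps fix what “algorithmic” means here: with a sharp enough softmax, a single attention step behaves like a differentiable key-value lookup, a small retrieval operation rather than a recalled surface pattern. The code and numbers below are purely illustrative assumptions.

```python
# Toy illustration: attention as a differentiable key-value lookup.
import numpy as np

def attention_lookup(query, keys, values, temperature=0.1):
    """Soft lookup: as the temperature drops, the softmax approaches an exact retrieval."""
    scores = keys @ query / temperature
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

keys = np.eye(4)                              # four distinct "addresses"
values = np.array([10.0, 20.0, 30.0, 40.0])   # the content stored at each address
query = keys[2]                               # ask for address 2
print(attention_lookup(query, keys, values))  # ~30.0: attention retrieves the stored value
```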

To say that such a system “repeats patterns” is like saying that a compiler “copies text”: a simplification that becomes false as soon as one observes the internal behavior.

Why the metaphor survives

The cultural survival of the stochastic parrot arises from the superficial similarity between the language generated by an LLM and human language. If the output is language, then it seems natural to believe that the process is also linguistic. But the Transformer is not a linguistic processor: it is a representation transformer.
The probabilistic generation of output obscures the deterministic and structural nature of the internal mechanism, leading the observer to think that probability is “all there is.” In reality, probability is just the last centimeter of a deeply computational path.
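A minimal sketch, with made-up sizes, of where the stochasticity actually sits: everything up to the logits is a deterministic forward pass through the stacked layers, and only the final sampling step draws from a distribution.

```python
# Illustrative "last centimeter": deterministic computation up to the logits, one stochastic draw at the end.
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=64)              # stand-in for the deterministic output of many layers
W_vocab = rng.normal(size=(64, 1000))     # projection to a hypothetical 1000-token vocabulary

logits = hidden @ W_vocab                 # deterministic
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax over the vocabulary
next_token = rng.choice(len(probs), p=probs)   # the only probabilistic operation in the pipeline
print(next_token)
```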

To seriously discuss AI in 2025, we need to change our vocabulary. Journalistic metaphors have had their time; now more precise concepts are needed.
Self-attention is not a statistical device that imitates texts: it is a structural operator that builds abstract representations.
LLMs do not work because they are “very large”: they work because the architecture allows for the formation of internal structures that emerge from non-linear dynamics that are difficult to reduce to past concepts.

We do not need exaggerations or minimizations.
What is needed is a description that adheres to the facts: Transformer models are not stochastic parrots. They are systems of differentiable computation that have introduced a new, powerful form of knowledge representation.

— ✦ —