Generative engines decide who you are. Tell them yourself
"The deductions made by models can cause significant harm to personal identities as well as those of companies and products."
The mechanism that changes the rules
For a few weeks now, I have been working with data that has reshaped my priorities. According to Cloudflare Radar, training crawlers account for 49.9% of all AI traffic to websites, and Applebot has grown by 124% in a single month. Training alone is already half of the AI traffic arriving today.
Google-Agent is Google's new user-triggered fetcher: user-initiated requests do not follow the same logic as autonomous crawlers, and robots.txt does not necessarily apply to this category. Anyone managing AIO infrastructure needs to update their mental model of robots.txt.
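The distinction matters in practice because robots.txt rules are evaluated per user-agent token, not per company. A minimal sketch with Python's standard library, using invented bot tokens (they are not an official list):

```python
# Sketch: a rule that blocks a training crawler says nothing about a
# user-triggered fetcher with a different token. "ExampleTrainingBot"
# and "ExampleUserFetcher" are hypothetical names for illustration.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The training crawler is blocked everywhere...
print(parser.can_fetch("ExampleTrainingBot", "/docs/page"))  # False

# ...but a fetcher with a different token falls under the wildcard
# group and is allowed; some user-triggered fetchers may not consult
# robots.txt at all.
print(parser.can_fetch("ExampleUserFetcher", "/docs/page"))  # True
```

And even a correct per-token block remains a request, not an enforcement mechanism.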
The crux is how models build their representation of an entity. Their logic is not that of a relational database: they build connections based on proximity, not truth. If a product and a competitor often appear in the same contexts, the model may associate them. If the 2021 documentation is linked more often than the 2024 version, it may weigh more in the synthesis. If a source is the only one available on a specific feature, it tends to become dominant even when it is old or incomplete.
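A toy sketch of that proximity logic, with invented product names: a system that only counts how often two entities appear in the same passage will bind a product most strongly to its competitor, because that is the pair mentioned together most often.

```python
# Toy illustration: association by co-occurrence, not by truth.
# Entity names are hypothetical.
from itertools import combinations
from collections import Counter

passages = [
    "AcmeCRM and RivalCRM are the two main options in this space.",
    "Many teams compare AcmeCRM with RivalCRM before choosing.",
    "AcmeCRM integrates with PipelineTool.",
]

entities = ["AcmeCRM", "RivalCRM", "PipelineTool"]
cooccurrence = Counter()
for text in passages:
    present = [e for e in entities if e in text]
    for a, b in combinations(sorted(present), 2):
        cooccurrence[(a, b)] += 1

# The strongest link is product <-> competitor, purely from proximity.
print(cooccurrence.most_common(1))  # [(('AcmeCRM', 'RivalCRM'), 2)]
```

Real models work on embeddings rather than raw counts, but the failure mode is the same: frequency of shared context stands in for a relationship that nobody declared.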
A generative system reads a page differently from how a browser renders it: it extracts the raw content and passes it to the model. In that transition, elements invisible to the user, such as comments in the code, accessibility attributes, and hidden metadata, become visible to the model and, if built ad hoc, can influence it or be exploited. It is a documented attack surface, not a hypothetical one.
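To see what such an extractor surfaces, here is a minimal sketch with the standard-library HTML parser over a hypothetical page fragment. Everything a browser hides, the comment, the aria-label, the meta content, lands in the same text stream as the visible copy:

```python
# Sketch: raw-content extraction makes browser-invisible elements visible.
# The page fragment and its contents are invented for illustration.
from html.parser import HTMLParser

PAGE = """
<div aria-label="Category: project management software">
  <!-- internal note: positioning vs RivalCRM pending legal review -->
  <meta name="description" content="AcmeCRM, the CRM for small teams">
  <p>Welcome to AcmeCRM.</p>
</div>
"""

class RawExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.extracted = []
    def handle_comment(self, data):
        # never rendered by a browser, but present in the raw stream
        self.extracted.append(data.strip())
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("aria-label", "content"):  # hidden attributes
                self.extracted.append(value)
    def handle_data(self, data):
        if data.strip():
            self.extracted.append(data.strip())

extractor = RawExtractor()
extractor.feed(PAGE)
print(extractor.extracted)
```

An internal comment left in production markup here ends up on equal footing with the visible paragraph, which is exactly why this layer is an attack surface.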
Why SEO serves as a foundation, but not a roof
Traditional SEO is still necessary because without solid organic positioning, generative engines cannot find the material from which to synthesize. However, Googlebot and generative engines look for different things. Googlebot indexes pages and ranks them for relevance to textual queries. Generative engines seek declared semantic identity: who is this subject, what do they do, who are they connected to, how do they want to be represented, which sources about them are reliable (according to the bot itself), and how they relate to each other.
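One concrete way to declare that semantic identity rather than let it be inferred is schema.org structured data embedded as JSON-LD. A sketch, with placeholder names and URLs:

```python
# Sketch: declaring identity, relationships, and authorship explicitly.
# All names, people, and URLs below are placeholders for illustration.
import json

identity = {
    "@context": "https://schema.org",
    "@type": "SoftwareApplication",
    "name": "AcmeCRM",
    "applicationCategory": "BusinessApplication",
    "description": "CRM for small teams, focused on pipeline automation.",
    "publisher": {
        "@type": "Organization",
        "name": "Acme Srl",
        "founder": {"@type": "Person", "name": "Jane Doe"},
        "sameAs": ["https://www.example.com/about"],
    },
}

snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(identity, indent=2)
    + "\n</script>"
)
print(snippet)
```

The point is not this particular vocabulary but the shift: founder, category, and canonical references are stated in a machine-readable format instead of being reconstructed from co-occurrence.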
In working on these cases, the errors I find are quite systematic. Undisambiguated entities, with a SaaS confused with a competitor because neither has declared the differences in a format that models treat as authoritative. Relationships built on co-occurrence, with founders attributed to the wrong products and partnerships inferred from nearby mentions in the same article. Obsolete features weighing as much as current ones because old documentation accumulates more backlinks than new pages do. None of these errors is extravagant: they follow the logic of a system that does not know reality, only the available data and how they converge.
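The obsolete-documentation case reduces to a simple scoring problem. A sketch with invented numbers and invented weights: if a retrieval layer scores only on link popularity, the 2021 page dominates; add any freshness term and the ranking flips.

```python
# Toy scoring sketch. Backlink counts and weights are invented for
# illustration, not taken from any real system.
docs = [
    {"title": "Feature guide (2021)", "year": 2021, "backlinks": 180},
    {"title": "Feature guide (2024)", "year": 2024, "backlinks": 25},
]

def popularity_score(doc):
    return doc["backlinks"]  # no recency term at all

def freshness_aware_score(doc):
    # hypothetical weighting: damp the link signal, reward recency
    return doc["backlinks"] * 0.5 + (doc["year"] - 2020) * 40

print(max(docs, key=popularity_score)["title"])       # Feature guide (2021)
print(max(docs, key=freshness_aware_score)["title"])  # Feature guide (2024)
```

You do not control the engine's weighting, but you do control whether the current page exists, is linked, and is declared as the canonical version.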
There are technical architectures to communicate identity to models, some already in use and others still being defined in the industry. In my work, I have been implementing them for some time, often on standards that the market is still discussing, using Strategic Metaprompting to analyze how models read and represent an entity, and orient the infrastructure accordingly. For many Italian tech products, this infrastructure is still absent or embryonic. The market has focused on what Google has made measurable for years, while generative engines use different metrics.
Competing standards and lying bots
Those working on this front encounter an additional problem: the territory is not only new, it is unstable. Standards are not unified, big tech companies are each pushing in their own direction, and the landscape changes every month.
For example, Google is experimenting with web-bot-auth, an IETF draft that uses cryptographic signatures to authenticate bots instead of relying on easily spoofable User-Agent strings. AWS has already announced experimental support, and other operators are moving in the same direction. It is a protocol to monitor, not yet one to implement in production, but it could change how bot access is managed at the server level.

Not all bots wait for standards. According to traffic analyses published by Cloudflare and other operators, some crawlers have been observed using strings that mimic mobile browsers or bypassing blocks declared in robots.txt. Robots.txt is a legal signal, not a technical control, and real enforcement requires a different layer.
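The core idea of web-bot-auth can be sketched in a few lines: trust a signature over request material, not the User-Agent string. The actual draft builds on HTTP Message Signatures (RFC 9421) with asymmetric keys and published key directories; HMAC with a shared key is used here only to keep the example self-contained, so this is a simplification of the mechanism, not the protocol itself.

```python
# Simplified stand-in for the web-bot-auth idea: a verifier checks a
# signature over covered request components instead of trusting the
# User-Agent header. Key and header set are invented for illustration.
import hashlib
import hmac

SHARED_KEY = b"demo-key-not-a-real-deployment"

def signature_base(headers: dict) -> bytes:
    # Real implementations sign a structured signature base; a sorted
    # header dump is enough to illustrate the principle.
    return "\n".join(
        f"{k.lower()}: {v}" for k, v in sorted(headers.items())
    ).encode()

def sign(headers: dict) -> str:
    return hmac.new(SHARED_KEY, signature_base(headers), hashlib.sha256).hexdigest()

def verify(headers: dict, signature: str) -> bool:
    return hmac.compare_digest(sign(headers), signature)

request = {"Host": "example.com", "User-Agent": "LegitBot/1.0"}
sig = sign(request)
print(verify(request, sig))             # True

# A spoofer can copy the User-Agent string, but without the key it
# cannot produce a valid signature.
print(verify(request, "deadbeef" * 8))  # False
```

The enforcement shift is the point: the server stops asking "what does this bot claim to be?" and starts asking "can it prove it?".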
Google's A2A and Anthropic's MCP, both governed by the Linux Foundation, are establishing themselves as references for agent infrastructure, but the landscape remains fragmented and evolving. Those working on AIO and GEO do not implement once and move on; they maintain an infrastructure that shifts beneath their feet, week after week. And yes, that means staying hands-on, returning again and again to the pages you prepare.
This is the real cost of being a pioneer. Most of the market waits for something to solidify. But when it solidifies, the ground is already occupied by those who built first, with declared entities, structured relationships, infrastructures that models already know how to read.
Those who still do not know what AIs say about their product, how they describe it, and whether the relationships they report are correct are letting others decide. And often, those others are not human.
