Translating Cobol is not translation: the AI architecture behind legacy code modernization
"Why the conversion of legacy applications is not a matter of linguistic translation but of semantic reconstruction, and how we addressed the issue at Scriba."
There is a claim that has been circulating insistently over the past year, and it deserves to be defused before anything else: modern LLMs are now capable of converting legacy code into modern code. It is the implicit premise of every demo showing GPT or Claude transforming a block of Cobol into functioning Python in three seconds. It is the assumption behind Anthropic's announcement in February 2026, which caused IBM to lose 13% of its stock value in a single day.
The claim is not false. It is incomplete in a way that makes it, operationally, untrue.
A generalist LLM translates a block of Cobol as it translates a poem from French to English: preserving the literal meaning line by line, systematically losing what that block means within a three-hundred-thousand-line application written by twenty different developers over thirty years, with implicit dependencies on JCL systems, calls to subprograms for which there is no longer any documentation, hardcoded constants that encode tax regulations from 1994 still applicable to a subset of clients. The result is plausible, compilable code, superficially correct — and functionally dangerous.
The problem of modernizing legacy code is not a translation problem. It is a problem of semantic archaeology: reconstructing the intention of a system from the only artifact left, the code itself, when all others (specifications, documentation, authors, organizational context) have been lost.
This article describes the architecture we have built at Scriba to address this problem. It is not a product presentation. It is a reflection on what is structurally needed for an AI system to reliably convert legacy code and why a single frontier model, no matter how capable, is not the right architecture.
The problem, in its real form
Before discussing solutions, it is worth clarifying what legacy code really means. In the strict technical sense, it is code written in now marginal languages — Cobol, RPG, Fortran, PL/I, Natural, IBM Assembler, Pro*C — that continues to perform business-critical functions. In the operational sense, it is something more: it is code whose context has been lost.
A banking application in Cobol typically exhibits all these characteristics together:
Historical stratification: portions written in the 1980s, patched in the 1990s, extended in the 2000s, each layer with different conventions and comments in the languages and styles of those who worked during that period.
Opaque dependencies: calls to external programs identified only by numeric codes, control files read from hardcoded paths, JCL environment variables that modify behavior in undocumented ways.
Implicit business logic: rules that are not written anywhere in the specifications — because the specifications no longer exist, or never existed — but emerge from the interaction of dozens of conditions scattered throughout the code.
Deliberate but obscure optimizations: idiomatic patterns of the language (the use of `REDEFINES` in Cobol, `OCCURS DEPENDING ON`, packed decimal fields) that only make sense if one knows the philosophy of the language and the constraints of the original hardware.
Taking a source file of this kind and feeding it to an LLM with the prompt "convert this to Java" is an operation that produces syntactically valid output with high probability and semantically correct output with a probability that cannot be estimated. And in the domains where Scriba operates (banking, insurance, logistics, defense), "cannot be estimated" is a functional synonym for "unacceptable".
The solution, therefore, is not a bigger model. It is an architecture that makes the problem tractable.
Why a single LLM is not enough
There is a temptation, in the face of a complex problem involving natural language and programming language, to think that a sufficiently capable model is all that is needed. It is the same temptation that has dominated the debate on AI agents for years: if GPT-4 can't do it, GPT-5 will. The last two years of applied work on production agent systems have shown that this is the wrong direction, and the reasons are technical, not ideological.
The first problem is context. A real legacy application does not fit within a context window. A single medium-sized Cobol module runs around two thousand to five thousand tokens; a complete application can easily exceed a million. Even with the long-context models of 2026, the degradation of attention on extended inputs (the "lost in the middle" phenomenon, now documented in dozens of papers) makes a monolithic approach impractical.
The second problem is specialization. The sub-tasks involved in code conversion are qualitatively different from one another: identifying the syntactic structure of a Cobol file, reconstructing the control flow, mapping data structures to modern equivalents, detecting peculiar idioms of the source language, writing idiomatic code in the target language, generating tests that verify behavioral equivalence. No single model is optimal for all these tasks. A generalist LLM is a good target code generator, but it is a mediocre Cobol parser because Cobol tokens are underrepresented in the training and its idioms are, for the LLM, statistical noise.
The third problem is verifiability. A monolithic system produces a single opaque output: it works, or it doesn't work. A structured pipeline produces inspectable intermediate outputs (intermediate representations, reconstructed documentation, explicit assertions on function contracts) that can be validated separately, corrected when wrong, and that allow for damage containment when something goes wrong.
For these three reasons, Scriba's architecture is not a model. It is a system.
Architecture: the pipeline
Scriba is structured as a ten-stage pipeline, each of which encapsulates a well-defined transformation on the artifact being processed. The stages are sequential in the sense of the main flow, but each has the ability to request a return to previous stages when it detects conditions of uncertainty. It is, in this sense, a pipeline with controlled backtracking, not a pure DAG.
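The control flow described above can be sketched in a few lines. This is a minimal illustration of a sequential pipeline with controlled backtracking, not Scriba's actual implementation: the `Backtrack` signal, the retry limit, and the stage interface are all hypothetical simplifications (a real system would also snapshot the artifact at each stage before allowing a jump backward).

```python
class Backtrack(Exception):
    """Raised by a stage to request a controlled return to an earlier stage."""
    def __init__(self, to_stage: int, reason: str):
        self.to_stage = to_stage
        self.reason = reason

def run_pipeline(artifact, stages, max_retries=3):
    """Run stages in order; on Backtrack, resume from the requested stage.

    The jump target is explicit and bounded, so this is a pipeline with
    controlled backtracking rather than a free-form loop or a pure DAG.
    """
    i, retries = 0, 0
    while i < len(stages):
        try:
            artifact = stages[i](artifact)
            i += 1
        except Backtrack as b:
            retries += 1
            if retries > max_retries:
                raise RuntimeError(f"unresolved after {max_retries} retries: {b.reason}")
            i = b.to_stage  # explicit jump back to an earlier stage
    return artifact
```

The key property is that a stage cannot loop arbitrarily: it can only name an earlier stage and a reason, and the orchestrator bounds the total number of retries.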
I will describe here the five conceptually most relevant stages, grouping the rest for brevity.
Stage 1: Parsing and intermediate representation
The first problem is to reduce the source code to a representation that does not depend on the specific language. We do not use an AST of the source language: a Cobol AST does not map to a Java AST, and any attempt at AST-to-AST conversion introduces structural losses. We use a higher-level intermediate representation that captures semantic intentions rather than syntactic constructs: "this function reads a record from an indexed file and validates it against a set of business rules". This representation is extracted from the source code using deterministic parsers (for structure) and specialized AI models (for the semantics of portions that the parsers cannot classify).
The advantage of this approach is that the target language never enters this phase. The intermediate representation is the same for any source-destination pair, and this allows us to support more than thirty languages with a linear number of combinations, not quadratic.
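A toy version of such a representation can make the idea concrete. The structure below is purely illustrative (the field names and intent vocabulary are invented for this sketch, not Scriba's actual IR): each node records a semantic intent rather than a syntactic construct, so N source languages and M target languages require N + M adapters instead of N × M translators.

```python
from dataclasses import dataclass, field

@dataclass
class SemanticOp:
    """One language-neutral operation: what the code intends, not how it is written."""
    intent: str                              # e.g. "read-indexed-record"
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    provenance: str = ""                     # source location, kept for auditability

@dataclass
class IRFunction:
    """A unit of behavior, independent of both source and target language."""
    name: str
    ops: list[SemanticOp] = field(default_factory=list)

# The example from the text: "reads a record from an indexed file
# and validates it against a set of business rules".
load_and_validate = IRFunction(
    name="load-and-validate-customer",
    ops=[
        SemanticOp(intent="read-indexed-record",
                   inputs=["CUSTOMER-FILE"], outputs=["CUSTOMER-REC"],
                   provenance="MODULE-A:120"),
        SemanticOp(intent="validate-business-rule",
                   inputs=["CUSTOMER-REC"], outputs=["WS-VALID-FLAG"]),
    ],
)
```

Because neither Cobol nor Java syntax appears anywhere in the structure, the same IR instance can feed any target-language generator downstream.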
Stage 3: Reconstructing business logic
This is the most delicate stage, and the one that distinguishes Scriba from a simple transpiler. Given the source code and its intermediate representation, the system attempts to reconstruct why the code does what it does — not just what it does.
The difference is concrete. A transpiler, faced with a condition `IF CODICE-CLIENTE = 'X1' THEN PERFORM ROUTINE-SPECIALE`, produces the equivalent version in the target language and stops. Scriba seeks to answer an additional question: what does the code 'X1' mean in business terms? The answer "institutional clients with contracts prior to 2003, subject to specific tax regulations" may not be reconstructible from the code alone, but is often reconstructible from the context: from the names of surrounding variables, from comments scattered in related modules, from the structure of the customer table referenced elsewhere.
This stage uses AI models with structured prompting, but it is not a model: it is an iterative process in which the system formulates hypotheses, verifies them against the rest of the codebase, and revises them when it finds contrary evidence. The result is reconstructed documentation, often the first accurate documentation that application has ever had, which is delivered to the client along with the converted code.
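The hypothesize-verify-revise loop can be sketched abstractly. This is a hand-drawn simplification of the process described above, with `propose` and `verify` standing in for model calls and codebase checks (both names, and the verdict strings, are invented for illustration):

```python
def reconstruct_meaning(constant, evidence_sources, propose, verify):
    """Iteratively propose a business meaning for a constant, check it against
    each evidence source, and revise the hypothesis on contrary evidence.

    propose(constant, context) -> a candidate meaning (e.g. a model call)
    verify(hypothesis, source) -> "confirmed" | "contradicted" | "neutral"
    """
    hypothesis = propose(constant, context=None)
    rejected = []
    for source in evidence_sources:
        verdict = verify(hypothesis, source)
        if verdict == "contradicted":
            rejected.append(hypothesis)
            hypothesis = propose(constant, context=source)  # revise using the evidence
        elif verdict == "confirmed":
            return {"constant": constant, "meaning": hypothesis, "rejected": rejected}
    # No source confirmed the final hypothesis: the meaning stays unresolved.
    return {"constant": constant, "meaning": None, "rejected": rejected + [hypothesis]}
```

Note that the unresolved branch returns the full list of rejected hypotheses: that record is what later allows the system to report honestly what it tried and failed, rather than silently guessing.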
Stage 5: Target code generation
Only at this point does generation occur. And it does not occur on the entire application, but function by function, with context enriched by the intermediate representation, the reconstructed business logic, the idiomatic conventions of the target language, and the client's specifications (coding style, frameworks used, preferred architectural patterns).
The generation uses frontier AI models: here an LLM is the right choice, because the task is exactly what these models excel at: producing idiomatic code in a modern language given a rich context. But the prompt it receives is not "convert this Cobol to Java". It is a structured document that includes the reconstructed functional specification, interface constraints, expected call examples, and style guidelines. It is the equivalent of a well-written development ticket, not a translation request.
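A minimal sketch of what "a well-written development ticket" means in practice: the generation prompt is assembled from upstream artifacts rather than from raw source code. The section names and the `spec` schema below are illustrative assumptions, not Scriba's actual format.

```python
def build_generation_prompt(spec):
    """Assemble a ticket-style prompt from the pipeline's upstream artifacts.

    Each section comes from an earlier stage: the reconstructed business
    logic, the interface constraints from the IR, call examples, and the
    client's style guidelines. The raw Cobol never appears here.
    """
    sections = [
        ("Functional specification", spec["business_logic"]),
        ("Interface constraints", spec["interfaces"]),
        ("Expected calls", spec["examples"]),
        ("Style guidelines", spec["style"]),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)

prompt = build_generation_prompt({
    "business_logic": "Validate a customer record against the 1994 tax rules "
                      "still applicable to pre-2003 institutional clients.",
    "interfaces": "ValidationResult validate(CustomerRecord record)",
    "examples": "validate(record) -> ValidationResult.OK",
    "style": "Follow the client's standard service-layer conventions.",
})
```

The point of the sketch is structural: the model that writes the Java never sees "convert this Cobol", only a specification it is well suited to implement.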
Stage 7: Generating equivalence tests
The generated code, as idiomatic as it may be, has no value if it is not behaviorally equivalent to the original. This stage automatically generates a suite of tests that compare the output of the original code and the new code on synthetic inputs generated from the real data structures of the client. It is a form of differential testing that becomes the main validation tool.
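The shape of differential testing is simple to show. The harness below is a toy, not Scriba's; the rounding example is chosen because it illustrates exactly the kind of silent divergence this stage exists to catch: Cobol's `ROUNDED` rounds halves away from zero, while Python's built-in `round` uses banker's rounding (half to even), so a naive conversion passes on most inputs and fails on exact halves.

```python
def differential_test(legacy_fn, modern_fn, inputs):
    """Run both implementations on the same inputs; collect every divergence."""
    failures = []
    for case in inputs:
        expected = legacy_fn(case)   # the original behavior is the oracle
        actual = modern_fn(case)
        if expected != actual:
            failures.append({"input": case, "expected": expected, "actual": actual})
    return failures

def legacy_round(x):
    # Cobol-style ROUNDED for non-negative values: half away from zero
    return int(x + 0.5)

def modern_round(x):
    # Python's round: banker's rounding (half to even)
    return round(x)

failures = differential_test(legacy_round, modern_round, [2.4, 2.5, 3.5])
# The only divergence is at 2.5: legacy gives 3, the naive port gives 2.
```

Two of the three inputs agree, which is precisely why spot-checking is not enough: without systematically generated inputs covering the boundary cases, the 2.5 case ships to production.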
Where tests fail, the system does not simply report the problem: it uses the failure as a signal to return to previous stages and revise the conversion hypothesis. This is the controlled backtracking mechanism I mentioned above.
Stage 9: Reporting unresolved hotspots
Not all legacy code can be converted completely automatically. There are portions, typically those that depend on specific hardware behaviors, or that implement business logic for which there is no longer any reconstructible trace, that require human intervention.
The architectural value of Scriba is not in pretending to convert 100% of the code automatically. It is in knowing when it does not know. One of the most important design requirements is that the system never silently produces plausible but incorrect code. When uncertainty exceeds a quantifiable threshold, the system isolates the problematic portion, documents the hypotheses it has tried and discarded, and reports it to the human developer as a hotspot to be resolved manually.
This is, from the client's perspective, the single most important advantage over using a generalist LLM. An LLM does not know that it does not know: it always produces output. Scriba quantifies its uncertainty and makes it visible.
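The decision itself reduces to a small triage step. This sketch assumes a scalar confidence score and a fixed threshold; both are simplifications of whatever calibration a real system would use, and every field name here is invented for illustration.

```python
def triage(fragment, confidence, hypotheses, threshold=0.85):
    """Emit converted output only above the confidence threshold.

    Below it, the fragment is isolated as a hotspot, and the hypotheses the
    system tried and discarded travel with it to the human developer.
    """
    if confidence >= threshold:
        return {"status": "converted", "fragment": fragment}
    return {
        "status": "hotspot",
        "fragment": fragment,
        "confidence": confidence,
        "attempted_hypotheses": hypotheses,
    }
```

The contrast with a bare LLM is in the second branch: a generalist model has no such branch, because it always produces an answer.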
The non-technical constraint that shaped the system: private-AI
There is a constraint that does not come from AI architecture theory but from the reality of the sectors in which we operate: client code cannot leave the client's perimeter. A bank does not send its core banking to OpenAI. An insurance company does not send its policy management system to Anthropic. A defense company does not even start the conversation if it begins with "first, upload the code to our cloud".
This constraint has a radical architectural consequence: Scriba must be able to operate on-premise, with models running within the client's perimeter. In practice, this means that we cannot rely solely on closed frontier LLMs, and that the pipeline must be designed to work with a combination of smaller models, executable on reasonable hardware, plus possibly a frontier model called only on the portions that truly require its power and called in modes that guarantee data non-retention.
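The delegation pattern this implies is a router: default to the on-premise model, escalate only the tasks that genuinely need frontier capability. The sketch below is a bare-bones illustration of that idea; the escalation heuristic, the task fields, and the assumption that the frontier channel guarantees non-retention are all hypothetical.

```python
def route(task, local_model, frontier_model, needs_frontier):
    """Delegate to the on-premise model by default; escalate only when needed.

    The frontier call is assumed to go through a channel with data
    non-retention guarantees, as the private-AI constraint requires.
    """
    if needs_frontier(task):
        return frontier_model(task)
    return local_model(task)

def needs_frontier(task):
    # Illustrative heuristic: escalate only low-confidence, high-complexity work.
    return task["confidence"] < 0.6 and task["complexity"] == "high"
```

The economics follow from the routing: most of the pipeline's volume (parsing support, classification, routine generation) stays on local hardware, and the expensive model sees only the fragments where its extra capability pays for itself.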
The unexpected advantage of this constraint is that it forced a better architecture. Had we been able to start from the assumption "we have unlimited access to GPT-5", we would probably have built a lazier, less structured system. The private-AI constraint necessitated specialization, modularity, and intelligent delegation among models of different scales, and these properties have proven virtuous not only for result quality but also for compliance.
I have written about this in more general terms in Building Private AI Systems: the data sovereignty constraint, when taken seriously from the outset, produces structurally better architectures than those born in the cloud.
Why this is an Italian problem, and why it matters
I conclude with an observation that goes beyond the technical but has weighed heavily in the design choices.
Legacy code is a global problem, but Italy's technological infrastructure experiences it with particular intensity. Our banks, our insurance companies, our public administration, our logistics depend disproportionately on legacy applications written in the 1970s and 1980s, often on IBM mainframes, often never replaced because the estimated cost of rewriting has always proven higher than the perceived value of the operation. It is a historical dependency that has kept the country's productive system bound to a single technology provider and to a development paradigm more than thirty years old.
Scriba was born from the collaboration between Algoretico, the company I founded and of which I am CEO, and Lagiste23, the family office of Marco Landi (former President of Apple Computer). Algoretico brings the technological architecture and product direction. Lagiste23 brings the industrial vision and the international network built over forty years at the top of the industry. The operational CEO is Franco Mastrorilli. I lead the AI direction.
The ambition is not to build a product better than American competitors. It is to demonstrate that the problem of modernizing legacy code can be solved with an AI architecture designed here, one that respects constraints (data sovereignty, complete auditability, on-premise operation) which cloud-first environments tend to treat as costs to minimize rather than as project requirements.
The difference between a system that treats privacy as a market constraint and one that treats it as an architectural requirement is not ideological. It is measurable in the structure of the code, in model choices, in the guarantees you can offer to a CISO who knows how to read a DPIA. It is the difference between making a product and making infrastructure.
Scriba, as it is built, is on the side of infrastructure.
— ✦ —
The official site of Scriba is scriba-ai.dev.
