The Five Levels of AI in Software Development
The question most engineering leaders are asking is the wrong one.
They want to know which AI tools their teams should use, how to evaluate them, and how to build a roadmap for adoption. These are reasonable questions. They are also mostly irrelevant to whether the team will actually get faster. The teams pulling ahead are not winning because they picked the right tool. They are winning because they identified the right bottleneck — and it is not the one most organisations are working on.
The bottleneck is specification. The ability to describe what should exist precisely enough that a machine can build it, and to evaluate whether what got built actually works. Everything else — model capability, compute, tooling choice — is secondary. Most engineering teams are investing heavily in the secondary problem while the primary one goes unaddressed.
To understand why, it helps to look at where teams actually are.
The Five Levels
There are five levels of AI use in software development. They are not defined by which tools you have installed. They are defined by where the cognitive work happens — who or what is doing the thinking, and what kind of thinking that is.
Level 1 — Assisted. The developer writes code. AI suggests the next line, completes a function, explains an error. The developer evaluates every suggestion before accepting it. The cognitive model is unchanged: the developer is the author, AI is the autocomplete. The productivity gain is real but modest — less time spent on boilerplate, faster syntax lookup. Most teams reached Level 1 in 2023. It is now the baseline.
Level 2 — Accelerated. The developer describes a task in natural language and AI produces a working implementation. The developer reads the output, modifies it, integrates it. The cognitive model starts to shift: the developer is no longer writing code line by line, but reviewing and directing. This is where most teams currently operate, and it is also where the first serious problem emerges.
A rigorous randomised controlled trial published in mid-2025 found that experienced open-source developers working in their own codebases completed tasks 19% slower when using AI tools than without them. Not faster. Slower. More striking: they predicted AI would make them 24% faster, and after the study still believed it had made them 20% faster. The direction of the effect was wrong, and they did not notice.
The cause is not the tools. It is that Level 2 use inside Level 1 workflows creates drag. You are generating code faster than your review process can handle. The bottleneck moves upstream, and most teams do not move with it. They generate more, review more, slow down, and attribute the slowdown to something else.
Level 3 — Delegated. The developer writes a specification — requirements, constraints, expected behaviour — and hands the task to an agent that implements, tests, and iterates until the specification is met. The human reviews the outcome, not the code. The shift from Level 2 to Level 3 is not a tool upgrade. It is a workflow redesign. The specification becomes the primary work product.
This is where most teams claiming to be "AI-native" actually are not. They use Level 3 tools with Level 2 discipline. They give the agent a vague task, watch it generate code, and review it line by line as if they had written it themselves. The tool is Level 3; the practice is Level 2; the result is Level 2 productivity with Level 3 complexity added to the review burden.
The teams genuinely operating at Level 3 have made specification into a craft. They write precise markdown. They define acceptance scenarios before implementation begins. They treat the agent's output the way a senior engineer treats a junior's PR — rigorous on outcomes, indifferent to methods.
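What "specification as a craft" can look like in practice is easiest to see in miniature. The sketch below is purely illustrative (every name in it is invented, and real Level 3 tooling differs): acceptance scenarios are authored as data before any implementation exists, and a checker reports which scenarios the produced code fails. In a delegated workflow, an agent would regenerate the implementation until that report comes back empty.

```python
# Illustrative sketch: acceptance scenarios written before implementation,
# then checked against whatever the agent produces. All names are invented.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    """One verifiable unit of the specification: given inputs, expect a result."""
    description: str
    given: dict
    expect: object

# The specification is the primary work product, authored first, as data.
SPEC = [
    Scenario("empty cart totals to zero", {"items": []}, 0),
    Scenario("totals sum item prices", {"items": [3, 4]}, 7),
    Scenario("negative prices are rejected", {"items": [-1]}, ValueError),
]

def failures(impl: Callable, spec: list[Scenario]) -> list[str]:
    """Return the descriptions of every scenario the implementation fails."""
    failed = []
    for s in spec:
        try:
            result = impl(**s.given)
            ok = result == s.expect
        except Exception as e:
            # A scenario may specify an expected exception type.
            ok = isinstance(s.expect, type) and isinstance(e, s.expect)
        if not ok:
            failed.append(s.description)
    return failed

# A stand-in for agent output; a Level 3 workflow would regenerate this
# until the failure list is empty.
def cart_total(items):
    if any(p < 0 for p in items):
        raise ValueError("negative price")
    return sum(items)

print(failures(cart_total, SPEC))  # an empty list means the spec is met
```

The point of the shape, not the toy domain: the human's effort goes into `SPEC`, and review happens against outcomes (the failure list), not against the implementation's internals.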
Level 4 — Orchestrated. Multiple agents handle distinct parts of the development lifecycle. One takes requirements and produces a design. Another implements. A third writes tests. A fourth reviews for security. Humans define the workflow and approve transitions at gates they consider genuinely risky. The developer's role is architect and quality controller, not implementer.
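A Level 4 workflow can be pictured as a pipeline of agent stages with human approval required only at the gates deemed genuinely risky. The sketch below is illustrative only; the stage names, gate placement, and `approve` callback are invented, not drawn from any real orchestration framework.

```python
# Illustrative sketch of a gated, multi-agent pipeline. Every name here is
# invented; real orchestration systems differ.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    run: Callable[[str], str]   # stand-in for an agent call
    gated: bool                 # True => a human must approve the output

def run_pipeline(requirement: str, stages: list[Stage],
                 approve: Callable[[str, str], bool]) -> str:
    """Pass the artifact through each stage; stop at any rejected gate."""
    artifact = requirement
    for stage in stages:
        artifact = stage.run(artifact)
        if stage.gated and not approve(stage.name, artifact):
            raise RuntimeError(f"rejected at gate: {stage.name}")
    return artifact

# Toy agents: each just annotates the artifact it receives.
PIPELINE = [
    Stage("design",    lambda a: a + " -> design",   gated=True),
    Stage("implement", lambda a: a + " -> code",     gated=False),
    Stage("test",      lambda a: a + " -> tested",   gated=False),
    Stage("security",  lambda a: a + " -> reviewed", gated=True),
]

result = run_pipeline("spec", PIPELINE, approve=lambda name, art: True)
print(result)  # spec -> design -> code -> tested -> reviewed
```

The design decision this encodes is the one described above: humans define the workflow and choose where the gates sit; they do not sit inside every stage.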
This is rare but not theoretical. StrongDM runs three engineers on a software factory where no human writes or reviews code in the conventional sense. The system takes markdown specifications, builds software, tests it against behavioural scenarios, and produces shippable artifacts. The engineers approve outcomes. Anthropic's internal teams report 70–90% of Claude Code's codebase was written by Claude Code itself. The engineer leading the project has not written code personally in over two months.
Level 4 requires something most organisations do not yet have: test coverage and specification quality sufficient to trust an agent's judgment about whether something works. Without that foundation, Level 4 produces confident wrong answers at industrial scale.
Level 5 — Autonomous. The system maintains and evolves software with minimal human initiation. Engineers define goals and constraints; agents determine what to build, build it, verify it, and deploy it. The human role is strategic direction and exception handling. A handful of teams are experimenting at this frontier; for most organisations, Level 5 remains more roadmap than reality.
The Real Bottleneck
Most organisations trying to advance through these levels focus on the wrong constraint. They evaluate tools, run pilots, establish centres of excellence, and track adoption metrics. None of that is the limiting factor.
The limiting factor at Level 3 and above is specification quality — the ability to describe what should exist with enough precision, completeness, and testability that an agent can act on it without constant human intervention. This is not a prompting skill. It is a thinking skill. It requires clarity about what success looks like before a line of code is written. It requires the discipline to define acceptance criteria rather than approximate intent. It requires the kind of upstream rigour that most development processes treat as optional.
Most senior engineers are not good at this. Not for lack of intelligence, but because their entire career has rewarded a different skill: the ability to hold ambiguity in their heads and resolve it through implementation. They got good at figuring out what something should be by building it. That approach fails at Level 3, because the agent will also figure out what something should be by building it — and the result will not be what you wanted.
The secondary constraint, and the one that organisations almost never address first, is structure. Sprint planning, code review, estimation models, team sizing — all of it was designed around humans writing code at human pace. When an agent produces a week of implementation in an afternoon, every downstream process that assumed human velocity breaks. Code review queues saturate. Estimation becomes meaningless. Sprint commitments disconnect from actual output. The org chart becomes friction before the tools do, and the tools get blamed for a problem the org chart created.
What Changes and What Doesn't
The nature of valuable engineering work is changing, not disappearing. At Level 3 and above, the most valuable thing an engineer can do is think clearly about what a system should do, decompose it into verifiable units, anticipate failure modes before implementation begins, and evaluate whether outcomes match intent. That is closer to systems design and product thinking than it is to programming.
The accumulated intuitions of experienced engineers — how to structure code, which patterns to use, how to navigate a complex codebase — become less differentiating at Level 3 and above. What becomes more differentiating is the quality of thought at the front of the process: how precisely a requirement is stated, how completely edge cases are anticipated, how clearly success is defined before work begins.
The teams that understand this are not just getting faster. They are doing fundamentally different work. Three engineers running a dark factory generating production software from markdown specifications are not a scaled-down version of a traditional engineering team. They are a different kind of organisation entirely.
Most engineering teams are nowhere near that. But the distance is not primarily a question of tools. It is a question of whether the team has developed the discipline to specify well — and whether the organisation around them has adapted to make that the central work, rather than a precursor to it.
That is the actual bottleneck. Most teams have not found it yet.