
Claude Code turned every engineer into three

By Lazar Milicevic · July 3, 2026 · 9 min read


Anthropic reportedly told its growth team to hire more product managers, not more engineers. The stated reason, per recent industry coverage, is that Claude Code quietly turned their engineering org into a team that ships at roughly 3x its headcount, and the constraint moved from the IDE to the people deciding what to build. I read that and laughed, because it matches almost exactly what happened in my own workflow over the last year.

I want to walk through what actually changed when Claude Code became the default way I write software, where the new bottleneck showed up, and what I think teams should do about it now.

The bottleneck moved out of the editor

For most of my career, the rate limiter on a project was typing. Not typing in the literal sense, but the whole loop: reading unfamiliar code, remembering an API surface, wiring boilerplate, writing the test, running it, fixing the import, rerunning. On a solo build like BizFlowAI ContentStudio, a "small" feature (say, adding a new content source with dedup and a scheduled pull) used to be a two-day job. With Claude Code driving the edits under my direction, it's closer to two focused hours. The code isn't better than what I'd write by hand. It's roughly the same. What changed is the elapsed time between "I know what I want" and "it's running in staging".

That means the constraint is no longer implementation. It's:

Knowing what to build.
Knowing what "done" looks like.
Reviewing what the agent produced before it rots the codebase.

Anthropic naming this out loud, by hiring PMs instead of more engineers, is the honest version of what a lot of teams are already feeling but not saying. If your engineers get 3x more done, you don't need 3x more engineers. You need enough product thinking to keep 3x throughput pointed at things that matter.

What "3x" actually looks like on a real project

I don't love the 3x number in the abstract, because it hides where the multiplier comes from. Here's the split I see on my own work, roughly:

Task type	Speedup with Claude Code	Notes
Greenfield module, well-defined spec	4-6x	Agent nails scaffolding, tests, types
Refactor across many files	3-5x	Huge win, but review gets expensive
Debugging a weird production issue	1.5-2x	Speeds hypothesis generation, not root-cause
New integration with a documented API	3-4x	I still read the API docs myself
Undocumented internal legacy code	1-1.5x	Context loading is the limit
Product decisions, prioritization	~1x	No speedup at all

The blended average lands in the "roughly 3x" zone for the kind of work I do (backend, integrations, cloud, LLM pipelines). But look at the last row. Zero speedup on the thing that decides whether any of the other rows should have been done.

That's the whole story in one table.
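A rough way to check the blended figure: hours add up, speedups don't, so the blend is a weighted harmonic mean rather than a simple average of the rows. The sketch below uses the midpoints of the ranges in the table. The share of hours per task type is a made-up assumption for illustration, not measured data, and the function names are hypothetical.

```python
# Speedup per task type: midpoints of the ranges in the table above.
SPEEDUP = {
    "greenfield": 5.0,
    "refactor": 4.0,
    "debugging": 1.75,
    "integration": 3.5,
    "legacy": 1.25,
    "product": 1.0,
}

# ASSUMPTION: hypothetical share of pre-agent hours per task type (sums to 1).
BASELINE_SHARE = {
    "greenfield": 0.35,
    "refactor": 0.20,
    "debugging": 0.10,
    "integration": 0.25,
    "legacy": 0.05,
    "product": 0.05,
}


def blended_speedup(share: dict[str, float], speedup: dict[str, float]) -> float:
    """Overall speedup = old total time / new total time."""
    if abs(sum(share.values()) - 1.0) > 1e-9:
        raise ValueError("baseline shares must sum to 1")
    # Each task type now takes share / speedup of the original total time.
    new_total = sum(share[task] / speedup[task] for task in share)
    return 1.0 / new_total


def new_time_share(share: dict[str, float], speedup: dict[str, float]) -> dict[str, float]:
    """Where the remaining hours go once the speedups are applied."""
    new_time = {task: share[task] / speedup[task] for task in share}
    total = sum(new_time.values())
    return {task: hours / total for task, hours in new_time.items()}


if __name__ == "__main__":
    print(f"blended speedup: {blended_speedup(BASELINE_SHARE, SPEEDUP):.2f}x")
    for task, frac in new_time_share(BASELINE_SHARE, SPEEDUP).items():
        print(f"{task:12s} {frac:.0%} of remaining hours")
```

With these assumed shares the blend comes out just under 3x, and the product row, which started at 5% of the hours, ends up at roughly three times that share of what is left. The rows that don't speed up are the ones that grow.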

What actually took over as the bottleneck in my week

I run a solo shop and multiple client projects, so I get a very clean read on where hours go. After Claude Code became my default, here's what expanded to fill the freed time:

Reviewing generated code. When you ship 3x more diffs, you also review 3x more diffs. If you skip this, you get subtle bugs that all look like plausible code, which is worse than obvious bugs. I now spend real time on review, and I've moved to smaller, more frequent PRs specifically so review stays tractable.

Specifying the work. The agent is only as good as the spec. "Add a Reddit source" produces mush. "Add a Reddit source that pulls from these three subs on a 15-minute cron, dedupes on permalink, stores raw JSON in S3 under sources/reddit/{date}/, and emits a SourceIngested event to EventBridge with schema X" produces a mergeable PR. The second prompt takes 10 minutes to write. That 10 minutes is now the actual work.

Deciding what not to build. This is the one that surprised me. When implementation is cheap, the temptation to build everything goes up. My backlog of "sure, let's try it" grew faster than my ability to evaluate whether any of it moved a metric. I had to get more ruthless, not less.

Evals and monitoring. For LLM features, I now write evals before the feature, because the agent will happily produce a pipeline that looks right and silently regresses when a prompt drifts. Evals are the new tests. If you skip them, your "3x throughput" is 3x liability.
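A minimal sketch of what "evals before the feature" can look like as a ship gate, assuming the pipeline can be called as a plain function from prompt to text. `EvalCase`, `run_evals`, `ok_to_ship`, the substring check and the 0.9 threshold are all hypothetical choices for illustration; real eval sets usually need richer scoring than this.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    must_contain: str  # simplest possible check, used here only for illustration


def run_evals(pipeline: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    if not cases:
        raise ValueError("an empty eval set proves nothing")
    passed = sum(
        1 for case in cases
        if case.must_contain.lower() in pipeline(case.prompt).lower()
    )
    return passed / len(cases)


def ok_to_ship(pass_rate: float, threshold: float = 0.9) -> bool:
    """ASSUMPTION: 0.9 is an arbitrary example threshold, not a recommendation."""
    return pass_rate >= threshold


if __name__ == "__main__":
    # Stand-in pipeline so the sketch runs without any LLM SDK.
    def fake_pipeline(prompt: str) -> str:
        return f"Summary: {prompt}"

    cases = [
        EvalCase("dedup on permalink", "permalink"),
        EvalCase("15-minute cron", "cron"),
    ]
    rate = run_evals(fake_pipeline, cases)
    print(f"pass rate {rate:.0%}, ship: {ok_to_ship(rate)}")
```

Because the pipeline is passed in as a callable, the same gate can run in CI against a prompt change before the change is merged, which is where prompt drift tends to show up first.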

The review problem is the real problem

Here's the trade-off nobody warned me about. Claude Code produces code that reads well. That's dangerous, because code that reads well slips past casual review. The bugs I catch now are almost never syntactic. They're semantic:

Using a library method that exists but does something subtly different from what the agent assumed.
Handling an error path in a way that looks reasonable but violates the system's actual retry semantics.
Adding a database index that's technically correct but competes with an existing one.
Passing user input to an LLM call without the guardrails I have on every other LLM entry point.

None of these are obvious in a diff. All of them require you to hold the whole system in your head while reading.

My working rules for review now:

Every agent-generated PR gets read by a human who understands the system. No exceptions, no "it's a small change".
Small PRs, always. I'd rather review 8 focused PRs than 1 mega-diff, even if the total lines are the same.
The agent writes the test first, I read the test, then the agent writes the code. If the test is wrong, the code will be confidently wrong. Reviewing the test is cheaper and higher-leverage than reviewing the implementation.
Every LLM call goes through the same wrapper. Guardrails, prompt injection filtering, logging, retries. When a new call is added, I check that it uses the wrapper. This one has caught real issues.
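The single-wrapper rule can be sketched as follows. This is an illustration under stated assumptions, not the actual wrapper: every name here (`guarded_call`, `check_input`, `GuardrailError`) is hypothetical, the model is injected as a plain callable so no particular SDK is assumed, and the phrase deny-list is a naive stand-in for a real prompt-injection filter.

```python
import logging
import time
from typing import Callable

log = logging.getLogger("llm")

# ASSUMPTION: a naive deny-list, standing in for a real injection filter.
SUSPICIOUS = ("ignore previous instructions", "reveal your system prompt")


class GuardrailError(ValueError):
    """Raised when input is rejected before any model call is made."""


def check_input(user_input: str, max_chars: int = 8000) -> str:
    if len(user_input) > max_chars:
        raise GuardrailError("input too long")
    lowered = user_input.lower()
    for phrase in SUSPICIOUS:
        if phrase in lowered:
            raise GuardrailError(f"blocked phrase: {phrase!r}")
    return user_input


def guarded_call(
    model: Callable[[str], str],
    user_input: str,
    *,
    retries: int = 3,
    backoff_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Single entry point: guardrails, logging, and retries with backoff."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    text = check_input(user_input)  # guardrail failures are never retried
    for attempt in range(1, retries + 1):
        try:
            reply = model(text)
            log.info("llm ok attempt=%d in=%d out=%d", attempt, len(text), len(reply))
            return reply
        except Exception as exc:  # broad on purpose: the model client is unknown
            log.warning("llm failed attempt=%d/%d: %s", attempt, retries, exc)
            if attempt == retries:
                raise
            sleep(backoff_s * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```

With one entry point like this, the review question for a new LLM call shrinks to "does it go through `guarded_call`?", which is something a reviewer can check in seconds or a grep can check in CI.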

If you don't invest in this, the productivity gain evaporates into an incident review three months later.

What product thinking actually means at 3x throughput

The Anthropic move to hire PMs instead of engineers is a specific claim: that at their throughput, deciding what to build is a harder job than building it. I think that's right, but "hire more PMs" is only half the answer. The other half is that engineers themselves need to get better at product thinking, because the ratio of decisions to code went up, and every senior engineer is now, whether they like it or not, making product calls in every prompt they write to the agent.

The product-thinking skills that matter more now, in my experience:

Framing the problem in outcomes, not features. "We need a dedup step" is a feature. "We're getting complaints about duplicate posts and it's costing us retention on the free tier" is a problem. The agent turns problems into features fine. It cannot turn features into problems.
Knowing what to measure before shipping. If you can't say what number will move, you're not ready to prompt the agent yet.
Cutting scope aggressively. When implementation is free, scope discipline is the only real cost control. Every PM I've worked with who's good at this is worth several engineers now.
Deciding when the agent is the wrong tool. For a genuinely novel algorithm, an architectural change, or a security-sensitive path, I still write the first draft by hand. Knowing where that line sits is a senior skill that is getting more valuable, not less.

A concrete example from my own stack

To make this less abstract: on the content pipeline in BizFlowAI, I added a new capability last month, a self-critique step that re-reads a generated article, scores it against the target keyword's SERP intent, and either publishes or loops for another pass. The old me would have estimated 3 to 5 days for that: write the critique prompt, wire it into the state machine, add retry logic, add cost caps, add logging, write tests.

The agent-assisted version took under a day of actual coding. But the whole project (deciding whether to add it, defining what "good enough to publish" means as a numeric threshold, writing the eval set of 40 articles with human-labeled scores to calibrate the critique against, watching the first week of production output to see if the critique was correlated with actual search performance) took closer to two weeks.

The coding was 10% of the project. The other 90% was product and calibration work. That is exactly the shift Anthropic is describing.
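For readers who want the shape of the coding part, here is a minimal sketch of a publish-or-loop control flow with a pass cap acting as a crude cost cap. It is not the BizFlowAI implementation: `critique`, `revise`, the 0.8 threshold and the limit of three passes are hypothetical, and the hard part described above (scoring against SERP intent and calibrating the threshold on labeled articles) lives inside `critique` and is not shown.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Outcome:
    article: str
    score: float
    passes: int
    published: bool


def critique_loop(
    draft: str,
    critique: Callable[[str], float],     # returns a score in [0, 1]
    revise: Callable[[str, float], str],  # rewrites the article given its score
    *,
    threshold: float = 0.8,               # ASSUMPTION: example value only
    max_passes: int = 3,                  # caps critique calls, and so cost
) -> Outcome:
    """Score the article; publish if good enough, otherwise revise and retry."""
    if max_passes < 1:
        raise ValueError("max_passes must be at least 1")
    article = draft
    for n in range(1, max_passes + 1):
        score = critique(article)
        if score >= threshold:
            return Outcome(article, score, n, published=True)
        if n < max_passes:
            article = revise(article, score)
    # Budget exhausted: hand back the last scored version, unpublished.
    return Outcome(article, score, max_passes, published=False)
```

The loop itself is a few lines. Everything that decides whether it helps or hurts sits in the threshold and in how well `critique` tracks real outcomes, which matches where the two weeks went.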

What I'd do if I were running a team right now

Opinionated version, based on how I actually work and what I'd tell a CTO who asked:

Do not hire more engineers this quarter. Measure your team's actual throughput with Claude Code (or similar) properly used for 60 days. You almost certainly have more capacity than you think.
Hire or promote a serious product thinker per pod. Not a JIRA ticket writer. Someone who can define outcomes, cut scope, and say no.
Make review a first-class investment. Budget time for it, tool it, and reward it. If your best engineer's job title now includes "senior reviewer of agent output", that is not a demotion, it is the highest-leverage role on the team.
Build an evals discipline before you build more features. For every LLM-in-the-loop feature, there is an eval set. No eval, no ship.
Standardize your agent workflows. How prompts are structured, how the agent is scoped to files, how PRs are opened. When everyone has their own style, review becomes chaos. When it's consistent, review becomes a rhythm.
Track the ratio of prompt-time to code-time. If your engineers are spending 90% of their day typing code by hand, they're leaving throughput on the table. If they're spending 90% prompting and 10% reading, they're accumulating bugs. Neither is right. Somewhere near 30% specifying, 30% reviewing, 20% deciding, 20% coding by hand feels closer to right for me.
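If hours are already logged by activity, the ratio in the last point is cheap to compute. A small sketch, assuming a weekly log keyed by four hypothetical category names; the 90% and 10% cut-offs come from the two failure modes described above, and "prompting" and "reading" are mapped onto `specifying` and `reviewing`.

```python
def time_split(hours: dict[str, float]) -> dict[str, float]:
    """Convert logged hours per activity into fractions of the total."""
    total = sum(hours.values())
    if total <= 0:
        raise ValueError("no hours logged")
    return {activity: h / total for activity, h in hours.items()}


def warnings_for(split: dict[str, float]) -> list[str]:
    """Flag the two lopsided patterns; category names are assumptions."""
    found = []
    if split.get("coding_by_hand", 0.0) >= 0.9:
        found.append("mostly hand-typing: throughput left on the table")
    if split.get("specifying", 0.0) >= 0.9 and split.get("reviewing", 0.0) <= 0.1:
        found.append("mostly prompting, little review: bugs accumulating")
    return found


if __name__ == "__main__":
    week = {"specifying": 12, "reviewing": 12, "deciding": 8, "coding_by_hand": 8}
    split = time_split(week)
    print({k: f"{v:.0%}" for k, v in split.items()})
    print(warnings_for(split) or "no warnings")
```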

The teams that will pull ahead over the next 18 months aren't the ones with the most engineers, or even the most sophisticated agent setups. They're the ones whose product thinking got sharp enough to keep 3x throughput pointed at the right targets. Anthropic saw it in their own org and acted on it. Most companies I talk to are still hiring like it's 2022.

Close

If you're a founder or head of engineering staring at a growing backlog and wondering whether the answer is more engineers, more product people, or a smarter agent setup, I'd argue it's usually the middle one plus a serious look at review and eval discipline. That is what has actually moved the needle on my own work and on the projects I've helped rebuild.

If any of this maps to what you're seeing, I write more of these on the blog, and you can reach me at lazar-milicevic.com/#contact. Happy to compare notes.

Frequently asked questions

How much faster does Claude Code actually make software engineering work?

In my own workflow, Claude Code delivers roughly a 3x blended speedup, but the multiplier varies sharply by task type. Greenfield modules with a clear spec run 4-6x faster, refactors across many files 3-5x, new integrations with documented APIs 3-4x, and debugging weird production issues only 1.5-2x. Undocumented legacy code sees almost no speedup because context loading is the real limit, and product decisions get zero speedup. So the '3x' number is real for implementation work, but it hides that the bottleneck moves to deciding what to build.

Why is Anthropic hiring product managers instead of more engineers?

According to recent industry coverage, Anthropic told its growth team to hire PMs because Claude Code effectively turned their engineering org into one that ships at about 3x its headcount. When implementation stops being the constraint, adding more engineers doesn't help - the new bottleneck is deciding what to build, defining what 'done' looks like, and reviewing agent output before it degrades the codebase. If engineers get 3x more done, you need enough product thinking to point that throughput at things that actually matter.

What is the biggest risk of using AI coding agents like Claude Code?

The biggest risk is that agent-generated code reads well, which lets subtle semantic bugs slip past casual review. The bugs I catch now are rarely syntactic - they're things like using a library method that does something subtly different from what the agent assumed, handling errors in ways that violate the system's actual retry semantics, adding database indexes that compete with existing ones, or passing user input to an LLM call without the guardrails used elsewhere. None of these are obvious in a diff; all require holding the whole system in your head. If you skip rigorous review, the productivity gain evaporates into liability.

How should you write prompts for Claude Code to get mergeable pull requests?

Vague prompts produce mush; specific prompts produce mergeable PRs. Instead of 'Add a Reddit source', I write something like: 'Add a Reddit source that pulls from these three subs on a 15-minute cron, dedupes on permalink, stores raw JSON in S3 under sources/reddit/{date}/, and emits a SourceIngested event to EventBridge with schema X.' That specification takes about 10 minutes to write, and that 10 minutes is now the actual work. The agent is only as good as the spec, so precise inputs, storage paths, event schemas, and edge-case handling are what turn a coding agent into a productive collaborator.
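One way to make that discipline hard to skip is to treat the spec as data and refuse to render a prompt while any field is empty. A minimal sketch, assuming a hypothetical `SourceSpec` type; the field names are invented for illustration, the subreddit names are placeholders because the example above does not name them, and "X" is kept as the schema name exactly as written.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SourceSpec:
    name: str
    inputs: tuple[str, ...]
    schedule: str
    dedupe_key: str
    storage_prefix: str
    event_name: str
    event_schema: str

    def missing(self) -> list[str]:
        """Names of fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def to_prompt(self) -> str:
        gaps = self.missing()
        if gaps:
            raise ValueError(f"spec incomplete, fill in: {', '.join(gaps)}")
        return (
            f"Add a {self.name} source that pulls from {', '.join(self.inputs)} "
            f"on a {self.schedule}, dedupes on {self.dedupe_key}, "
            f"stores raw JSON in S3 under {self.storage_prefix}, and emits a "
            f"{self.event_name} event to EventBridge with schema {self.event_schema}."
        )


spec = SourceSpec(
    name="Reddit",
    inputs=("<sub-1>", "<sub-2>", "<sub-3>"),  # placeholders, not real subs
    schedule="15-minute cron",
    dedupe_key="permalink",
    storage_prefix="sources/reddit/{date}/",
    event_name="SourceIngested",
    event_schema="X",
)

if __name__ == "__main__":
    print(spec.to_prompt())
```

A spec with only the name filled in raises instead of producing a prompt, which is the "Add a Reddit source" case failing before it reaches the agent.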

What are the best practices for reviewing AI-generated code?

My working rules are: every agent-generated PR gets read by a human who understands the system, with no exceptions for 'small' changes. Keep PRs small and focused - I'd rather review 8 focused PRs than 1 mega-diff. Have the agent write the test first, review the test, then let it write the implementation, because a wrong test produces confidently wrong code and reviewing tests is cheaper and higher-leverage. Route every LLM call through a single wrapper that handles guardrails, prompt injection filtering, logging, and retries, and verify new calls use it. For LLM features specifically, write evals before the feature ships - evals are the new tests.

Lazar Milićević

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.
