
Less Autonomy, More Value: The Morgan Stanley Lesson

By Lazar Milicevic · July 1, 2026 · 10 min read

[Image: financial analyst workstation with multiple monitors displaying market data]

Morgan Stanley just did something quietly radical: they put AI agents inside daily profit and loss reconciliation, one of the most accuracy-critical, deadline-driven jobs on Wall Street, and cut the work in half. The counterintuitive detail buried in the reporting is that they got there by making the agents less autonomous, not more. That matches almost every production agent deployment I have shipped or reviewed in the last two years, and it is the single hardest lesson to sell to executives who bought the "fully autonomous AI workforce" pitch.

I want to unpack why constraining agents wins, what the reconciliation workflow actually looks like from an engineering standpoint, and how I design human checkpoints into agent systems without gutting the productivity gain.

What Morgan Stanley actually built, and why P&L reconciliation is the perfect test case

P&L reconciliation is the daily process of comparing what the trading systems say a desk made or lost against what the books and records systems, prime brokers, and downstream risk systems say. Every discrepancy has to be explained, categorized, and resolved before the numbers get certified. In a large bank, a single trading desk can generate thousands of breaks per day across FX rates, corporate actions, timing differences, fee calculations, and stale market data.
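
To make the comparison concrete, here is a minimal sketch of break detection between two sources. The `find_breaks` name, the position keys, the amounts, and the tolerance are all illustrative assumptions, not Morgan Stanley's schema; a real reconciliation would match on far richer keys and many more systems.

```python
from decimal import Decimal


def find_breaks(trading: dict[str, Decimal],
                books: dict[str, Decimal],
                tolerance: Decimal = Decimal("0.01")) -> list[dict]:
    """Compare P&L per position key across two sources and return the breaks.

    A break is any key missing from one side, or present on both sides
    with a difference larger than the tolerance.
    """
    breaks = []
    for key in sorted(trading.keys() | books.keys()):
        t, b = trading.get(key), books.get(key)
        if t is None or b is None:
            breaks.append({"key": key, "kind": "missing",
                           "trading": t, "books": b})
        elif abs(t - b) > tolerance:
            breaks.append({"key": key, "kind": "amount",
                           "trading": t, "books": b, "diff": t - b})
    return breaks


# Hypothetical figures, for illustration only.
trading = {"FX-EURUSD": Decimal("1250.00"), "EQ-ACME": Decimal("-310.40")}
books = {"FX-EURUSD": Decimal("1248.75"), "EQ-ACME": Decimal("-310.40"),
         "FEE-ADJ": Decimal("-12.00")}
breaks = find_breaks(trading, books)
```

Here `breaks` holds two entries: a missing `FEE-ADJ` line on the trading side and a 1.25 amount difference on `FX-EURUSD`. Each of those is one unit of work that someone, human or agent, has to explain.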

It is a great agent target for four reasons:

The task is decomposable. Fetch a break, pull context from N systems, hypothesize a cause, propose a resolution.
Ground truth exists. Every break has a correct explanation, eventually.
The cost of a wrong answer is high but bounded. A misclassified break gets caught downstream. A misclassified and unreviewed break becomes a regulatory problem.
The humans doing it are expensive and burned out. Product controllers with CFAs spending their day on Excel lookups is textbook automation ROI.

According to the VentureBeat coverage of the deployment, Morgan Stanley reduced the reconciliation workload by roughly 50%. The agents do not close breaks. They investigate, gather evidence, propose an explanation, and hand a packaged case to a human. The human decides.

That last sentence is the whole game.

Why "less autonomous" outperforms "more autonomous" in high-stakes workflows

There is a graph I sketch on whiteboards for clients that has two axes: agent autonomy on the x-axis and realized business value on the y-axis. The curve is not monotonic. It rises, plateaus, and then falls off a cliff.

The cliff is where three failure modes compound:

Silent errors accumulate. An agent that closes its own tickets produces a stream of decisions no one audits until a downstream system breaks.
Trust collapses on the first bad autonomous action. One misfiled reconciliation and the desk head bans the tool. I have watched a six-month rollout die in a single afternoon over this.
Regulators and internal audit stop signing off. In banking, insurance, healthcare, and legal, "the model decided" is not a defensible answer. "The model recommended and a licensed human approved" is.

The peak of that curve, the sweet spot, is what I call assistive autonomy: the agent does 90% of the cognitive work (fetching, correlating, hypothesizing, drafting) and the human does 10% (verifying, approving, catching edge cases). That is where you get the 50% throughput gain without the tail risk.

Morgan Stanley picked the peak on purpose. Every deployment I have shipped that lasted more than a year sat at that same peak.

The architecture I use for keeping humans in the loop by design

Here is the pattern I use for regulated or high-stakes agent workflows. It is not glamorous, but it survives contact with real operations.

1. Deterministic scaffolding, LLM reasoning inside the cells

The overall workflow is a state machine defined in code. Steps like "fetch the trade from OMS," "pull the price from the risk system," "check the corporate actions calendar" are deterministic tool calls with strict schemas. The LLM's job is to reason inside the cells: given this context, what is the likely cause? Draft the explanation. Rank the top three hypotheses.

# Simplified state machine for a reconciliation agent
class ReconciliationCase:
    def run(self, break_id: str) -> CasePackage:
        break_data = self.fetch_break(break_id)          # deterministic
        context = self.gather_context(break_data)        # deterministic
        hypotheses = self.llm_hypothesize(context)       # LLM reasoning
        evidence = self.gather_evidence(hypotheses)      # deterministic
        ranked = self.llm_rank_and_explain(evidence)     # LLM reasoning
        return self.package_for_review(ranked)           # deterministic
        # NOTE: no self.resolve() method exists. That is the human's job.

That last comment is the design. There is literally no code path for the agent to close a case. The affordance does not exist.
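
For readers who want something they can run, here is a smaller self-contained sketch of the same shape. Everything in it is an assumption made for illustration: the `ReconciliationAgent` and `Hypothesis` names, the two-step pipeline, and the stub functions standing in for real systems and a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Hypothesis:
    cause: str
    explanation: str


@dataclass(frozen=True)
class CasePackage:
    break_id: str
    context: dict
    ranked: tuple[Hypothesis, ...]
    status: str = "PENDING_REVIEW"   # the only status the agent can produce


class ReconciliationAgent:
    """Deterministic pipeline; the LLM step is injected as a plain callable."""

    def __init__(self,
                 fetch_context: Callable[[str], dict],
                 hypothesize: Callable[[dict], list[Hypothesis]]):
        self._fetch_context = fetch_context   # deterministic tool call
        self._hypothesize = hypothesize       # LLM reasoning step

    def run(self, break_id: str) -> CasePackage:
        context = self._fetch_context(break_id)
        ranked = self._hypothesize(context)[:3]   # keep the top three
        return CasePackage(break_id, context, tuple(ranked))
    # Deliberately no resolve() or close() method on the agent.


# Stubs standing in for real systems and a real model call.
def fake_fetch(break_id: str) -> dict:
    return {"break_id": break_id, "diff": "1.25", "currency": "EUR"}


def fake_llm(context: dict) -> list[Hypothesis]:
    return [Hypothesis("stale_fx_rate", "Rate feed lagged the books snapshot."),
            Hypothesis("timing", "Trade booked after the cut-off.")]


agent = ReconciliationAgent(fake_fetch, fake_llm)
package = agent.run("BRK-001")
```

Injecting the model as a callable keeps the scaffolding testable without an API key, and the frozen `CasePackage` means nothing downstream can quietly flip the status to resolved.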

2. Confidence, not probability

Ranking hypotheses by an LLM's self-reported "confidence" is a trap, and raw log probabilities on the final token are not calibrated confidence either. Instead, I score cases on three separate axes and send only the cases that score high on all three to fast-path review:

Evidence completeness: did the agent actually retrieve all the sources it should have? (deterministic check)
Hypothesis consistency: do multiple sampling passes converge on the same cause? (self-consistency check)
Historical precedent: does this pattern match resolved breaks in the last 90 days? (retrieval against a labeled corpus)

Cases scoring high on all three go to a one-click approve UI. Cases that score low on any axis go to full manual review, with the agent's work visible but flagged.
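
The three axes and the routing rule can be sketched as follows. The function names, the 0.75 threshold, and the idea of expressing each axis as a number between 0 and 1 are assumptions for illustration; the precedent score is taken as given here because it would come from a retrieval step that is out of scope for a short example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseScores:
    evidence_completeness: float    # fraction of required sources retrieved
    hypothesis_consistency: float   # share of samples agreeing on the top cause
    historical_precedent: float     # similarity to recently resolved breaks


def evidence_completeness(required: set[str], retrieved: set[str]) -> float:
    """Deterministic check: what fraction of required sources came back?"""
    if not required:
        return 1.0
    return len(required & retrieved) / len(required)


def hypothesis_consistency(sampled_causes: list[str]) -> float:
    """Self-consistency: share of sampling passes that picked the modal cause."""
    if not sampled_causes:
        return 0.0
    _, count = Counter(sampled_causes).most_common(1)[0]
    return count / len(sampled_causes)


def route(scores: CaseScores, threshold: float = 0.75) -> str:
    """Fast path only when every axis clears the threshold."""
    axes = (scores.evidence_completeness,
            scores.hypothesis_consistency,
            scores.historical_precedent)
    return "fast_path" if min(axes) >= threshold else "full_review"


scores = CaseScores(
    evidence_completeness=evidence_completeness(
        {"oms", "risk", "corp_actions"}, {"oms", "risk", "corp_actions"}),
    hypothesis_consistency=hypothesis_consistency(
        ["stale_fx_rate"] * 4 + ["timing"]),
    historical_precedent=0.9,   # assumed output of a retrieval step
)
decision = route(scores)
```

Routing on the minimum rather than the average is the point: one weak axis should be enough to pull a case off the fast path.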

3. The reviewer UI is the actual product

Engineers underweight this. The productivity gain does not come from the agent being smart. It comes from the reviewer opening a case and seeing:

The break in plain English at the top.
The three most likely causes with a clear "why" for each.
The supporting evidence, already fetched, one click away.
The proposed resolution pre-filled but not committed.
A single keyboard shortcut to approve, reject, or edit.

If the reviewer has to switch tabs, the entire benefit evaporates. In a project I ran that saved 73 hours a month across four connected workflows, roughly two thirds of the savings came from UI design and one third from the agent itself. That ratio surprises people every time.
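
One way to hold yourself to that list is to make it the shape of the payload the screen receives. The sketch below is a plain-text stand-in for a real UI, with hypothetical `ReviewCase`, `Cause`, and `render` names and made-up example data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cause:
    label: str
    why: str
    evidence_refs: tuple[str, ...] = ()


@dataclass(frozen=True)
class ReviewCase:
    summary: str                  # the break in plain English
    causes: tuple[Cause, ...]     # most likely first
    proposed_resolution: str      # pre-filled, never committed by the agent
    committed: bool = False


def render(case: ReviewCase) -> str:
    """Lay the case out in the order a reviewer reads it."""
    lines = [case.summary, ""]
    for rank, cause in enumerate(case.causes[:3], start=1):
        lines.append(f"{rank}. {cause.label}: {cause.why}")
        lines.append(f"   evidence: {', '.join(cause.evidence_refs) or 'none'}")
    lines += ["", f"Proposed (not committed): {case.proposed_resolution}",
              "[a] approve  [r] reject  [e] edit"]
    return "\n".join(lines)


# Hypothetical example data.
case = ReviewCase(
    summary="EUR/USD desk P&L is 1.25 higher in trading than in books.",
    causes=(
        Cause("stale_fx_rate", "Books used the prior rate snapshot.",
              ("rate_feed_log", "books_snapshot")),
        Cause("timing", "Trade booked after the cut-off."),
    ),
    proposed_resolution="Reprice the books entry at the current rate.",
)
screen = render(case)
```

If a field in that payload cannot be filled before the reviewer opens the case, that is the tab switch waiting to happen.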

4. Every action is logged as a case artifact

Every tool call, every prompt, every LLM response, every human decision is stored as an immutable artifact linked to the case. This is not just for audit. It is the training data for the next generation of the system and the evidence you need when a regulator asks "how did you arrive at this number on July 3?"
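
A minimal in-memory sketch of such a log is below. The hash chaining is an assumption added for illustration, as one common way to make tampering detectable; it is not something the reporting describes, and real immutability would also need write-once storage underneath. The `CaseLog` and `Artifact` names are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    case_id: str
    kind: str        # e.g. "tool_call", "prompt", "llm_response", "human_decision"
    payload: dict
    prev_hash: str
    hash: str


class CaseLog:
    """Append-only log; each entry commits to the one before it."""

    def __init__(self) -> None:
        self._entries: list[Artifact] = []

    @staticmethod
    def _digest(case_id: str, kind: str, payload: dict, prev_hash: str) -> str:
        body = json.dumps([case_id, kind, payload, prev_hash], sort_keys=True)
        return hashlib.sha256(body.encode("utf-8")).hexdigest()

    def append(self, case_id: str, kind: str, payload: dict) -> Artifact:
        prev_hash = self._entries[-1].hash if self._entries else ""
        digest = self._digest(case_id, kind, payload, prev_hash)
        entry = Artifact(case_id, kind, payload, prev_hash, digest)
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; any edited entry breaks the chain."""
        prev_hash = ""
        for e in self._entries:
            if e.prev_hash != prev_hash:
                return False
            if e.hash != self._digest(e.case_id, e.kind, e.payload, prev_hash):
                return False
            prev_hash = e.hash
        return True

    def for_case(self, case_id: str) -> list[Artifact]:
        return [e for e in self._entries if e.case_id == case_id]


log = CaseLog()
log.append("BRK-001", "tool_call", {"tool": "fetch_trade", "trade_id": "T-9"})
log.append("BRK-001", "human_decision", {"action": "approve", "reviewer": "jdoe"})
```

Answering the regulator's question then becomes a call to `log.for_case(...)` plus a `log.verify()` to show nothing was edited after the fact.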

The trade-offs no one talks about at conferences

Building this way costs you three things. You should know them going in.

You give up the sci-fi demo. "Watch the agent resolve 10,000 breaks overnight" makes a great video and a terrible production system. The demo executives approve is not the system that actually works.

You add latency. Human-in-the-loop means cases sit in queues. For a daily P&L close, that is fine because the humans are there anyway. For a real-time trading decision, it is not. Pick the right workflows.

You need to invest in the reviewer experience. If your ops team already hates their tools, dropping an "AI cockpit" on top of them will not fix it. Sometimes I spend the first six weeks of an engagement just fixing how humans see their existing work before I introduce any agent at all.

The upside is that the system compounds. Every approved case is a labeled training example. Every rejection is a hard negative. After 6 to 12 months you have a proprietary dataset that is genuinely hard for a competitor to replicate, and the fraction of cases that need full manual review keeps shrinking.
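
Here is a small sketch of how reviewer decisions could turn into labeled rows. The `ReviewDecision` shape is hypothetical, and treating an edit as a negative for the proposed cause plus a positive for the corrected one is an assumption on top of the approve and reject cases described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReviewDecision:
    case_id: str
    proposed_cause: str
    action: str                          # "approve", "reject", or "edit"
    corrected_cause: Optional[str] = None


def to_training_rows(decisions: list[ReviewDecision]) -> list[dict]:
    """Approvals become positives; rejections and edits become hard negatives."""
    rows = []
    for d in decisions:
        if d.action == "approve":
            rows.append({"case_id": d.case_id,
                         "cause": d.proposed_cause, "label": 1})
        else:
            rows.append({"case_id": d.case_id,
                         "cause": d.proposed_cause, "label": 0})
            if d.corrected_cause:
                # Assumption: the reviewer's correction is a positive example.
                rows.append({"case_id": d.case_id,
                             "cause": d.corrected_cause, "label": 1})
    return rows


decisions = [
    ReviewDecision("BRK-001", "stale_fx_rate", "approve"),
    ReviewDecision("BRK-002", "timing", "reject"),
    ReviewDecision("BRK-003", "fee_calc", "edit", corrected_cause="corp_action"),
]
training_rows = to_training_rows(decisions)
```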

A concrete comparison: fully autonomous vs. assisted, from real deployments

Here is what the numbers actually look like across the deployments I have seen, without naming clients. This is directional, not a benchmark.

Metric	Fully autonomous	Assisted (human-in-loop)
Throughput gain vs. baseline	60 to 80% (claimed)	40 to 55% (measured)
Sustained after 6 months	Rarely	Almost always
Silent error rate	2 to 8%	Under 0.5%
Audit / compliance sign-off	Blocked or limited	Approved
Time to production	3 to 6 months	2 to 4 months
Ops team adoption	Resisted	Requested

The fully autonomous column looks better on the top line and loses on every line that matters for staying in production. The assisted column is what actually ships and stays shipped.

How to design the checkpoints without killing the productivity gain

The failure mode on the other side is checkpoint theater: making humans "approve" so many things that you have added work instead of removing it. A few rules I use:

Batch by pattern, not by case. If 200 breaks share the same root cause (a stale FX rate feed), the reviewer approves the pattern once, not 200 times.
Escalate on divergence, not on category. Do not require review for "all corporate actions." Require review when the agent's confidence signals disagree with each other.
Auto-approve the boring stuff, but log it. Genuinely trivial cases (identical amount, identical timing, seen 500 times before) can go through without review, as long as they are sampled for QA weekly.
Give the reviewer a "why did you flag this?" answer. The most demoralizing UI is one that says "please review" with no reason. Every escalation should carry the specific signal that triggered it.

Get this right and the human spends their time on the 10% of cases that need judgment, which is the work they actually enjoy.
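
Three of those rules translate directly into small functions. This is a sketch under stated assumptions: the `signature` field, the 0.3 divergence threshold, and the sampling rate are all placeholders you would tune for your own workflow.

```python
import random
from collections import defaultdict
from typing import Optional


def batch_by_pattern(cases: list[dict]) -> dict[str, list[str]]:
    """Group case ids by root-cause signature: one approval per pattern."""
    batches: dict[str, list[str]] = defaultdict(list)
    for case in cases:
        batches[case["signature"]].append(case["id"])
    return dict(batches)


def escalation_reason(scores: dict[str, float],
                      max_spread: float = 0.3) -> Optional[str]:
    """Return a reviewer-facing reason when the signals diverge, else None."""
    low = min(scores, key=scores.get)
    high = max(scores, key=scores.get)
    if scores[high] - scores[low] > max_spread:
        return (f"signals disagree: {high}={scores[high]:.2f} "
                f"vs {low}={scores[low]:.2f}")
    return None


def qa_sample(auto_approved_ids: list[str], rate: float, seed: int) -> list[str]:
    """Pick a reproducible random sample of auto-approved cases for QA."""
    if not auto_approved_ids:
        return []
    k = max(1, round(len(auto_approved_ids) * rate))
    return random.Random(seed).sample(auto_approved_ids, k)


batches = batch_by_pattern([
    {"id": "B1", "signature": "stale_fx:EURUSD"},
    {"id": "B2", "signature": "stale_fx:EURUSD"},
    {"id": "B3", "signature": "fee_calc"},
])
reason = escalation_reason(
    {"evidence": 0.95, "consistency": 0.40, "precedent": 0.90})
weekly_qa = qa_sample([f"A{i}" for i in range(100)], rate=0.05, seed=7)
```

The string returned by `escalation_reason` is what belongs on the reviewer's screen next to "please review". Seeding the sampler keeps the weekly QA draw reproducible if audit asks how the sample was chosen.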

What I'd do if I were starting a similar deployment tomorrow

If a bank, insurer, or B2B SaaS company came to me on Monday and said "we want to do what Morgan Stanley did, for our workflow," here is the sequence:

Pick one workflow, not a platform. One team, one process, one measurable output. No "AI transformation." Ship one thing that saves one team real hours.
Instrument the current process for two weeks before writing any AI code. You cannot claim a 50% improvement if you never measured the baseline.
Build the reviewer UI first, wired to a dumb rules-based backend. Prove the humans want to work in the new interface before you introduce any LLM.
Add the agent behind the UI, with no autonomous close path. The state machine cannot resolve cases. Only humans can.
Instrument every case as a training artifact from day one. Do not retrofit this. It is painful later.
Report savings in hours and dollars, not model accuracy. The CFO does not care about F1 scores. They care about full-time-equivalent capacity freed up and error rates staying flat or improving.
Only then, and carefully, start auto-approving the boringest tail. Never the interesting cases. Never the rare ones.

This sequence is unsexy. It also works.
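
The baseline and reporting steps come down to simple arithmetic, sketched below. The handling times, monthly volume, hourly cost, and the 160-hour month are made-up inputs for illustration, not figures from any deployment.

```python
from statistics import median


def hours_saved_per_month(baseline_minutes: list[float],
                          assisted_minutes: list[float],
                          cases_per_month: int) -> float:
    """Median handling time before vs. after, scaled to monthly volume."""
    saved_per_case = median(baseline_minutes) - median(assisted_minutes)
    return saved_per_case * cases_per_month / 60


def monthly_report(hours_saved: float, loaded_hourly_cost: float,
                   fte_hours_per_month: float = 160.0) -> dict:
    """Express the saving in the units a CFO reads: hours, dollars, FTEs."""
    return {
        "hours_saved": round(hours_saved, 1),
        "dollars_saved": round(hours_saved * loaded_hourly_cost, 2),
        "fte_freed": round(hours_saved / fte_hours_per_month, 2),
    }


# Made-up measurements, for illustration only.
baseline = [12.0, 9.5, 14.0, 11.0, 10.5]   # minutes per case, before
assisted = [5.0, 6.5, 4.5, 7.0, 5.5]       # minutes per case, after
hours = hours_saved_per_month(baseline, assisted, cases_per_month=1200)
report = monthly_report(hours, loaded_hourly_cost=95.0)
```

With these inputs the medians are 11.0 and 5.5 minutes, so 1,200 cases a month works out to 110 hours saved. Without the two weeks of baseline measurement, the first list does not exist and the calculation cannot be made.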

The real lesson

The public narrative around AI agents in 2026 is still dominated by autonomy maximalism: more tools, longer horizons, less supervision. Morgan Stanley just showed, in one of the most scrutinized workflows in finance, that the productive frontier is somewhere else. It is the workflow where the agent does the tedious 90% and hands a well-packaged decision to a human who does the 10% that requires judgment, accountability, and a signature.

That is the pattern I keep building toward, and it is the pattern I will keep recommending to every CTO and head of engineering who asks me why their "autonomous agent" pilot stalled at 30% adoption.

If you are working on an agent deployment in a high-stakes workflow and want a second pair of eyes on the architecture, I am happy to talk. You can reach me at lazar-milicevic.com/#contact, or read more of how I build these systems on the blog.

Frequently asked questions

Why does making AI agents less autonomous produce better business results in high-stakes workflows?

In my experience shipping production agent systems, the relationship between autonomy and business value is not monotonic: value rises, plateaus, then falls off a cliff as autonomy increases. Fully autonomous agents accumulate silent errors, destroy user trust the first time they make a visible mistake, and fail regulatory and audit scrutiny in regulated industries. The sweet spot is what I call assistive autonomy, where the agent does about 90% of the cognitive work (fetching, correlating, hypothesizing, drafting) and a human does the final 10% (verifying and approving). Morgan Stanley's roughly 50% workload reduction in P&L reconciliation came from deliberately sitting at that peak rather than pushing for full automation.

What did Morgan Stanley's AI agents actually do in daily P&L reconciliation?

Morgan Stanley deployed AI agents into daily profit and loss reconciliation, the process of comparing what trading systems say a desk made or lost against books, prime brokers, and downstream risk systems. The agents do not close reconciliation breaks themselves. They investigate each break, pull context from multiple systems, hypothesize likely causes, gather supporting evidence, and package a case for a human product controller to review and decide. According to VentureBeat's coverage, this approach cut the reconciliation workload by roughly 50% while keeping humans as the final decision-makers.

What makes P&L reconciliation a good use case for AI agents?

P&L reconciliation is well suited to agents for four reasons I look for in any automation target. First, the task is decomposable into clear steps: fetch a break, pull context, hypothesize a cause, propose a resolution. Second, ground truth exists because every break has a correct explanation eventually. Third, the cost of a wrong answer is high but bounded, since errors get caught downstream as long as humans review. Fourth, the humans currently doing it, product controllers spending their day on Excel lookups, are expensive and burned out, which makes the ROI obvious.

How should you architect an AI agent workflow to keep humans in the loop by design?

The pattern I use is deterministic scaffolding with LLM reasoning inside the cells. The overall workflow is a state machine defined in code, where steps like fetching trades, pulling prices, and checking calendars are deterministic tool calls with strict schemas, while the LLM only reasons within those steps to hypothesize causes and draft explanations. Critically, I do not implement any code path for the agent to close or resolve a case, so the affordance to act autonomously simply does not exist. This forces every decision through a human reviewer while still capturing most of the productivity gain.

How should you score AI agent confidence to decide which cases need full human review?

Using raw LLM token probabilities as confidence is a trap because they are not calibrated. Instead, I score each case on three separate axes: evidence completeness (a deterministic check that the agent retrieved all required sources), hypothesis consistency (whether multiple sampling passes converge on the same cause), and historical precedent (whether the pattern matches resolved cases in the last 90 days via retrieval against a labeled corpus). Cases scoring high on all three go to a fast-path one-click approval UI, while low-scoring cases go to full manual review with the agent's work visible but flagged. This routing is where most of the actual throughput gain comes from.

Lazar Milićević

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.
