How to Run an AI POC That Ships to Production

By Lazar Milicevic · June 30, 2026 · 9 min read

Most AI proofs of concept I get called in to rescue are not failing because the model is bad. They are failing because nobody decided, before writing a line of code, what "done" meant. The demo works on three cherry-picked inputs, the CFO asks "so when does this go live?", and the team realizes the path from notebook to production is six months longer than they thought. I have shipped, salvaged, and quietly killed enough POCs over the last few years to have a template I trust. Here it is.

Decide upfront whether a POC is even the right move

A POC is the right move when there is genuine technical uncertainty: can an LLM extract these fields reliably, can a retrieval system answer questions over this messy corpus, can an agent execute this workflow without a human babysitting it. A POC is the wrong move when the uncertainty is actually about adoption, integration, or process. In that case you want a pilot with real users, not a sandbox demo.

Before I scope anything, I ask the sponsor three questions:

What decision does this POC unblock? If the answer is "we want to see what AI can do," stop. That is a discovery workshop, not a POC.
What does success look like in a number? Accuracy above X on a held-out set. Cost per task under Y. Latency under Z at the 95th percentile.
Who owns the production system if this works? If there is no named owner with engineering capacity in the next quarter, the POC will rot regardless of how good it is.

If those three questions cannot be answered cleanly in a 45-minute call, the project is not ready. I have walked away from work for this reason and I have never regretted it.
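
One way to make the second question concrete is to write the thresholds down as data and check measurements against them. The sketch below is illustrative only: the `SuccessCriteria` class, its field names, and every threshold value are hypothetical placeholders, not numbers from a real engagement.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriteria:
    """Hypothetical sign-off thresholds; the values used below are placeholders."""
    min_accuracy: float        # fraction correct on the held-out set
    max_cost_per_task: float   # dollars per task
    max_p95_latency_s: float   # seconds


def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    if not values:
        raise ValueError("need at least one latency measurement")
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def check(criteria, accuracy, cost_per_task, latencies_s):
    """Return a pass/fail flag per criterion so the gap is visible, not just the verdict."""
    return {
        "accuracy": accuracy >= criteria.min_accuracy,
        "cost": cost_per_task <= criteria.max_cost_per_task,
        "latency_p95": p95(latencies_s) <= criteria.max_p95_latency_s,
    }


criteria = SuccessCriteria(min_accuracy=0.85, max_cost_per_task=0.05, max_p95_latency_s=4.0)
latencies = [1.0] * 18 + [3.0, 9.0]  # 20 requests, one slow outlier
report = check(criteria, accuracy=0.88, cost_per_task=0.03, latencies_s=latencies)
```

Keeping the criteria in a frozen object makes it harder to quietly move the goalposts halfway through the POC.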

Timebox aggressively: 2 weeks for evaluation, 4 weeks for the build

The default timebox I use is six weeks total, split into two phases.

Weeks 1-2: evaluation harness and feasibility. No UI. No integrations. The goal is to prove the core capability on real data with a measurable score. For a RAG project this means a labeled set of 50 to 200 question-answer pairs from actual users or actual documents, a retrieval pipeline wired up against the real corpus, and a scoring function. For an agent project it means a set of tasks with explicit success criteria and an evaluation loop that can run the agent end to end.

Weeks 3-6: thin vertical slice. One workflow, one user, one input source, one output. End to end, deployed somewhere real, with observability. Not pretty. Not feature complete. Just real.

The two-phase split matters because most teams skip phase one entirely. They go straight to building a UI on top of an unvalidated capability, then spend the remaining weeks tuning prompts to make the demo look good. That is how you arrive at a 92% demo and a 40% production system.

If a POC is going to take longer than six weeks, it is not a POC. It is a project, and it deserves project-grade scoping with milestones, not the loose energy of a POC.

Build the evaluation harness before the system

This is the single biggest leverage point I know. The team that writes the eval first ships to production. The team that writes the eval last ships a demo.

Your harness needs three things:

A frozen test set. Real inputs, expected outputs (or expected behaviors), with labels you would defend in a meeting. 50 examples is enough to start. Version it.
A scoring function. Exact match where you can. LLM-as-judge with a calibrated rubric where you cannot. For agents, task-completion checks plus side-effect assertions (did it write the right row to the right table).
A runner. A script that takes a system version, runs it across the test set, and emits a JSON report with per-example results and aggregate scores.

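A minimal version, sketched in Python:

from statistics import mean

def summarize(results):
    # Aggregate per-case scores; the mean is one reasonable headline number.
    scores = [r["score"] for r in results]
    return {"n": len(scores), "mean_score": mean(scores) if scores else 0.0}

def evaluate(system, test_set, score_fn):
    # Run one system version across the frozen test set and build the report.
    results = []
    for case in test_set:
        output = system.run(case.input)
        score = score_fn(output, case.expected, case.rubric)
        results.append({
            "id": case.id,
            "input": case.input,
            "output": output,
            "score": score,
            "trace": output.trace,  # assumes the output object carries its own trace
        })
    return {
        "aggregate": summarize(results),
        "per_case": results,
        "version": system.version,
    }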

Run this every time you change a prompt, a model, a chunk size, or a retrieval strategy. Track the numbers in a spreadsheet or a tiny dashboard. When you can show a chart of accuracy climbing from 54% to 81% over two weeks, the production conversation gets easy. When you cannot, the conversation gets philosophical, and philosophical conversations are where POCs die.
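
The harness leaves the scoring function abstract. As a starting point, a normalized exact-match scorer could look like the sketch below. The normalization rules (lowercasing, collapsing whitespace, stripping trailing punctuation) are assumptions to adapt to your data, the function names are illustrative, and it assumes the system output has already been reduced to a plain string.

```python
import re


def normalize(text):
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    collapsed = re.sub(r"\s+", " ", text.strip().lower())
    return collapsed.rstrip(".,;:!?")


def exact_match(output, expected, rubric=None):
    """Return 1.0 when the normalized strings match, else 0.0.

    `rubric` is ignored here; it exists only so the signature matches a
    score_fn(output, expected, rubric) call.
    """
    return 1.0 if normalize(output) == normalize(expected) else 0.0


def accuracy(pairs):
    """Mean exact-match score over (output, expected) pairs."""
    scores = [exact_match(out, exp) for out, exp in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Exact match is deliberately strict. It is worth starting there and loosening only where the labels genuinely allow more than one right answer.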

Use real data, real auth, real failure modes from day one

The fastest way to kill a POC's chances of shipping is to demo it on synthetic data. The second fastest is to bolt on production concerns at the end.

What I insist on, from day one, even when it slows the start:

Real production data, or a faithful sample of it. With PII handling agreed in writing. If legal cannot move fast enough to give you real data in week one, the POC will not ship anyway, so you might as well find that out now.
Real authentication. Even if it is just a service account against a staging instance. Integration auth is where 30% of POC-to-production time disappears.
Real failure modes. What happens when the LLM returns malformed JSON. What happens when retrieval returns nothing. What happens when the user's question is out of scope. Build the error paths before the happy path is shiny.
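
Each of those three failure modes can get an explicit, testable path. The sketch below is one hedged way to do it: `answer`, the injected `retrieve` and `call_llm` callables, the `in_scope` field, and the status strings are all hypothetical names, not part of any particular library.

```python
import json


def answer(question, retrieve, call_llm):
    """Return a dict with an explicit status for every failure path.

    `retrieve` and `call_llm` are injected callables (hypothetical interfaces):
    retrieve(question) -> list of text chunks
    call_llm(question, chunks) -> str that should contain a JSON object
    """
    chunks = retrieve(question)
    if not chunks:
        # Retrieval returned nothing: say so instead of letting the model guess.
        return {"status": "no_context", "answer": None}

    raw = call_llm(question, chunks)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed JSON: keep the raw text for the trace, do not crash.
        return {"status": "malformed_output", "answer": None, "raw": raw}

    if not isinstance(parsed, dict) or "answer" not in parsed:
        # Valid JSON but not the shape we asked for.
        return {"status": "malformed_output", "answer": None, "raw": raw}
    if parsed.get("in_scope") is False:
        # The model flagged the question as out of scope.
        return {"status": "out_of_scope", "answer": None}
    return {"status": "ok", "answer": parsed["answer"]}
```

Because the dependencies are injected, each error path can be exercised in the eval harness with a stub, long before the happy path is polished.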

For one RAG project I worked on, the demo was beautiful against a curated set of 200 documents. The production corpus was 180,000 documents, 12% of which were scanned PDFs with OCR errors, and 8% of which were duplicates with conflicting information. None of that was visible until we ran the harness on a real sample. The fix added three weeks. Finding it in week two instead of week ten saved the project.
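
A cheap first pass at this kind of corpus audit is to look for exact duplicates after normalization. The sketch below is not the tooling from that project. It assumes documents arrive as (id, text) pairs and only catches copies that are identical once case and whitespace are normalized; near-duplicates and OCR damage need heavier tools. All names are illustrative.

```python
import hashlib
from collections import defaultdict


def fingerprint(text):
    """SHA-256 of the text with case and whitespace normalized."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicate_groups(docs):
    """Group document ids that share a fingerprint; return only groups of two or more."""
    groups = defaultdict(list)
    for doc_id, text in docs:
        groups[fingerprint(text)].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def duplicate_rate(docs):
    """Fraction of documents that are redundant copies of an earlier one."""
    docs = list(docs)
    if not docs:
        return 0.0
    redundant = sum(len(ids) - 1 for ids in find_duplicate_groups(docs))
    return redundant / len(docs)
```

Running something like this on a real sample in week one turns "the corpus is probably fine" into a number.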

Pick the architecture that survives a 100x scale-up, not the one that demos fastest

The single biggest cause of "demo purgatory" I see is an architecture that works at demo scale and falls over at production scale. A Jupyter notebook calling the OpenAI API in a loop is not an architecture. It is a sketch.

Here is the rough decision table I use for the thin vertical slice:

Workload shape	What I reach for	Why
Sync user-facing, < 5 req/s, < 10s latency budget	Next.js or FastAPI on a single container, streaming responses	Simple, observable, easy to hand off
Async batch, scheduled, idempotent	AWS Lambda + EventBridge + SQS for retries	Scales to zero, costs nothing when idle
Long-running agents, multi-step, hours of work	Step Functions or a durable workflow engine, with checkpointing	Crashes do not lose state
Heavy retrieval, hybrid search, > 1M chunks	Postgres + pgvector with RRF over BM25 + vector, plus a rerank step	One database to operate, RRF beats either alone
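
For readers who have not met RRF (reciprocal rank fusion): each document scores the sum of 1 / (k + rank) over the ranked lists it appears in, and the fused list is sorted by that sum. A minimal sketch follows. The constant k = 60 is a commonly used default rather than something this post specifies, and the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by id so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["d3", "d1", "d7"]     # keyword ranking (illustrative)
vector_hits = ["d1", "d9", "d3"]   # embedding ranking (illustrative)
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that both retrievers agree on rise to the top, which is why the fused list is a sensible input to the rerank step.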

The trap is choosing the architecture for the demo (everything in a notebook) and promising to "productionize later." Later does not come. Pick the production shape now, even if you only implement 20% of it. The remaining 80% is just filling in code, not redesigning.

For BizFlowAI ContentStudio, I made this call early: every step had to be a Lambda triggered by an EventBridge schedule, with state in Postgres and observability in CloudWatch, from the very first version. The first version did almost nothing useful. But adding capability after that was a matter of writing more handlers, not rewriting the system.
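
To make "writing more handlers" concrete, here is a rough sketch of what one idempotent scheduled step might look like. It is not the ContentStudio code. The `handler(event, context)` signature is the standard Python Lambda entry point, and the sketch assumes the event delivered by an EventBridge rule carries an `id` field. The in-memory `_processed` dict is a stand-in for a Postgres table with a unique key, and `do_work` is a hypothetical placeholder.

```python
_processed = {}  # stand-in for a Postgres table keyed by event id


def do_work(event):
    """Hypothetical placeholder for the step's real logic."""
    return {"scheduled_for": event.get("time")}


def handler(event, context):
    """Process a scheduled event at most once per event id."""
    event_id = event["id"]
    if event_id in _processed:
        # A retry or duplicate delivery: return the stored result, do no new work.
        return {"status": "skipped", "result": _processed[event_id]}
    result = do_work(event)
    _processed[event_id] = result
    return {"status": "processed", "result": result}
```

In a real deployment the check and the write would need to be atomic, for example an insert guarded by a unique constraint, since two invocations can race.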

Instrument everything: traces, costs, and quality, in that order

If you cannot answer "what happened on this specific request" within 30 seconds, your POC is not shippable. Period.

The three things I wire up before I write the second prompt:

Tracing. Every LLM call, every retrieval, every tool invocation logged with inputs, outputs, latency, and a request ID that ties the whole chain together. Langfuse, Helicone, or even a Postgres table works. Just pick one.
Cost tracking per request. Token counts in, token counts out, model, computed dollar cost, written to the same table as the trace. When the CFO asks "what does this cost per user," you answer in seconds, not weeks.
Quality signal in production. Thumbs up/down, implicit signals (did the user re-ask), or LLM-as-judge running on a sample of live traffic. This becomes your post-launch eval set and your continuous improvement loop.

Without these, you cannot debug regressions, you cannot defend cost, and you cannot tell whether a prompt change helped or hurt. With them, your POC graduates to a system that learns.
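
As an illustration of how small the first version can be, the sketch below builds one trace row per LLM call with a computed cost and sums cost per request. The prices in `PRICES_PER_1M_TOKENS` and the model name are placeholders, not real pricing, and the field names are assumptions. The rows could be written to a Postgres table or sent to one of the tools named above.

```python
import uuid
from dataclasses import dataclass

# Placeholder prices in dollars per 1M tokens; look up real numbers for your model.
PRICES_PER_1M_TOKENS = {"example-model": {"input": 1.00, "output": 4.00}}


@dataclass
class TraceRow:
    request_id: str   # ties every step of one user request together
    step: str
    model: str
    tokens_in: int
    tokens_out: int
    latency_ms: float
    cost_usd: float


def cost_usd(model, tokens_in, tokens_out):
    """Dollar cost of one call from token counts and the price table."""
    price = PRICES_PER_1M_TOKENS[model]
    return (tokens_in * price["input"] + tokens_out * price["output"]) / 1_000_000


def trace_llm_call(request_id, step, model, tokens_in, tokens_out, latency_ms):
    """Build the row to persist for a single LLM call."""
    return TraceRow(request_id, step, model, tokens_in, tokens_out, latency_ms,
                    cost_usd(model, tokens_in, tokens_out))


def request_cost(rows, request_id):
    """Total dollar cost of every traced call that belongs to one request."""
    return sum(r.cost_usd for r in rows if r.request_id == request_id)


request_id = str(uuid.uuid4())
rows = [
    trace_llm_call(request_id, "rewrite_query", "example-model", 500, 100, 420.0),
    trace_llm_call(request_id, "answer", "example-model", 3000, 500, 1850.0),
]
```

Storing cost next to the trace means "what does this cost per user" becomes a single aggregate query instead of a reconciliation exercise.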

What I'd do, in order

If I were starting a fresh AI POC on Monday with a six-week clock, here is the order:

Day 1-2. Write the success criteria in numbers. Get them signed off by the sponsor in email. Identify the production owner.
Day 3-5. Build the test set. 50 real examples, labeled. These are the most undervalued days of the project.
Day 6-10. Build the eval harness and a baseline system. Get a number. It will be embarrassing. Good.
Day 11-20. Iterate on the system, running the harness every change. Hit the target metric or decide it is not feasible. Either outcome is a win.
Day 21-30. Build the thin vertical slice with real auth, real data, real error paths, deployed to a real environment.
Day 31-42. Run with two or three real users. Collect production traces. Write the handoff doc: architecture, costs, known limits, next steps.

The handoff doc is what turns a POC into a production project. Without it, even a successful POC gets re-explained, re-litigated, and eventually re-built by the next team. With it, the production team has a runway.

The single most important habit across all of this: say "I do not know yet, the eval will tell us" out loud, often. It resets the room. It moves the conversation from opinion to measurement. It is also the truth.

If you are scoping an AI POC right now and want a second pair of eyes on the success criteria or the architecture, I am happy to take a look. You can reach me at lazar-milicevic.com/#contact, or read more on how I think about RAG evals, agent systems, and getting from POC to production over on the blog.

Frequently asked questions

When should I run an AI proof of concept versus a pilot?

I run a POC only when there is genuine technical uncertainty, such as whether an LLM can reliably extract specific fields, whether retrieval works over a messy corpus, or whether an agent can execute a workflow unattended. If the real uncertainty is about user adoption, integration complexity, or process change, a sandbox POC is the wrong tool; I run a pilot with real users instead. The test I use is simple: if the question is 'can the technology do this?', run a POC; if it's 'will people use this and will it fit our workflows?', run a pilot. Picking the wrong format is one of the most common reasons AI initiatives stall.

How long should an AI POC take?

My default timebox is six weeks total, split into two phases: weeks 1-2 for an evaluation harness and feasibility testing on real data, and weeks 3-6 for a thin vertical slice deployed end to end with observability. If a POC needs more than six weeks, I stop calling it a POC: it is a project and deserves proper scoping with milestones rather than the loose energy of an exploration. The two-phase split is critical because teams that skip evaluation and jump straight to building a UI end up tuning prompts to make demos look good, which is how you get a 92% demo and a 40% production system.

What questions should I ask a sponsor before starting an AI POC?

Before I scope any AI POC, I ask the sponsor three questions: What decision does this POC unblock? What does success look like as a concrete number (accuracy, cost per task, latency at p95)? And who owns the production system if it works? If the answer to the first is 'we want to see what AI can do,' it is a discovery workshop, not a POC. If there is no named production owner with engineering capacity in the next quarter, the POC will rot regardless of how good it is. I have walked away from work when these cannot be answered cleanly in a 45-minute call, and I have never regretted it.

Why should I build an evaluation harness before building the AI system?

Writing the evaluation first is the single biggest leverage point I know for shipping AI to production: teams that build the eval first ship, and teams that build it last ship demos. A good harness needs three things: a frozen, versioned test set of at least 50 real labeled examples; a scoring function (exact match where possible, LLM-as-judge with a calibrated rubric otherwise, plus side-effect checks for agents); and a runner that produces a JSON report with per-example and aggregate scores. You should rerun it every time you change a prompt, model, chunk size, or retrieval strategy. When you can show accuracy climbing from 54% to 81% over two weeks, the production conversation becomes easy instead of philosophical.

Why do AI POCs fail to reach production?

In my experience, most AI POCs fail not because the model is bad but because nobody defined what 'done' meant before writing code, so the demo works on a handful of cherry-picked inputs and collapses on real data. The other common killers are demoing on synthetic data instead of real production data, deferring authentication and integration work until the end (which eats roughly 30% of POC-to-production time), and ignoring failure modes like malformed JSON, empty retrievals, or out-of-scope questions. I insist on real data, real auth, and real error paths from day one, even when it slows the start, because finding these issues in week two instead of week ten is what saves the project.

Lazar Milićević

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.
