AI · Automation · Engineering

PoC to Production: Scale AI Without a Rewrite

By Lazar Milicevic · June 25, 2026 · 10 min read


Every PoC I've inherited from another team failed in the same place. Not the model. Not the prompts. The plumbing. Someone hardcoded the OpenAI client into a Flask route in week one, and by month six that route was the production system handling 40k requests a day with no retry logic, no eval harness, and a vector store running on a t3.medium that nobody had touched since the demo.

The fix is rarely "switch models" or "add Redis." It's that four architectural decisions made on day one quietly turned into load-bearing walls. If you lock them in correctly during the PoC, the path to production is iteration. If you don't, it's a rewrite — and the rewrite usually arrives the week after you sign your first enterprise contract.

Here's what I lock in early, drawn from systems I've shipped that now run unattended.

The four things that decide whether your PoC survives production

In short: (1) a model-agnostic LLM gateway, (2) an evaluation harness wired in before the first user touches it, (3) idempotent, observable task execution, and (4) a data layer that already speaks to production-grade retrieval. Get these right at PoC stage and scaling becomes a config change. Get them wrong and you'll be rewriting under deadline pressure, which is the worst time to make architecture decisions.

None of these require more than a day or two of extra work in the PoC phase. The cost of skipping them compounds non-linearly. I've watched teams spend six weeks doing what should have been a one-week migration because the original code assumed openai.chat.completions.create would always be the call site.

1. Build a model gateway on day one (not day 180)

Every LLM call in the system goes through one internal function. That's the rule. Not openai.chat.completions.create() scattered across 14 files. One gateway. It takes a task name, a payload, and routes to a provider based on config.

Here's the minimum viable version I drop into every new project:

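# llm/gateway.py
from typing import Optional

# Provider SDKs are used inside _dispatch (not shown).
import anthropic, openai, ollama

# ROUTING (task -> model key) and FALLBACK (model key -> backup model key)
# come from config. start_trace, log_completion, log_failure, and _dispatch
# are project helpers that are not shown here.

PROVIDERS = {
    "claude-sonnet": ("anthropic", "claude-sonnet-4-5"),
    "gpt-4o": ("openai", "gpt-4o"),
    "local-llama": ("ollama", "llama3.1:70b"),
}

def call(task: str, messages: list, model_key: Optional[str] = None, **kwargs):
    model_key = model_key or ROUTING[task]  # task -> model map in config
    provider, model_id = PROVIDERS[model_key]

    trace_id = start_trace(task, model_key, messages)
    try:
        response = _dispatch(provider, model_id, messages, **kwargs)
        log_completion(trace_id, response)
        return response
    except Exception as e:
        log_failure(trace_id, e)
        # Fall through to the configured backup model, if there is one.
        # FALLBACK must be acyclic, or this recursion never terminates.
        if FALLBACK.get(model_key):
            return call(task, messages, FALLBACK[model_key], **kwargs)
        raise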
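
The gateway reads `ROUTING` and `FALLBACK` from config without showing them. Here is a minimal sketch of what that config could look like. The task names, the fallback pairs, and the `fallback_chain` / `validate_fallbacks` helpers are illustrative assumptions, not part of the original code. Because `call` recurses through `FALLBACK`, a cycle would loop forever during a multi-provider outage, so checking the chain once at startup is cheap insurance.

```python
# llm/config.py (hypothetical module; task names and fallback pairs are examples)

# Task name -> default model key. Keys must exist in PROVIDERS.
ROUTING = {
    "summarize": "claude-sonnet",
    "classify": "gpt-4o",
}

# Model key -> backup model key tried when the primary raises.
FALLBACK = {
    "claude-sonnet": "gpt-4o",
    "gpt-4o": "local-llama",
}


def fallback_chain(model_key: str, fallback: dict) -> list:
    """Return the ordered list of models tried for model_key; raise on a cycle."""
    chain = [model_key]
    while chain[-1] in fallback:
        nxt = fallback[chain[-1]]
        if nxt in chain:
            raise ValueError(f"fallback cycle: {' -> '.join(chain + [nxt])}")
        chain.append(nxt)
    return chain


def validate_fallbacks(routing: dict, fallback: dict) -> None:
    """Fail fast at startup if any configured model has a cyclic fallback chain."""
    for model_key in set(routing.values()) | set(fallback):
        fallback_chain(model_key, fallback)
```

Calling `validate_fallbacks(ROUTING, FALLBACK)` at import time turns a 3am infinite loop into a deploy-time error.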

What this buys you, in order of importance:

Provider switching is a config change. When Claude releases a new model, or OpenAI cuts prices by 60%, or your enterprise customer demands on-prem inference via Ollama, you change a dict. Not a codebase.
Cost and latency telemetry is centralized. Every call has a trace_id. You can answer "what did this user's session cost us?" in SQL, not by tailing logs.
Fallback logic lives in one place. When Anthropic's API has a regional incident at 3am — and it will — you fall through to OpenAI without paging anyone.
Caching, rate limiting, retries all attach here. Not to the 14 call sites you'd otherwise need to refactor.
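
To make the last point concrete, here is a rough sketch of a retry wrapper that could attach at the gateway. `TransientError` and `with_retries` are hypothetical names, and which provider exceptions count as transient is an assumption you would map from each SDK yourself.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 429s, 5xx)."""


def with_retries(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Wrap fn so TransientError triggers exponential backoff with jitter."""
    def wrapped(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return fn(*args, **kwargs)
            except TransientError:
                if attempt == max_attempts:
                    raise  # out of attempts; the gateway's fallback takes over
                # Base delay doubles each attempt, scaled by a random
                # factor in [0.5, 1.0) so retries do not synchronize.
                delay = base_delay * 2 ** (attempt - 1)
                sleep(delay * (0.5 + random.random() / 2))
    return wrapped
```

In the gateway this would wrap the dispatch function once, for example `_dispatch = with_retries(_dispatch)`, so every task inherits the behavior without touching a call site.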

The mistake I see: people build this gateway "later, once we know what we need." You already know what you need. The shape above covers 95% of cases. Build it Monday.

2. Evals before users — not after the first bad screenshot

The thing that distinguishes production-grade LLM systems from PoCs is not better prompts. It's that someone built an evaluation harness before they had to. The PoC has a developer eyeballing 5 outputs and saying "looks good." Production has a regression suite that runs on every prompt change and tells you when accuracy on edge case #47 just dropped from 94% to 71%.

You need three layers, and you need them at PoC stage even if each one is small. The last column of the table gives realistic starting sizes:

Layer	What it tests	When it runs	Example size at PoC
Unit evals	Single-task correctness (extraction, classification, routing)	Every prompt/model change	30-50 cases
Trajectory evals	Multi-step agent paths reach the right end state	Nightly + pre-deploy	10-20 scenarios
Production sampling	Real traffic scored by LLM-as-judge or human spot-checks	Continuous, 1-5% sample	n/a (ongoing)

I keep evals in the same repo as the code, version-controlled, run via pytest. The judge is usually Claude or GPT-4o with a calibrated rubric — I wrote about calibrating LLM judges separately, because that's a deep topic on its own.
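
A unit eval in this style can be very small. The sketch below is illustrative only: the file name, the ticket-routing task, the cases, and the 75% threshold are all assumptions, and `classify` is a keyword stub standing in for the real gateway call so the example runs offline.

```python
# evals/test_ticket_routing.py (hypothetical file name and task)

# Hand-labeled cases: (input text, expected label).
CASES = [
    ("I was charged twice this month", "billing"),
    ("The app crashes when I upload a PDF", "bug"),
    ("How do I add a teammate?", "how_to"),
    ("Refund my last invoice please", "billing"),
]

MIN_ACCURACY = 0.75  # assumed threshold; tune per task


def classify(text: str) -> str:
    """Stand-in for the real gateway call so the sketch runs offline."""
    lowered = text.lower()
    if "charged" in lowered or "invoice" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    return "how_to"


def accuracy(cases, predict) -> float:
    """Fraction of cases where the prediction matches the label."""
    hits = sum(1 for text, expected in cases if predict(text) == expected)
    return hits / len(cases)


def test_routing_accuracy_meets_threshold():
    # pytest collects any function named test_*; a failure blocks the change.
    score = accuracy(CASES, classify)
    assert score >= MIN_ACCURACY, f"accuracy {score:.0%} below {MIN_ACCURACY:.0%}"
```

Swapping the stub for a gateway call and growing `CASES` from labeled production traces keeps the same test shape.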

The number that matters: time from "I changed a prompt" to "I know if it's better or worse." If that's longer than 10 minutes, you will stop running evals when you're stressed, which is exactly when you need them. On my own ContentStudio pipeline that loop is under 4 minutes for the unit suite. That's why I can ship prompt changes daily without breaking production.

A real lesson: I once had a content-generation agent that scored 92% on my eval set but was producing visibly worse output in production. The eval set had drifted from the actual traffic distribution. Now I sample 50 production traces a week into the eval set, hand-label them, and rotate out the stale ones. Evals are a living dataset, not a one-time setup.

3. Assume every task will be retried (because it will)

LLM calls fail. Network blips, rate limits, content filter false positives, provider-side 500s, model deprecations mid-deploy. If your task is "summarize this 80-page PDF" and it fails 14 minutes in, you need to resume from chunk 27, not restart from chunk 1.

Three architectural rules I follow without exception:

Every task has a deterministic ID. Hash the inputs (prompt + context + model + version). If the same task is submitted twice, the second one returns the cached result. This is not just performance — it's idempotency. Your retry logic, your queue worker, and your user clicking the button twice should all converge to the same outcome.

Long-running work runs as discrete, resumable steps. I use EventBridge + Lambda for most of my serverless agentic pipelines because each step naturally checkpoints. For longer multi-hour jobs (full content cycles, deep research) I use a step-function pattern: state machine, each transition writes to Postgres, any step can be re-driven from its input. If you're using a queue, persist the intermediate state — never just the message.

Failures fail loudly, in one place. Every failure writes to a task_failures table with task_id, step, input hash, exception, model used, and trace_id. I have a dashboard that shows failure rate per task type per day. When it ticks up, I see it before a customer does.
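
The first rule (hash the inputs into a deterministic ID) can be sketched as follows. This assumes the context is JSON-serializable. The `submit` helper and the in-memory `_results` dict are hypothetical stand-ins for a real results store.

```python
import hashlib
import json


def task_id(prompt: str, context: dict, model: str, version: str) -> str:
    """Derive a stable ID from everything that determines the task's output."""
    canonical = json.dumps(
        {"prompt": prompt, "context": context, "model": model, "version": version},
        sort_keys=True,         # key order must not change the hash
        separators=(",", ":"),  # no whitespace variation
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


_results = {}  # in-memory stand-in for a real results store


def submit(prompt, context, model, version, run):
    """Run the task once per unique input; repeats return the stored result."""
    tid = task_id(prompt, context, model, version)
    if tid not in _results:
        _results[tid] = run(prompt, context)
    return tid, _results[tid]
```

Including the model and version in the hash means a model upgrade produces a new ID instead of silently serving results from the old one.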

The PoC version of this is maybe 80 lines of code. The "we'll add this later" version is 3 weeks of work and one embarrassing outage.

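# tasks/runner.py
# load_state, save_state, schedule_retry, record_failure, and RetryableError
# are project helpers that are not shown here.
def run_task(task_id: str, input_payload: dict, steps: list):
    """Run steps in order, checkpointing after each so a retry resumes mid-task."""
    state = load_state(task_id) or {"completed_steps": [], "outputs": {}}

    for step in steps:
        if step.name in state["completed_steps"]:
            continue  # finished in an earlier attempt; skip it
        try:
            result = step.execute(input_payload, state["outputs"])
            state["outputs"][step.name] = result
            state["completed_steps"].append(step.name)
            save_state(task_id, state)  # checkpoint before moving on
        except RetryableError as e:
            schedule_retry(task_id, step.name, e)
            return None  # the retry re-enters run_task and resumes at this step
        except Exception as e:
            record_failure(task_id, step.name, e)
            raise

    return state["outputs"]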
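
The runner hands non-retryable errors to `record_failure`. Here is one way the `task_failures` table behind it could look, sketched with `sqlite3` so it runs without a server (the post's own setup uses Postgres). The `task_type` and `failed_at` columns and the `write_failure_row` helper are assumptions added so the per-type, per-day rollup is possible. The query counts failures, so a true failure rate would also need total runs from a tasks table.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS task_failures (
    task_id     TEXT NOT NULL,
    task_type   TEXT NOT NULL,   -- assumed column, needed for the per-type rollup
    step        TEXT NOT NULL,
    input_hash  TEXT NOT NULL,
    exception   TEXT NOT NULL,
    model       TEXT NOT NULL,
    trace_id    TEXT NOT NULL,
    failed_at   TEXT NOT NULL    -- ISO-8601 UTC timestamp
)
"""


def write_failure_row(conn, task_id, task_type, step, input_hash, exc, model,
                      trace_id, failed_at=None):
    """Insert one failure row; the caller still re-raises the exception."""
    failed_at = failed_at or datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO task_failures VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (task_id, task_type, step, input_hash, repr(exc), model, trace_id, failed_at),
    )
    conn.commit()


def failures_per_type_per_day(conn):
    """Rows of (day, task_type, failure_count), the raw input for a dashboard."""
    return conn.execute(
        "SELECT substr(failed_at, 1, 10) AS day, task_type, COUNT(*) "
        "FROM task_failures GROUP BY day, task_type ORDER BY day, task_type"
    ).fetchall()
```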

4. The data layer that actually speaks production

The single most common PoC-to-production rewrite I see is the retrieval stack. Someone built the demo with ChromaDB on disk, or Pinecone's free tier, or — my favorite — a JSON file. Then traffic shows up and the retrieval layer either falls over or, worse, silently returns garbage because nobody's been measuring recall.

What I lock in at PoC stage:

Postgres + pgvector for the primary store. Not because it's the fastest vector DB, but because it's the one your team already knows how to back up, migrate, and query with SQL. For >90% of B2B SaaS workloads you'll never outgrow it. I've run pgvector with several million chunks at sub-100ms p95 on modest hardware.
Hybrid search from the start. Pure vector search has a recall ceiling that's lower than people think — especially for queries with proper nouns, IDs, or specific terminology. I run BM25 (Postgres tsvector) and vector search in parallel and fuse with Reciprocal Rank Fusion. The implementation is maybe 30 lines. The recall lift is usually 15-30% on real corpora.
Document chunks have stable IDs and provenance. Every chunk knows what document it came from, what version of the chunking strategy produced it, and when it was indexed. When you change your chunking strategy in month 4 (you will), you can roll it out incrementally instead of rebuilding everything.
A re-ranking step in the chain. Even a small cross-encoder re-rank on the top 20 candidates from hybrid search gives noticeable quality gains, and it's cheap. I cover this in more depth in the RAG evaluation post.
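
The fusion step in the hybrid-search item is small enough to show in full. This is a minimal sketch of Reciprocal Rank Fusion over two already-ranked lists of chunk IDs. The function name and the example IDs are illustrative, and `k=60` is a commonly used constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n] if top_n else fused


# Example: IDs from the keyword query and the vector query, best first.
bm25_hits = ["c3", "c1", "c7"]
vector_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF only looks at ranks, so the BM25 and cosine scores never need to be normalized onto a shared scale. Chunks that appear in both lists rise to the top.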

The trade-off worth being honest about: pgvector at very high scale (tens of millions of vectors, sustained high QPS) does get outperformed by purpose-built stores. If you're genuinely in that regime (and most teams telling me they are, aren't), switch when the numbers force you to, not before. The migration is a week of work when retrieval sits behind a clean interface. Which it does, if you applied the gateway pattern from rule #1 to retrieval as well.

The boring stuff that decides whether you sleep at night

These aren't glamorous, but skipping them is what turns a 2am page into a 6-hour outage:

Structured logging with trace IDs end-to-end. Every request gets an ID. Every LLM call, DB query, retrieval, tool call carries it. When something goes wrong you grep one ID and see the whole story.
Cost budgets per tenant/task. A runaway agent loop can spend $400 in 20 minutes. Set hard caps. Alert at 50%, kill at 100%.
Prompt versioning. Prompts are code. They live in the repo, get reviewed, get tagged. The deployed prompt version is logged with every call. When quality drops, you can answer "what changed?" instantly.
A kill switch. One env var that disables agent execution and falls back to a static or human-in-the-loop path. You will need it.
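
The budget and kill-switch items can be sketched together. Everything named here is hypothetical: the `CostBudget` class, the `BudgetExceeded` exception, and the `AGENTS_DISABLED` variable. Spend is held in memory for illustration, whereas a real deployment would keep it in a shared store and reset it per billing window.

```python
import os


class BudgetExceeded(Exception):
    """Raised when a tenant hits its hard spending cap."""


class CostBudget:
    """Tracks spend per tenant; alerts once at 50% and blocks at 100%."""

    def __init__(self, cap_usd, alert=print, alert_ratio=0.5):
        self.cap_usd = cap_usd
        self.alert = alert
        self.alert_ratio = alert_ratio
        self.spent = {}       # tenant -> dollars spent in the current window
        self.alerted = set()  # tenants already warned in this window

    def charge(self, tenant, cost_usd):
        total = self.spent.get(tenant, 0.0) + cost_usd
        self.spent[tenant] = total
        if total >= self.cap_usd:
            raise BudgetExceeded(f"{tenant} spent ${total:.2f} of ${self.cap_usd:.2f}")
        if total >= self.cap_usd * self.alert_ratio and tenant not in self.alerted:
            self.alerted.add(tenant)
            self.alert(f"{tenant} at {total / self.cap_usd:.0%} of budget")


def agents_enabled(env=os.environ):
    """Kill switch: AGENTS_DISABLED=1 (hypothetical variable name) stops agents."""
    return env.get("AGENTS_DISABLED", "0") != "1"
```

Calling `charge` after every LLM call, and checking `agents_enabled()` before each agent step, gives a runaway loop two independent places to stop.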

What I'd do if I were starting a new AI product Monday

Spend the first week on architecture, not features. Specifically:

Day 1-2: Build the LLM gateway and routing config. Wire in two providers and a local fallback.
Day 2-3: Set up Postgres with pgvector, hybrid search, and a tiny eval harness (10 cases is enough to start).
Day 4: Build the task runner with idempotent IDs and step-level checkpointing. Wire in structured logging.
Day 5: Implement cost tracking and the kill switch. Deploy a "hello world" end-to-end trace.

Then build features. Every feature you add slots into a system that's already production-shaped. You'll move slower in week one and faster every week after.

The teams I see succeed treat the PoC as production with a smaller dataset and a higher failure tolerance. Same architecture. Same observability. Same eval discipline. Just less polish on the UI and fewer customers depending on it. The teams I see rewrite at month six treated the PoC as a throwaway and then couldn't throw it away because by then it was making money.

If you're staring at a working PoC and wondering what it takes to get it to handle real traffic without a rewrite, that's the work I do — happy to talk through it. You can find me at lazar-milicevic.com/#contact, or there's more on these patterns over on the blog.

Frequently asked questions

Why do most AI proof-of-concepts fail when scaled to production?

In my experience, PoCs rarely fail because of the model or prompts — they fail because of the plumbing. Teams hardcode an OpenAI client into a Flask route in week one, and six months later that same route is handling 40k requests a day with no retries, no eval harness, and a vector store nobody has touched since the demo. The root cause is that four architectural decisions made on day one quietly become load-bearing walls. If those decisions are wrong, scaling requires a full rewrite, usually right after you sign your first enterprise contract.

What is an LLM gateway and why should I build one on day one?

An LLM gateway is a single internal function that every LLM call in your system routes through, taking a task name and payload and dispatching to a provider based on config. I build one on day one because it turns provider switching into a config change, centralizes cost and latency telemetry under a trace_id, puts fallback logic in one place for when a provider has an outage, and gives you a single attachment point for caching, rate limiting, and retries. Without it, you end up refactoring 14 scattered call sites under deadline pressure. The minimum viable version takes about a day to build and covers 95% of real-world needs.

What evaluation layers do I need for a production LLM system?

I use three layers, and I put all of them in place at the PoC stage, even when each one is small. Unit evals (30-50 cases) test single-task correctness like extraction or classification and run on every prompt or model change. Trajectory evals (10-20 scenarios) verify that multi-step agents reach the right end state, running nightly and pre-deploy. Production sampling continuously scores 1-5% of real traffic using an LLM-as-judge or human spot-checks. The evals live in the same repo as the code, version-controlled, and run via pytest.

How fast should my LLM evaluation loop be?

The metric that actually matters is the time from changing a prompt to knowing whether it's better or worse. If that loop is longer than 10 minutes, you will stop running evals exactly when you're stressed and need them most. On my own production pipeline the unit eval suite runs in under 4 minutes, which is why I can ship prompt changes daily without breaking things. Keep the suite fast and version-controlled in the same repo as your code.

How do I keep an LLM evaluation set from going stale?

Evals are a living dataset, not a one-time setup. I learned this when a content-generation agent scored 92% on my eval set but was visibly worse in production — the eval distribution had drifted from real traffic. Now I sample 50 production traces a week, hand-label them, add them to the eval set, and rotate out the stale ones. This keeps the suite aligned with what users are actually sending and catches regressions you'd otherwise only notice from customer complaints.

Lazar Milicevic

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.
