The AI Implementation Process I Use With Every Client

Most AI projects do not fail at the model. They fail in the six weeks before anyone writes a prompt, and in the six weeks after the demo lands in a Slack channel and nobody knows who owns it. I have run enough of these now (from one-off automations to multi-agent content systems running unattended) that the process has converged into something stable. This is the version I actually use.
It has five phases: scoping, POC, integration, evaluation, operations. Each phase has an exit criterion. If we cannot meet the exit criterion, we do not move forward. That single rule has saved more projects than any clever architecture choice.
Phase 1: Scoping (1 to 2 weeks, fixed price)
Scoping ends with a written document that names the workflow being automated, the system of record it touches, the success metric in hours or dollars, the data we have access to, and the smallest possible first slice. No model is chosen yet. No code is written. If we cannot produce that document, the engagement stops here and the client keeps the document.
The hardest part of scoping is resisting the urge to solve the interesting problem. Clients almost always describe the AI-shaped fantasy ("an agent that handles all support tickets") when the real opportunity is narrower and uglier ("triage tier-1 tickets that mention billing, route to the right queue, draft a reply for human approval"). The narrower version ships. The fantasy does not.
I run scoping as three sessions:
- Workflow walkthrough. Someone who actually does the work shows me their screen for an hour. I record it. I take timestamps. The point is to find the moments where a human is doing pattern matching that an LLM can do, and to find the moments where they are doing judgment that an LLM should not do.
- Data audit. Where does the input live? Where does the output need to go? What is the auth story? If the data is locked inside a SaaS product with no API and no export, that is the project, and we deal with it now, not in week six.
- ROI sizing. Hours saved per month times burdened hourly cost, annualized, minus realistic yearly infrastructure and maintenance costs. If the answer is under $20k/year in savings, I usually tell the client to wait. The build and the babysitting are not worth it at that scale yet.
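The ROI arithmetic from the last session fits in a few lines. This is a minimal sketch: the function names, parameters, and sample figures are illustrative assumptions, and only the $20k/year cutoff comes from the text above.

```python
def annual_net_savings(
    hours_saved_per_month: float,
    burdened_hourly_cost: float,
    annual_infra_cost: float,
    annual_maintenance_cost: float,
) -> float:
    """Gross labor savings per year minus yearly running costs."""
    gross = hours_saved_per_month * 12 * burdened_hourly_cost
    return gross - annual_infra_cost - annual_maintenance_cost


WAIT_THRESHOLD = 20_000  # dollars per year, the cutoff mentioned above


def worth_building(net_savings: float, threshold: float = WAIT_THRESHOLD) -> bool:
    """Below the threshold the advice is to wait."""
    return net_savings >= threshold


# Hypothetical workflow: 40 hours/month at a $65/hour burdened cost,
# with $3,000 infrastructure and $6,000 maintenance per year.
net = annual_net_savings(40, 65.0, 3_000, 6_000)
print(net, worth_building(net))
```

Halving the hours in this made-up example drops the net figure well below the cutoff, which is the kind of sensitivity worth checking before anyone commits budget.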
Exit criterion: a one-page scope with a single first slice, a measurable success metric, and a named human owner on the client side. No owner, no project.
Phase 2: Proof of Concept (2 to 4 weeks)
The POC has one job: kill the project cheaply if it cannot work. I treat the POC as adversarial. I am trying to find the reason this will not ship, before we spend integration money on it.
Concretely, a POC for me looks like this:
- A thin script or notebook, not a product. No UI unless the UI is the risk.
- The real model we plan to use in production (Claude Sonnet, GPT-4 class, or a local Llama/Qwen via Ollama if data residency matters), not a cheaper proxy. Cheap models lie about feasibility.
- A hand-curated set of 30 to 50 real examples from the client's actual data, with expected outputs written by a human who knows the domain.
- A rough eval harness, even if it is just a spreadsheet with pass/fail and notes.
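A rough harness does not need to be more than this. The sketch below assumes a classification-style task with exact-match scoring; `Example`, `run_eval`, and `fake_predict` are hypothetical names, and `fake_predict` is a keyword rule standing in for the real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    input_text: str
    expected: str
    notes: str = ""


def run_eval(examples: list[Example], predict: Callable[[str], str]) -> dict:
    """Run every example through `predict` and record pass/fail per row."""
    rows = []
    for ex in examples:
        actual = predict(ex.input_text)
        rows.append({
            "input": ex.input_text,
            "expected": ex.expected,
            "actual": actual,
            # Case-insensitive exact match; open-ended outputs need a rubric instead
            "passed": actual.strip().lower() == ex.expected.strip().lower(),
        })
    passed = sum(r["passed"] for r in rows)
    return {"pass_rate": passed / len(rows) if rows else 0.0, "rows": rows}


# Stand-in for the real model call (a keyword rule, not an LLM)
def fake_predict(text: str) -> str:
    return "billing" if "invoice" in text.lower() else "other"


examples = [
    Example("My invoice is wrong", "billing"),
    Example("I was charged twice", "billing"),
    Example("How do I reset my password?", "other"),
]
report = run_eval(examples, fake_predict)
print(f"pass rate: {report['pass_rate']:.2f}")
```

The per-row output is what ends up in the spreadsheet; the failing rows are more useful than the aggregate number.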
The POC answers four questions in order:
| Question | What "no" means |
|---|---|
| Does the model produce the right shape of output reliably? | Schema issues, structured-output failures. Fixable. |
| Does it produce the right content on easy cases? | Capability gap. Sometimes fixable with retrieval or examples. |
| Does it handle the long tail without catastrophic failures? | The real risk. Often the project killer. |
| Can we detect when it is wrong? | If no, the project cannot ship to production. Full stop. |
That last question is the one most people skip. An AI system you cannot evaluate is an AI system you cannot trust, and an AI system you cannot trust is a demo, not a product. I have walked away from POCs that worked 90% of the time because there was no signal to catch the 10%.
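Shape checks are the cheapest detection signal and a reasonable place to start. The sketch below assumes the model returns JSON for a ticket-triage task; the field names and queue values are hypothetical. It only catches structural failures, so content errors still need other signals such as human review or a calibrated judge.

```python
import json

REQUIRED_FIELDS = {"queue": str, "draft_reply": str}  # hypothetical schema
ALLOWED_QUEUES = {"billing", "technical", "account"}  # hypothetical values


def check_output(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the shape checks pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    if isinstance(data.get("queue"), str) and data["queue"] not in ALLOWED_QUEUES:
        problems.append(f"unknown queue: {data['queue']}")
    return problems


print(check_output('{"queue": "sales"}'))
```

Any non-empty result can route the item to a human instead of the automated path.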
Exit criterion: measurable performance on the eval set that the client agrees is good enough to justify integration cost, plus a documented failure mode list.
Phase 3: Integration (3 to 8 weeks)
This is where most of the actual work lives, and where most of my time goes. The model is usually the easy part by now. The integration is what makes it real.
My default stack for production AI work:
- Orchestration: simple, explicit code first. I reach for LangGraph or a hand-rolled state machine only when the workflow genuinely has branches and loops. Most "agents" are a sequential pipeline pretending to be agentic.
- Storage: Postgres for everything, with pgvector when retrieval matters. Supabase if the client wants managed. I do not introduce a separate vector DB until pgvector measurably stops scaling, which is later than people think.
- Retrieval: hybrid search (BM25 + dense) with reciprocal rank fusion. Pure semantic search loses on exact identifiers, SKUs, error codes, names. Pure keyword loses on paraphrase. RRF is the cheap fix.
- Compute: AWS Lambda + EventBridge for scheduled and event-driven work, API Gateway when something needs to be called. Scales to zero, which matters for workloads that run hourly or in bursts.
- Frontend (when needed): Next.js with server actions. Boring is good here.
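Reciprocal rank fusion from the retrieval bullet is small enough to show in full. This is a minimal sketch that fuses two already-ranked lists of document IDs; the IDs are made up, and `k = 60` is a commonly used default rather than a value taken from this stack.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; score = sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["sku-4411", "doc-a", "doc-b"]   # keyword ranking (hypothetical)
dense_hits = ["doc-a", "doc-c", "sku-4411"]  # embedding ranking (hypothetical)
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

RRF works on ranks only, so the BM25 and dense scores never need to be normalized against each other, which is most of why it is the cheap fix.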
Three integration details I now treat as non-negotiable:
1. Idempotency keys on everything
Any external action (send email, create ticket, post to CRM) gets an idempotency key derived from the input. Retries are inevitable; duplicate side effects are not.
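```python
def idempotency_key(workflow_id: str, input_hash: str, step: str) -> str:
    # Same workflow, step, and input always produce the same key, so a retry can be detected.
    return f"{workflow_id}:{step}:{input_hash}"
```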
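The `input_hash` argument has to come from somewhere, and the key has to be checked before the side effect runs. The sketch below shows one way to do both, under stated assumptions: `hash_input` and `run_once` are hypothetical helpers, and the in-memory set stands in for a durable store.

```python
import hashlib
import json


def idempotency_key(workflow_id: str, input_hash: str, step: str) -> str:
    return f"{workflow_id}:{step}:{input_hash}"


def hash_input(payload: dict) -> str:
    """Stable hash: keys are sorted so equal payloads hash the same."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


_seen_keys: set[str] = set()  # stand-in for a durable store


def run_once(key: str, action) -> bool:
    """Run `action` only if this key is new. Returns True if it ran."""
    if key in _seen_keys:
        return False
    action()
    _seen_keys.add(key)
    return True


sent = []
key = idempotency_key("wf-1", hash_input({"to": "a@example.com", "id": 7}), "send_email")
run_once(key, lambda: sent.append("email"))
run_once(key, lambda: sent.append("email"))  # retry: skipped
print(sent)
```

In production the check and the record would need to be a single atomic step, for example an insert against a unique constraint, since a crash between the action and the bookkeeping here would allow a duplicate.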
2. A human-in-the-loop seam, even if unused
I always build the approval queue before I build the auto-send. Even if the client wants full automation eventually, shipping with human review for the first 2 to 4 weeks catches the failure modes the eval set missed. Turning approval off later is one config change.
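The seam can be as small as one flag in front of the send call. This is a minimal sketch with hypothetical names (`Dispatcher`, `require_approval`); a real queue would live in the database rather than in a list.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Dispatcher:
    send: Callable[[str], None]
    require_approval: bool = True  # the one config flag
    pending: list[str] = field(default_factory=list)

    def submit(self, draft: str) -> str:
        """Queue the draft for review, or send it directly if approval is off."""
        if self.require_approval:
            self.pending.append(draft)
            return "queued"
        self.send(draft)
        return "sent"

    def approve(self, index: int = 0) -> None:
        """A reviewer releases a queued draft."""
        self.send(self.pending.pop(index))


outbox: list[str] = []
dispatcher = Dispatcher(send=outbox.append)
print(dispatcher.submit("Draft reply A"))  # held for review
dispatcher.approve()
dispatcher.require_approval = False
print(dispatcher.submit("Draft reply B"))  # goes straight out
```

Because both paths end in the same `send` call, switching to full automation does not touch the delivery code.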
3. Cost guardrails per workflow
Token budgets per execution, hard cutoffs, alerts at 50/80/100% of monthly budget. I have seen a single retry loop burn $400 in an hour. Never again.
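Both guardrails are a few lines each. The sketch below assumes token counts and spend figures are already available from the model provider's responses; `TokenBudget`, `BudgetExceeded`, and `crossed_thresholds` are hypothetical names.

```python
class BudgetExceeded(RuntimeError):
    pass


class TokenBudget:
    """Hard cap on tokens for a single execution."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record usage, refusing any call that would exceed the cap."""
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"would use {self.used + tokens} of {self.max_tokens} tokens"
            )
        self.used += tokens


ALERT_THRESHOLDS = (0.5, 0.8, 1.0)  # 50/80/100% of monthly budget


def crossed_thresholds(
    spend_before: float, spend_after: float, monthly_budget: float
) -> list[float]:
    """Thresholds passed by this spend update, so each alert fires once."""
    return [
        t for t in ALERT_THRESHOLDS
        if spend_before < t * monthly_budget <= spend_after
    ]


budget = TokenBudget(max_tokens=1_000)
budget.charge(600)
print(crossed_thresholds(40.0, 85.0, 100.0))
```

Calling `charge` before each model call, rather than after, is what turns a runaway retry loop into an exception instead of an invoice.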
Exit criterion: the system runs end to end on real production data, with logging, retries, idempotency, and a kill switch. Not perfect outputs yet, but the pipes are sound.
Phase 4: Evaluation (continuous, but formalized for 2 weeks)
Evaluation is not a phase you finish. It is a system you build once and keep running forever. But there is a discrete block of work to set it up, and that is what this phase is.
I build three layers of evaluation:
- Offline eval set. The 30 to 50 examples from the POC, grown to 100 to 300, with expected outputs and a scoring rubric. Run on every prompt or model change. This is your regression test.
- LLM-as-judge for open-ended outputs. For anything where there is no single correct answer (drafted emails, summaries, classifications with reasoning), I use a separate, stronger model with a calibrated rubric to score outputs. I have written about how to actually calibrate this so the judge does not just rubber-stamp. The short version: you score the judge against human labels on a held-out set, and you do not trust a judge you have not calibrated.
- Production telemetry. Every run logs inputs, outputs, model version, prompt version, latency, tokens, cost, and the downstream outcome (was the draft email sent as-is, edited, or rejected?). That last signal is gold. It is the closest thing to ground truth you get in production.
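Scoring the judge against human labels can start as a simple agreement count. This is a minimal sketch assuming pass/fail verdicts on the same held-out items; the function name and the choice of metrics are assumptions, and a fuller calibration would look at per-rubric-item agreement as well.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict:
    """Compare judge verdicts with human labels on the same held-out items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    agree = sum(h == j for h, j in pairs)
    # Judge passed something a human failed: the rubber-stamp risk
    false_pass = sum((not h) and j for h, j in pairs)
    human_fails = sum(not h for h in human)
    return {
        "agreement": agree / len(pairs),
        "false_pass_rate": false_pass / human_fails if human_fails else 0.0,
    }


human_labels = [True, True, False, False, True]
judge_labels = [True, False, True, False, True]
stats = judge_agreement(human_labels, judge_labels)
print(stats)
```

The false-pass rate is the number to watch: a judge that agrees with humans on the good outputs but waves through the bad ones still looks fine on raw agreement.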
The trap here is treating eval as a one-time gate. Models change. Prompts drift. Data shifts. The eval set has to be re-run on every change and the production telemetry has to feed back into growing the eval set. If a real production failure happens, it goes into the eval set the same day.
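Both halves of that loop can be automated cheaply. The sketch below assumes the eval set is stored as a JSON Lines file and that a pass rate is already computed elsewhere; the function names, the file layout, and the 2-point tolerance are illustrative assumptions.

```python
import json
from pathlib import Path


def passes_regression_gate(
    baseline: float, candidate: float, tolerance: float = 0.02
) -> bool:
    """Allow a change only if the pass rate did not drop by more than `tolerance`."""
    return candidate >= baseline - tolerance


def add_failure_to_eval_set(path: Path, input_text: str, expected: str) -> None:
    """Append a production failure to a JSON Lines eval file."""
    record = {"input": input_text, "expected": expected, "source": "production"}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


print(passes_regression_gate(baseline=0.90, candidate=0.85))
```

Running the gate in CI on every prompt or model change makes the re-run automatic rather than something someone has to remember.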
Exit criterion: the client can answer "is the system still working correctly?" without calling me.
Phase 5: Operations and Handoff (2 to 4 weeks)
This is the phase that separates a project that survives from one that dies six months in when something breaks and nobody knows where to look.
What I deliver in operations:
- Runbook. A markdown doc with the top 10 things that can go wrong, how to detect them, and how to fix them. Real ones, from this system, not generic.
- Dashboards. Usually a simple internal page or a Grafana board: success rate, cost per day, queue depth, latency P50/P95, model errors. The client looks at this weekly.
- Alerts. Pager-worthy alerts on hard failures (pipeline stopped, cost spike, eval regression). Low-noise. If alerts cry wolf, they get muted, and then the real failure goes unnoticed.
- Versioned prompts and configs. In git, with a changelog. Prompt changes are deploys, not Slack messages.
- A maintenance retainer or a clean exit. Either I stay on for a defined number of hours per month, or I hand off to an internal team with a transition period. No silent fade-outs. Those end badly for both sides.
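The dashboard numbers listed above (success rate, cost, latency P50/P95) can be computed straight from the run logs. This is a minimal sketch assuming each run is a dict with `ok`, `cost_usd`, and `latency_ms` keys, which are hypothetical field names; it uses the nearest-rank percentile definition, one of several in common use.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile, with p in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs: list[dict]) -> dict:
    """Aggregate run logs into the headline dashboard numbers."""
    latencies = [r["latency_ms"] for r in runs]
    return {
        "success_rate": sum(r["ok"] for r in runs) / len(runs),
        "total_cost_usd": round(sum(r["cost_usd"] for r in runs), 4),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
    }


runs = [
    {"ok": True, "cost_usd": 0.01, "latency_ms": 100},
    {"ok": True, "cost_usd": 0.01, "latency_ms": 200},
    {"ok": True, "cost_usd": 0.01, "latency_ms": 300},
    {"ok": False, "cost_usd": 0.01, "latency_ms": 1000},
]
summary = summarize(runs)
print(summary)
```

With run logs already in Postgres, the same aggregation is usually a single SQL query behind the dashboard page.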
What I would do differently if I were starting over
A few opinions, after running this loop enough times:
- Spend more on scoping, less on the POC. A bad scope makes a great POC useless. I have never regretted an extra week of scoping. I have regretted skipping it.
- Pick the boring model. Use the strongest reliable model in your tier (Claude Sonnet or GPT-4 class) until you have a reason not to. Optimizing for cost too early picks fights you cannot win yet.
- Build the eval before the agent. Sounds backwards. It is not. If you cannot define what good looks like, you cannot build toward it.
- Treat the first 30 days in production as part of the build. Most of the real bugs surface there. Budget for it. Tell the client.
- Say no more often. The projects I have turned down have, on average, been better decisions than the ones I took. Wrong-shaped projects do not get better with effort.
The shape of this process is not unique to my work. What is mine is the calibration: which phases I now know to invest in, which exit criteria I refuse to skip, and which mistakes I have made enough times to write them down. That last category is the actual deliverable when you hire someone like me, more than the code.
If you are scoping an AI implementation and want a second pair of eyes on it before you commit budget, I am happy to look at it. Reach out at lazar-milicevic.com/#contact, or browse the rest of the blog for more on evaluation, RAG, and getting agents into production.
Frequently asked questions
What are the phases of a complete AI implementation process?
I use a five-phase process for every AI project: scoping, proof of concept (POC), integration, evaluation, and operations. Each phase has a strict exit criterion, and if we can't meet it, we don't move forward. Scoping takes 1-2 weeks and produces a written document, the POC takes 2-4 weeks to cheaply prove feasibility, and integration takes 3-8 weeks where most of the real work happens. This staged approach with hard gates has saved more projects than any architecture decision I've ever made.
Why do most AI projects fail in production?
In my experience, AI projects rarely fail because of the model itself. They fail in the six weeks before anyone writes a prompt (poor scoping) and in the six weeks after the demo when nobody owns the system in production. The two most common root causes are chasing an AI-shaped fantasy instead of a narrow, shippable slice, and building a system you cannot evaluate or trust. A model that works 90% of the time is useless if you have no signal to catch the other 10%.
What is the minimum ROI threshold to justify building an AI automation?
I typically tell clients to wait if the projected savings are under $20,000 per year. The math is straightforward: hours saved per month times burdened hourly cost, minus realistic infrastructure and maintenance costs. Below that threshold, the build effort and ongoing babysitting simply aren't worth it at current tooling maturity. Larger workflows with clear, measurable hours-or-dollars metrics are where AI automation actually pays off.
What should a proof of concept (POC) for an AI project actually include?
A proper AI POC is a thin script or notebook, not a product, and it uses the real production-grade model (Claude Sonnet, GPT-4 class, or local Llama/Qwen via Ollama) rather than a cheaper proxy that lies about feasibility. It needs 30-50 hand-curated real examples from the client's actual data with expected outputs written by a domain expert, plus a rough evaluation harness (even just a spreadsheet). The POC must answer four questions: does it produce the right output shape, right content on easy cases, handle the long tail safely, and can we detect when it's wrong? If you can't detect failures, the project cannot ship.
What is the best tech stack for production AI applications?
My default production stack is deliberately boring: Postgres with pgvector for storage and retrieval (no separate vector DB until pgvector measurably stops scaling), hybrid search combining BM25 and dense embeddings with reciprocal rank fusion, AWS Lambda plus EventBridge for scheduled and event-driven compute, and Next.js with server actions when a frontend is needed. For orchestration I start with explicit sequential code and only reach for LangGraph or a state machine when the workflow genuinely needs branches and loops. I also treat idempotency keys on every external action and a human-in-the-loop approval seam as non-negotiable, even when full automation is the eventual goal.