What Real Generative AI Consulting Looks Like in 2026

Last month a founder emailed me asking for a "quick AI agent" that would read his support tickets and draft responses. Two hours into the scoping call, we had uncovered four separate systems, three data silos, and a compliance constraint that would have killed the project in week four. The "quick agent" was never the problem. The scoping was.
That gap between what someone imagines a generative AI engagement is and what it actually involves is where most projects fail. So I want to lay out exactly how I run a GenAI implementation from first call to handoff, what gets delivered at each stage, and the decisions that actually matter.
The Scoping Call: Killing Bad Projects Before They Start
A good scoping call should feel like an interrogation, not a sales pitch. I spend the first 60 to 90 minutes trying to kill the project. If it survives that, it has a real chance of shipping.
The questions I ask, in order:
- What is the manual process today, and who does it? If nobody is doing the task manually right now, there is no baseline to beat. Automation without a manual precedent is product development, not consulting. Those are different engagements with different timelines and risk profiles.
- What does it cost you per month in hours? This gives me the ROI ceiling. If the answer is "a few hours a week," the project probably cannot justify engineering cost. I need to see at least 20 to 30 hours of monthly manual effort before an AI automation is worth building.
- Where does the data live? Confluence, Zendesk, a Postgres database, scattered PDFs, someone's Notion. Each source has a different extraction pattern and a different latency profile. I have walked into projects where the "knowledge base" was 400 untagged Google Docs. That is a data engineering project before it is an AI project.
- What happens when the model is wrong? This is the question that surfaces the real risk tolerance. If a wrong answer costs the client a customer or a legal exposure, the architecture needs human-in-the-loop by design, not as an afterthought.
The deliverable from scoping is a one-page decision document. Problem statement, current manual cost, proposed automation, data sources, risk level, and a rough build estimate. If we cannot fill that page with real numbers, the project is not ready.
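As a rough illustration, the decision document can be treated as a structure with a readiness check. This is a minimal sketch, not a template from any real engagement: the class name, field names, and risk labels are hypothetical, and the only number taken from the text is the 20-hour lower bound on monthly manual effort.

```python
from dataclasses import dataclass, field

# Lower bound from the text: below roughly 20 hours a month,
# the automation rarely justifies its engineering cost.
MIN_MONTHLY_HOURS = 20


@dataclass
class DecisionDocument:
    """Hypothetical structure for the one-page scoping deliverable."""
    problem_statement: str
    monthly_manual_hours: float          # measured, not guessed
    proposed_automation: str
    data_sources: list = field(default_factory=list)
    risk_level: str = ""                 # assumed labels: low / medium / high
    build_estimate_days: float = 0.0

    def blockers(self) -> list:
        """Return the reasons the project is not ready to build."""
        issues = []
        if not self.problem_statement.strip():
            issues.append("no problem statement")
        if self.monthly_manual_hours < MIN_MONTHLY_HOURS:
            issues.append("manual effort below 20 hours per month")
        if not self.data_sources:
            issues.append("data sources not identified")
        if self.risk_level not in {"low", "medium", "high"}:
            issues.append("risk level not assessed")
        if self.build_estimate_days <= 0:
            issues.append("no build estimate")
        return issues
```

An empty list from `blockers()` means the page is filled with real numbers. Anything else is a reason to keep scoping.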
Architecture: The Decisions That Actually Matter
Most GenAI architecture debates I see online are about models. Which model is smartest, which is cheapest, which has the longest context window. Those matter, but they are secondary. The decisions that determine whether your system survives six months in production are structural.
Synchronous API call vs. scheduled worker. A user types a question and waits for an answer. Or a cron job fires every 15 minutes, pulls new records, processes them, and writes results somewhere. These are completely different architectures. The first needs low latency, streaming, and a fallback strategy. The second needs idempotency, retry logic, and observability. I build more of the second than the first because scheduled workers are easier to make reliable and they fit B2B workflows where batch processing is acceptable.
Single LLM call vs. multi-agent pipeline. I have built both. The BizFlowAI ContentStudio is a multi-agent system: one agent researches, one writes, one optimizes for SEO, one handles publishing. Each step has its own prompt, its own validation, and its own retry logic. This is more complex to build and debug, but it gives you control over each stage. A single-call approach is fine for simple extraction or summarization tasks. The moment you have three or more distinct cognitive steps, a pipeline wins.
Serverless vs. persistent containers. For event-driven workloads (a ticket arrives, trigger processing), serverless on AWS is my default. Lambda functions triggered by EventBridge, writing to Postgres, with S3 for any large blob storage. You scale to zero between events, which means your idle cost is near nothing. The trade-off is cold starts and Lambda's 15-minute execution limit. If your workflow takes 20 minutes to run, you need a container or a Step Functions workflow that chunks the work.
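One way to stay under an execution limit is to process items until the remaining time runs low, then hand back a cursor for the next invocation. The sketch below assumes a list of work items and a caller that re-invokes with the returned index. The function name and the 30-second safety margin are illustrative choices, not values from a real deployment.

```python
from typing import Callable, Sequence

SAFETY_MARGIN_MS = 30_000  # assumed buffer to stop well before the hard timeout


def process_chunk(
    items: Sequence[str],
    start: int,
    handle: Callable[[str], None],
    remaining_ms: Callable[[], int],
) -> int:
    """Process items from `start` until the time budget runs low.

    Returns the index of the next unprocessed item. When it equals
    len(items), the work is finished; otherwise the caller schedules
    another run starting from that index.
    """
    i = start
    while i < len(items) and remaining_ms() > SAFETY_MARGIN_MS:
        handle(items[i])
        i += 1
    return i
```

Inside a Python Lambda handler, `remaining_ms` can be the context object's `get_remaining_time_in_millis` method. The returned index would be stored somewhere durable (or passed along as workflow state) so the next run resumes where this one stopped.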
Here is how I think about the four most common patterns I ship:
| Pattern | Best for | Latency need | Cost profile |
|---|---|---|---|
| Sync API + RAG | Internal knowledge tools, search assistants | Under 3 seconds | Per-request, scales with usage |
| Scheduled batch worker | Content generation, data enrichment, reporting | Minutes to hours | Fixed, predictable |
| Event-driven pipeline | Support automation, lead processing, monitoring | Under 60 seconds | Bursty, scales with events |
| Multi-agent loop | Complex creative or analytical tasks | Variable | Highest, needs cost controls |
The table is not exhaustive but it covers roughly 90% of the B2B engagements I take on. If your use case does not fit any of these, you are likely building a product, not automating a process.
Model Selection: Stop Optimizing for the Wrong Thing
I choose models based on three criteria, and none of them is benchmark scores.
Task complexity. Is the model doing classification, extraction, summarization, or generative reasoning? For the first three, a smaller, cheaper model is almost always sufficient. For generative reasoning (write a blog post, analyze a dataset and recommend actions), you need a frontier model. I use Claude Sonnet for most production work because it balances quality and cost well. For simpler tasks, I route to Haiku or a local model via Ollama.
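The routing rule above fits in a few lines. This is a sketch only: the tier labels are placeholders rather than real model identifiers, and a production router would likely also consider input size and failure history.

```python
from enum import Enum


class Task(Enum):
    CLASSIFICATION = "classification"
    EXTRACTION = "extraction"
    SUMMARIZATION = "summarization"
    GENERATIVE_REASONING = "generative_reasoning"


# Placeholder tier labels, not real model IDs. Map them to whatever
# small and frontier models you actually run.
SMALL_TIER = "small-model"
FRONTIER_TIER = "frontier-model"


def pick_tier(task: Task) -> str:
    """Send only generative reasoning to the expensive tier."""
    if task is Task.GENERATIVE_REASONING:
        return FRONTIER_TIER
    return SMALL_TIER
```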
Context window requirements. How much text does the model need to process in a single call? If you are stuffing 100K tokens of context into every call, you are paying for those tokens on every request. RAG exists precisely to avoid this. A well-designed RAG pipeline with pgvector and hybrid search can get you to 80 to 90% of the accuracy of full-context stuffing at 5% of the token cost. I have written about this in detail before, and the trade-off is real: RAG adds engineering complexity but saves enormous runtime cost.
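To make the token arithmetic concrete, here is a small cost comparison. The price per million tokens and the request volume are hypothetical and chosen only for illustration. The 100K-token context and the 5% ratio come from the paragraph above.

```python
def monthly_input_cost(
    tokens_per_request: int,
    requests_per_month: int,
    usd_per_million_tokens: float,
) -> float:
    """Input-token cost per month, ignoring output tokens and caching."""
    return tokens_per_request * requests_per_month * usd_per_million_tokens / 1_000_000


PRICE = 3.0          # hypothetical USD per million input tokens
REQUESTS = 10_000    # hypothetical monthly request volume

full_context = monthly_input_cost(100_000, REQUESTS, PRICE)  # stuff everything in
retrieved = monthly_input_cost(5_000, REQUESTS, PRICE)       # 5% of the tokens via RAG

print(f"full context: ${full_context:,.0f} / month")
print(f"retrieved:    ${retrieved:,.0f} / month")
```

The gap scales linearly with request volume, which is why the engineering cost of a retrieval layer tends to pay off as usage grows.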
Consistency vs. creativity. Production systems need consistency. If your output varies wildly between runs with the same input, you cannot build reliable downstream logic on top of it. Temperature settings, structured output schemas, and prompt engineering all serve consistency. I default to temperature 0.0 or 0.1 for extraction and classification, and 0.3 to 0.5 for generation where some variation is acceptable.
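A sketch of how those defaults and a structured-output check might sit together in code. The schema keys are hypothetical, and the generation value of 0.4 is simply a point inside the 0.3 to 0.5 range mentioned above.

```python
import json

# Per-task defaults following the ranges in the text.
DEFAULT_TEMPERATURE = {
    "extraction": 0.0,
    "classification": 0.0,
    "generation": 0.4,
}

REQUIRED_KEYS = {"category", "confidence"}  # hypothetical output schema


def parse_classification(raw: str) -> dict:
    """Parse a model reply and reject anything that is off-schema.

    Raises ValueError for non-JSON input (json.JSONDecodeError is a
    subclass of ValueError), for non-object JSON, and for missing keys.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

Failing loudly at this boundary is what lets downstream logic assume a fixed shape. A rejected reply can be retried or routed to a human instead of silently corrupting the output table.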
The model landscape will change. New models will ship this quarter. The selection framework does not change. Pick the cheapest model that reliably does the task, and build your abstraction layer so you can swap models without rewriting your pipeline.
Integration Patterns That Survive Production
The model is the easy part. The hard part is wiring it into the client's existing systems so it actually runs without someone babysitting it.
Pattern 1: Read-process-write to a database. This is the simplest and most reliable integration. Your worker reads new rows from a source table, processes them through the LLM, and writes results to an output table. The client's existing application reads from the output table. No new API surface, no new infrastructure for the client to maintain. I used this pattern for an analytics migration that saved a client 30k to 60k EUR annually. The AI system replaced a manual reporting workflow and wrote directly to the same Postgres tables the dashboard was already reading from.
Pattern 2: Webhook in, webhook out. An external system (Zendesk, Slack, a CRM) sends a webhook to your endpoint, your Lambda processes the event through the LLM, and responds or calls back. I built a serverless AWS and Zendesk integration this way that achieved first-ever SLA compliance for the client. The pattern works because it is event-driven, scales to zero, and has clear failure handling.
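A minimal sketch of the webhook-in side, assuming the endpoint sits behind an API Gateway proxy integration so the request body arrives as a string in `event["body"]`. The payload shape (`ticket.description`) and the `draft_reply` stub are hypothetical; a real integration would follow the sending system's webhook format and call an actual model client.

```python
import json


def draft_reply(ticket_text: str) -> str:
    """Stand-in for the LLM call (hypothetical); swap in a real client."""
    return f"Draft reply for: {ticket_text[:80]}"


def handler(event: dict, context: object) -> dict:
    """Lambda entry point for a webhook delivered via API Gateway."""
    try:
        payload = json.loads(event.get("body") or "{}")
        # Assumed payload shape; adjust to the sender's real schema.
        ticket_text = str(payload["ticket"]["description"])
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed input gets a clear 400 instead of an unhandled crash.
        return {"statusCode": 400, "body": json.dumps({"error": "bad payload"})}

    reply = draft_reply(ticket_text)
    return {"statusCode": 200, "body": json.dumps({"draft": reply})}
```

Returning an explicit 400 for bad input and reserving unhandled exceptions for genuine failures keeps the failure handling clear for both the sender and the alarms.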
Pattern 3: Scheduled publishing pipeline. This is what BizFlowAI ContentStudio does. A scheduler triggers research agents, which feed writing agents, which feed optimization agents, which publish to target systems via their APIs. The entire loop runs unattended. The key engineering challenge is making each stage idempotent: if stage 3 fails and retries, stages 1 and 2 should not re-run. I handle this with a job state table in Postgres that tracks each item's progress through the pipeline.
Here is a simplified version of the state tracking pattern I use in every multi-stage system:
```sql
CREATE TABLE pipeline_jobs (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_id   TEXT NOT NULL,              -- ID of the item in the source system
    stage       TEXT NOT NULL CHECK (stage IN ('research', 'draft', 'optimize', 'publish')),
    status      TEXT NOT NULL DEFAULT 'pending'
                CHECK (status IN ('pending', 'running', 'done', 'failed')),
    attempts    INT NOT NULL DEFAULT 0,     -- attempt counter used for the retry limit
    result      JSONB,
    error       TEXT,
    created_at  TIMESTAMPTZ DEFAULT now(),
    updated_at  TIMESTAMPTZ DEFAULT now(),  -- default applies on insert only; set it explicitly on update
    UNIQUE (source_id, stage)               -- one row per item per stage
);
```
Every worker checks this table before processing. If the stage is already done, it skips. If failed with fewer than 3 attempts, it retries. This is not sophisticated engineering. It is the boring, reliable pattern that keeps a system running for months without intervention.
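The check each worker performs can be sketched as follows. To keep the example runnable without a database server, it uses an in-memory SQLite table with only the columns the check needs. That is a stand-in for the Postgres table above, and the function names are hypothetical.

```python
import sqlite3

MAX_ATTEMPTS = 3  # retry limit from the text


def open_demo_db() -> sqlite3.Connection:
    """In-memory SQLite stand-in for the Postgres job table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE pipeline_jobs (
            source_id TEXT NOT NULL,
            stage     TEXT NOT NULL,
            status    TEXT NOT NULL DEFAULT 'pending',
            attempts  INTEGER NOT NULL DEFAULT 0,
            UNIQUE (source_id, stage)
        )
        """
    )
    return conn


def should_run(conn: sqlite3.Connection, source_id: str, stage: str) -> bool:
    """Decide whether a worker should process this item at this stage."""
    row = conn.execute(
        "SELECT status, attempts FROM pipeline_jobs WHERE source_id = ? AND stage = ?",
        (source_id, stage),
    ).fetchone()
    if row is None:
        return True                       # never seen: first attempt
    status, attempts = row
    if status == "pending":
        return True
    if status == "failed":
        return attempts < MAX_ATTEMPTS    # retry until the limit is reached
    return False                          # done or running: skip


def record(conn: sqlite3.Connection, source_id: str, stage: str, status: str) -> None:
    """Upsert the job row; an attempt is counted each time a run starts."""
    bump = 1 if status == "running" else 0
    conn.execute(
        """
        INSERT INTO pipeline_jobs (source_id, stage, status, attempts)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (source_id, stage)
        DO UPDATE SET status = excluded.status,
                      attempts = pipeline_jobs.attempts + excluded.attempts
        """,
        (source_id, stage, status, bump),
    )
```

A worker calls `should_run` first, marks the row `running`, does the work, then records `done` or `failed`. Against Postgres, the read and the `running` update would normally happen in one transaction (for example with `SELECT ... FOR UPDATE`) so two workers cannot claim the same row.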
Handoff: What the Client Actually Gets
A working system is not a deliverable. A working system plus documentation, monitoring, and a handoff session is a deliverable.
What I hand over at the end of an engagement:
- A running system in the client's infrastructure. Not my sandbox. Their AWS account, their database, their API keys. I deploy everything to their environment and verify it works end to end.
- A system document, not a user manual. This covers the architecture, data flow, model choices, cost per run, failure modes, and where to look when something breaks. It is written for the next engineer, not for a non-technical stakeholder.
- Cost monitoring from day one. I set up CloudWatch alarms or equivalent that trigger if daily LLM spend exceeds a threshold. Runaway token cost is the number one production issue I see. A bug in a retry loop can burn through hundreds of dollars in API costs in an afternoon. Monitoring is not optional.
- A 30-day stabilization period. After handoff, I monitor the system and fix issues that surface. No system is perfect on day one. Edge cases in real data will break assumptions you made during development. The stabilization period is where those get caught.
The biggest mistake I see from consultants and agencies is treating handoff as a single meeting. It is not. It is a phase that takes two to four weeks, during which the system is live but still actively monitored and adjusted.
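Alongside the alarms described in the cost monitoring item, a guard inside the worker can stop calls once a daily budget is spent. This is an illustrative in-memory sketch with a hypothetical class name. A real system would persist the running total, since serverless invocations do not share memory.

```python
from collections import defaultdict
from datetime import date
from typing import Optional


class DailySpendGuard:
    """Tracks LLM spend per day and refuses calls past a limit."""

    def __init__(self, daily_limit_usd: float) -> None:
        self.daily_limit_usd = daily_limit_usd
        self._spent = defaultdict(float)  # date -> USD spent that day

    def record(self, cost_usd: float, day: Optional[date] = None) -> None:
        """Add the cost of one completed call to that day's total."""
        self._spent[day or date.today()] += cost_usd

    def allow_call(self, day: Optional[date] = None) -> bool:
        """True while the day's spend is still under the limit."""
        return self._spent.get(day or date.today(), 0.0) < self.daily_limit_usd
```

Checking `allow_call()` before every model request turns a runaway retry loop into a stopped worker and an alert, rather than an afternoon of API charges.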
The ROI Conversation: Real Numbers, Not Projections
I do not promise ROI in a scoping call. I promise to measure it after 60 days of production use.
The 73 hours per month I saved with a four-system automation ecosystem was not a projection. It was measured against the baseline of manual effort before the system was deployed. The 192% year-one ROI was calculated from actual API costs and infrastructure spend, not estimates.
Any generative AI consulting engagement should be able to answer this question within 60 days of going live: how many hours of manual work did this eliminate, and what did it cost to run? If you cannot answer that with real numbers, the system is not done.
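The calculation behind that question is simple enough to write down. The sketch below uses the common definition of ROI as net benefit divided by total cost; it is not necessarily the exact method behind the figures quoted above, and the numbers in the usage example are invented for illustration.

```python
def roi_percent(
    hours_saved_per_month: float,
    hourly_rate: float,
    months: int,
    build_cost: float,
    run_cost_per_month: float,
) -> float:
    """ROI over a period, as (benefit - cost) / cost * 100."""
    benefit = hours_saved_per_month * hourly_rate * months
    cost = build_cost + run_cost_per_month * months
    if cost <= 0:
        raise ValueError("total cost must be positive")
    return (benefit - cost) / cost * 100


# Invented inputs: 40 hours/month at 50 per hour for 12 months,
# 6,800 to build and 100 per month to run.
example = roi_percent(40, 50, 12, 6_800, 100)
print(f"{example:.0f}% ROI")  # prints "200% ROI"
```

The inputs that matter are measured ones: hours from the pre-deployment baseline, run cost from actual API and infrastructure bills.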
What I Would Do Differently
If I were hiring someone for a GenAI implementation today, here is what I would look for, based purely on what separates the engagements that ship from the ones that stall.
Insist on a one-week scoping sprint before any build commitment. If the consultant wants to skip scoping and jump into building, they are going to build the wrong thing. The scoping sprint should produce a decision document, a rough architecture, and a realistic cost range. Pay for it. It is the cheapest insurance you will buy on the project.
Reject any architecture that depends on a single model or a single API. Models change. APIs change. Prices change. Your system should be abstracted enough that swapping from Claude to GPT to a local model is a configuration change, not a rewrite. This does not mean building elaborate abstraction layers on day one. It means not hard-coding model-specific logic into your business rules.
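A thin interface is usually enough to keep model-specific code out of business rules. The sketch below is one possible shape, with hypothetical names throughout. The `EchoModel` class is a fake backend for illustration, where a real one would wrap a vendor SDK behind the same method.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only thing business logic is allowed to know about a model."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Fake backend used here so the example runs without any API."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Backends are looked up by a config key, so swapping one is a config change.
REGISTRY = {
    "primary": EchoModel("primary"),
    "fallback": EchoModel("fallback"),
}


def summarize(ticket: str, model: TextModel) -> str:
    """Business logic depends on the interface, not on a vendor."""
    return model.complete(f"Summarize this ticket:\n{ticket}")
```

With this in place, `summarize(text, REGISTRY[config["model"]])` keeps working when the entry behind `"primary"` is replaced, provided the new backend implements `complete`.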
Ask for a production system they have run for more than six months. Not a demo, not a prototype, not a POC. A system that has been running unattended for months. The bugs you hit at month four are completely different from the bugs you hit in week two. If the person you are hiring has only built demos and POCs, they will not anticipate those issues.
The best engagement I run is one where the client barely notices the system after week six. It runs, it saves time, it does not break, and nobody thinks about it. That is the goal. Not a flashy demo, not a clever agent architecture, not the newest model. A boring, reliable system that quietly eliminates manual work month after month.
If you are scoping a GenAI project and want a second set of eyes on the architecture, or you need someone to actually build it, reach out at lazar-milicevic.com/#contact. I take on a small number of engagements at a time, and I am upfront about whether a project is ready to build or needs more scoping first.
Frequently asked questions
How much does it cost to implement a generative AI solution for my business?
I need to see at least 20 to 30 hours of monthly manual effort before a generative AI automation is worth building. The ROI ceiling is determined by what your current manual process costs in hours, and if the answer is just a few hours a week, the project likely cannot justify the engineering investment. During scoping, I produce a one-page decision document with real numbers: problem statement, current manual cost, proposed automation, data sources, risk level, and a rough build estimate. That gives you what you need to make an informed decision before committing.
What should I expect during a generative AI consulting scoping call?
A proper scoping call should feel like an interrogation, not a sales pitch. I spend the first 60 to 90 minutes actively trying to kill the project because, if it survives that, it has a real chance of shipping. I ask what the manual process is today, what it costs monthly in hours, where the data lives, and what happens when the model is wrong. The deliverable is a one-page decision document with real numbers; if we cannot fill that page, the project is not ready to build.
How do I choose the right LLM model for my AI automation project?
I choose models based on task complexity, context window requirements, and how consistent the output needs to be, not benchmark scores. For classification, extraction, and summarization, smaller and cheaper models like Haiku or local models via Ollama are almost always sufficient. For generative reasoning tasks like writing or analysis, a frontier model is necessary, and I use Claude Sonnet for most production work because it balances quality and cost well. A well-designed RAG pipeline with pgvector and hybrid search can also dramatically reduce the context tokens you need to pay for on every request.
Should I use a single LLM call or a multi-agent pipeline for my AI project?
A single-call approach works fine for simple extraction or summarization tasks. The moment you have three or more distinct cognitive steps, like researching, writing, optimizing, and publishing, a multi-agent pipeline wins because it gives you control over each stage with its own prompt, validation, and retry logic. I built the BizFlowAI ContentStudio as a multi-agent system for exactly this reason. Multi-agent pipelines are more complex to build and debug, but they are the right architecture when the task has clearly separable stages.
What data preparation do I need before hiring a generative AI consultant?
Before any AI work begins, you need to know exactly where your data lives: Confluence, Zendesk, a Postgres database, scattered PDFs, or someone's Notion. Each source has a different extraction pattern and latency profile, and I have walked into projects where the knowledge base was 400 untagged Google Docs, which is a data engineering project before it is an AI project. The cleaner and more centralized your data is, the faster and cheaper the implementation will be. If your data is scattered across silos with no tagging or structure, expect significant data engineering work before the AI layer can be built.