
15 Real AI Implementations and What They Shipped

By Lazar Milicevic · June 30, 2026 · 9 min read

Most "AI case study" posts read like marketing decks. I want to do the opposite here. These are 15 systems I've actually built or shipped variations of, anonymized where needed, with the stack, the architecture choice that mattered, and the outcome the business actually cared about. Some are boring. The boring ones usually paid the most.

RAG and Knowledge Systems

The first cluster of work I get hired for is almost always retrieval. Companies have documents, tickets, and tribal knowledge sitting in five systems. They want one answer interface. The hard part is never the LLM; it's the retrieval quality and the eval loop.

1. Support knowledge RAG over Zendesk + Confluence

A B2B SaaS team had 11k macros, runbooks, and resolved tickets scattered across Zendesk and Confluence. Agents were copy-pasting from memory. We built a hybrid search RAG: pgvector for dense, Postgres full-text for sparse, then Reciprocal Rank Fusion to merge. Reranker was a small cross-encoder running on a single GPU box.

Stack: Postgres + pgvector, OpenAI embeddings, Claude for generation, a thin FastAPI service, Redis for cache.

Outcome that mattered: average handle time dropped on tier-1 tickets, and the team stopped onboarding new agents with a 4-week shadow period. The eval harness with 200 golden questions was what made leadership trust it.
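
The merge step in the hybrid search above is small enough to show. This is a minimal sketch of Reciprocal Rank Fusion, not the production code: the document IDs are made up, and k=60 is the commonly used default rather than a value from this project.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense = ["doc_a", "doc_b", "doc_c"]   # hypothetical pgvector (dense) results
sparse = ["doc_b", "doc_d", "doc_a"]  # hypothetical full-text (sparse) results
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Because RRF only looks at ranks, the dense and sparse scores never need to be normalized against each other. The fused list is what would go to the reranker.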

2. Contract review assistant with clause-level retrieval

Legal team wanted "find me every MSA where indemnity is capped below 2x fees." Naive chunking failed because clauses span page boundaries. I switched to clause-aware chunking (split on clause headings, keep parent context), stored clause type as metadata, and retrieved by metadata filter first, semantic second.

The lesson: when your domain has structure, use the structure. Generic 512-token chunking throws away the only signal that matters.
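
To make "split on clause headings, keep parent context" concrete, here is a rough sketch. The heading pattern is an assumption (numbered headings like "7. Indemnification" or "7.2 Cap on Liability" on their own line), and the ClauseChunk fields are illustrative names, not the schema used on the project.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed heading style: a clause number, optional dot, then a capitalized title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?[ \t]+([A-Z][^\n]*)$", re.MULTILINE)

@dataclass
class ClauseChunk:
    number: str
    title: str
    parent_title: Optional[str]  # parent clause title, kept as context
    text: str

def chunk_by_clause(contract_text):
    """Split on clause headings so a clause is never cut at a page or token boundary."""
    matches = list(HEADING.finditer(contract_text))
    titles = {}
    chunks = []
    for i, m in enumerate(matches):
        number, title = m.group(1), m.group(2).strip()
        titles[number] = title
        end = matches[i + 1].start() if i + 1 < len(matches) else len(contract_text)
        body = contract_text[m.end():end].strip()
        parent = number.rsplit(".", 1)[0] if "." in number else None
        chunks.append(ClauseChunk(number, title, titles.get(parent), body))
    return chunks

sample = (
    "7. Indemnification\n"
    "Each party shall indemnify the other.\n"
    "7.2 Cap on Liability\n"
    "Liability is capped at 2x fees.\n"
)
chunks = chunk_by_clause(sample)
```

Real contracts need a sturdier heading detector than one regex, but the shape is the point: the clause number, title, and parent title become metadata you can filter on before any semantic search runs.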

3. Internal docs Q&A with citations

This one is unglamorous and I've built it five times. The trick that earns trust is forcing citations. Every answer renders with inline [1][2] markers linking back to the source paragraph. If the model can't cite, it shouldn't answer. I enforce this with a structured output schema and a post-generation check that rejects responses where any claim lacks a span match in the retrieved context.
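
A post-generation check like the one described can be very plain. The sketch below assumes the structured output is a list of claims, each carrying a source number and a quoted span; the Claim fields and the whitespace-insensitive substring match are my simplifications for illustration.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str       # sentence shown to the user
    source_id: int  # rendered as [n], index into the retrieved paragraphs
    quote: str      # span the model says supports the claim

def _norm(s):
    # Lowercase and collapse whitespace so line breaks don't cause false rejects.
    return " ".join(s.lower().split())

def failed_claims(claims, retrieved):
    """Return claims whose quoted span is not found in the cited paragraph."""
    bad = []
    for c in claims:
        source = retrieved.get(c.source_id)
        if source is None or not c.quote.strip() or _norm(c.quote) not in _norm(source):
            bad.append(c)
    return bad

def accept(claims, retrieved):
    # An answer with no claims, or with any unsupported claim, is rejected.
    return bool(claims) and not failed_claims(claims, retrieved)

retrieved = {1: "Password resets expire after 24 hours.", 2: "SSO is available on the Enterprise plan."}
ok = [Claim("Reset links last one day.", 1, "expire after 24 hours")]
bad = [Claim("SSO is free for everyone.", 2, "SSO is free on all plans")]
```

A verbatim span match is strict on purpose. It will reject some correct paraphrases, which is usually the right trade when the goal is trust.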

Autonomous Content and SEO Systems

This is where I spend most of my own R&D budget. ContentStudio inside BizFlowAI is the production version.

4. ContentStudio: end-to-end content + SEO machine

The pipeline runs unattended on a schedule: keyword research, intent classification, outline generation, draft, internal-link planning, schema markup, publish via CMS API, then a measurement loop that pulls GSC data and feeds it back into topic selection.

EventBridge cron
  -> keyword harvester (Lambda)
  -> intent + difficulty scorer
  -> topic queue (Postgres)
  -> generator (multi-agent: researcher, writer, editor, SEO)
  -> publisher (CMS API)
  -> GSC poller (7/30/90 day windows)
  -> learning store -> back into topic selection

What matters in production: the editor agent has veto power, and the SEO agent runs structured checks (title length, H1 presence, internal links count, schema validity) before publish. No "vibes-based" gate.
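
For a sense of what a non-vibes gate looks like, here is a sketch of the four checks using only the standard library. The thresholds (60-character title, at least 3 internal links) and the failure labels are placeholders I picked, and "schema validity" is reduced here to "the JSON-LD parses and has an @type", which is weaker than a full schema.org validation.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlparse

class _PageScan(HTMLParser):
    def __init__(self, site_host):
        super().__init__()
        self.site_host = site_host
        self.title, self.h1_count, self.internal_links = "", 0, 0
        self.jsonld_blocks = []
        self._in = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self.h1_count += 1
        elif tag == "title":
            self._in = "title"
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in = "jsonld"
            self.jsonld_blocks.append("")
        elif tag == "a":
            href = attrs.get("href") or ""
            relative = href.startswith("/") and not href.startswith("//")
            if relative or urlparse(href).netloc == self.site_host:
                self.internal_links += 1

    def handle_endtag(self, tag):
        if tag in ("title", "script"):
            self._in = None

    def handle_data(self, data):
        if self._in == "title":
            self.title += data
        elif self._in == "jsonld":
            self.jsonld_blocks[-1] += data

def prepublish_failures(html, site_host, min_internal_links=3, max_title_len=60):
    """Return a list of failed check names; an empty list means publish."""
    scan = _PageScan(site_host)
    scan.feed(html)
    failures = []
    title = scan.title.strip()
    if not title or len(title) > max_title_len:
        failures.append("title_length")
    if scan.h1_count != 1:
        failures.append("h1_presence")
    if scan.internal_links < min_internal_links:
        failures.append("internal_links")
    if not scan.jsonld_blocks:
        failures.append("schema_missing")
    for block in scan.jsonld_blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            failures.append("schema_invalid_json")
            continue
        if not isinstance(data, dict) or "@type" not in data:
            failures.append("schema_missing_type")
    return failures
```

The useful property is that every failure has a name. A named failure can be logged, counted, and handed back to the writer agent as a specific fix request.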

5. Programmatic SEO with entity templates

For a marketplace client variant, we generated thousands of city + service pages. The mistake everyone makes: same template, swap nouns, get deindexed. We solved it by pulling unique entity data per page (real local statistics, real reviews summarized, real FAQ from internal search logs), then constraining the LLM to only use that data. The model writes the prose; the data is the differentiator.

6. Self-learning content loop

Same site, but the topic selection adapts. Each published post is tracked for 90 days. If a post hits page 2 for a query it wasn't targeting, we spin up a follow-up post targeting that exact long-tail. If a post stalls at zero impressions for 60 days, it goes into the prune queue. This is the loop that compounds.
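
The two rules in that loop fit in one function. This is a sketch under assumptions: "page 2" is taken to mean an average position above 10 and up to 20, the QueryStat fields are a simplified stand-in for what GSC returns, and the action names are invented.

```python
from dataclasses import dataclass

TRACKING_DAYS = 90   # each post is tracked for 90 days
STALL_DAYS = 60      # zero impressions for this long means prune

@dataclass
class QueryStat:
    query: str
    avg_position: float  # average search position for this query
    impressions: int

def plan_actions(target_query, days_live, stats):
    """Return (action, detail) tuples for one published post."""
    if days_live > TRACKING_DAYS:
        return []  # outside the tracking window, leave the post alone
    if days_live >= STALL_DAYS and sum(s.impressions for s in stats) == 0:
        return [("prune", target_query)]
    actions = []
    for s in stats:
        on_page_two = 10 < s.avg_position <= 20
        if s.query != target_query and on_page_two:
            # Ranking on page 2 for a query we never targeted: write for it directly.
            actions.append(("follow_up", s.query))
    return actions

stats = [
    QueryStat("rag eval", 4.2, 300),
    QueryStat("rag eval golden question set", 14.0, 40),
]
print(plan_actions("rag eval", days_live=30, stats=stats))
```

Keeping the policy as plain code rather than a prompt means the thresholds can be tuned from the learning store without touching the generator.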

Serverless Integrations and Workflow Automation

The unsexy bread and butter. Companies pay well for these because they show up on the P&L immediately.

7. AWS + Zendesk SLA integration

A support org was missing its SLA on a specific ticket class because routing rules in Zendesk couldn't express the condition. I built an EventBridge-driven Lambda that consumed Zendesk webhooks, applied the routing logic in code, and wrote back via the Zendesk API. After that, the team hit SLA compliance on that class for the first time.

The architecture lesson: when SaaS rules engines hit their ceiling, a 200-line Lambda is usually cheaper than a custom workflow tool.
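
Roughly what "routing logic in code" looks like inside such a Lambda. Everything specific here is hypothetical: the event shape under "detail", the ticket fields, the group names, and the two-hour threshold. The write-back to Zendesk is left as a comment rather than guessed at.

```python
def route(ticket):
    """Express the condition the SaaS rules engine could not. Fields are hypothetical."""
    is_target_class = (
        ticket.get("category") == "billing_dispute"
        and ticket.get("org_tier") == "enterprise"
    )
    if not is_target_class:
        return None  # leave the built-in routing untouched
    if ticket.get("hours_since_created", 0) >= 2:
        return {"group": "escalations", "priority": "urgent"}
    return {"group": "billing_specialists", "priority": "high"}

def handler(event, context):
    """Lambda entry point; assumes the webhook payload arrives under event['detail']."""
    ticket = event.get("detail", {}).get("ticket", {})
    decision = route(ticket)
    if decision is None:
        return {"ticket_id": ticket.get("id"), "action": "none"}
    # A real deployment would send `decision` back through the Zendesk API here.
    return {"ticket_id": ticket.get("id"), "action": "update", "update": decision}
```

Keeping route() free of I/O is the part worth copying: the routing rule can be unit-tested against historical tickets without a Zendesk sandbox.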

8. Four-system automation ecosystem

Procurement, CRM, billing, and a project tool, all glued together with EventBridge + Lambda + a small Postgres state store. The point wasn't AI; it was deterministic plumbing with LLM-assisted classification at exactly two steps (vendor categorization and invoice line-item extraction). It saved ~73 hours per month for the ops team and returned 192% Year-1 ROI on the build cost.

The rule I follow: AI only where deterministic code fails. Everything else is plain code. This keeps the system debuggable at 3am.

9. Analytics migration with LLM-assisted SQL translation

A reporting stack was costing too much on a legacy warehouse. Migration would have taken months because of ~400 hand-written SQL reports. I built an LLM-assisted translator that converted dialect A to dialect B, with a test harness that ran both versions against a sample and diffed the results. Human review only on diffs. Annual savings landed in the $30k to $60k range.
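
The diff harness is the part that makes this safe, so here is a sketch of it. To keep the example runnable it uses one in-memory SQLite database for both sides; in a real migration the two connections would point at the legacy and the new warehouse. Table, column, and query text are invented.

```python
import sqlite3
from collections import Counter

def diff_results(rows_a, rows_b):
    """Compare two result sets as multisets, ignoring row order."""
    a, b = Counter(map(tuple, rows_a)), Counter(map(tuple, rows_b))
    return {
        "only_in_legacy": sorted((a - b).elements(), key=repr),
        "only_in_new": sorted((b - a).elements(), key=repr),
    }

def needs_review(legacy_conn, new_conn, legacy_sql, translated_sql):
    """Run both versions on the sample; flag for a human only if rows differ."""
    legacy_rows = legacy_conn.execute(legacy_sql).fetchall()
    new_rows = new_conn.execute(translated_sql).fetchall()
    d = diff_results(legacy_rows, new_rows)
    return bool(d["only_in_legacy"] or d["only_in_new"]), d

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount INTEGER);
    INSERT INTO orders VALUES ('eu', 10), ('eu', 5), ('us', 7);
""")
legacy = "SELECT region, SUM(amount) FROM orders GROUP BY region"
translated_ok = "SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY region DESC"
translated_bad = "SELECT region, MAX(amount) FROM orders GROUP BY region"

print(needs_review(conn, conn, legacy, translated_ok)[0])
print(needs_review(conn, conn, legacy, translated_bad)[0])
```

An exact row comparison like this will flag harmless differences such as float rounding between engines, so a real harness would likely add per-column tolerances.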

Agentic Workflows

Where the field is heading, and where most teams are burning money in 2026 by skipping the boring parts.

10. Customer onboarding agent

An agent that reads a signed contract PDF, extracts entitlements, provisions the tenant, configures SSO, invites users, and sends a personalized welcome. Tool use is constrained: the agent picks from 12 tools, each with strict JSON schemas and idempotency keys. State lives in Postgres, not in the agent's context.

Critical design choice: the agent never calls a write tool without a preceding read tool that confirms current state. This eliminates a whole class of "the agent did it twice" bugs.
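
Read-before-write plus idempotency keys, in miniature. The TenantStore class is an in-memory stand-in for the Postgres state store, and the tool names and key format are made up for the example.

```python
import hashlib

class TenantStore:
    """In-memory stand-in for the Postgres state store (hypothetical)."""
    def __init__(self):
        self.tenants = {}
        self.applied_keys = set()

def idempotency_key(operation, payload):
    # Same operation + same payload always yields the same key.
    raw = operation + "|" + "|".join(f"{k}={payload[k]}" for k in sorted(payload))
    return hashlib.sha256(raw.encode()).hexdigest()

def read_tenant(store, tenant_id):
    """Read tool: returns current state, or None if the tenant does not exist."""
    return store.tenants.get(tenant_id)

def provision_tenant(store, tenant_id, plan):
    """Write tool: confirms current state first, then writes at most once."""
    key = idempotency_key("provision_tenant", {"tenant_id": tenant_id, "plan": plan})
    if read_tenant(store, tenant_id) is not None or key in store.applied_keys:
        return "skipped"
    store.tenants[tenant_id] = {"plan": plan}
    store.applied_keys.add(key)
    return "created"

store = TenantStore()
first = provision_tenant(store, "acme", "enterprise")
second = provision_tenant(store, "acme", "enterprise")  # agent retries the same step
print(first, second)
```

In Postgres the same guarantee would normally come from a unique constraint on the key, so the check and the write cannot race.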

11. Database query agent with row-level guardrails

I wrote about this approach before. The short version: never give the agent raw SQL access. It calls a query function that takes a parameterized intent ("revenue by region, last 30 days"), and the function builds the SQL server-side with the user's row-level permissions baked in. The agent sees results, never the schema and never a raw connection.

Stack: Claude with tool use, Postgres with row-level security, a TypeScript query layer that enforces tenant isolation.
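
A sketch of the query function, written in Python here for consistency with the other examples even though the layer described above is TypeScript. The table and column names, the allow-lists, and the psycopg-style named placeholders are all assumptions. Only allow-listed fragments are ever interpolated; values go in as parameters.

```python
# Allow-lists: the agent can only name keys, never SQL. All names are hypothetical.
METRICS = {"revenue": "SUM(amount)", "orders": "COUNT(*)"}
DIMENSIONS = {"region": "region", "plan": "plan"}

def build_query(intent, tenant_id):
    """Turn a parameterized intent into (sql, params) for a Postgres driver."""
    metric = METRICS[intent["metric"]]           # KeyError on anything not allow-listed
    dimension = DIMENSIONS[intent["group_by"]]
    days = int(intent["last_days"])
    if not 1 <= days <= 365:
        raise ValueError("last_days out of range")
    sql = (
        f"SELECT {dimension}, {metric} AS value FROM orders "
        "WHERE tenant_id = %(tenant_id)s "
        "AND created_at >= now() - make_interval(days => %(days)s) "
        f"GROUP BY {dimension}"
    )
    # tenant_id comes from the authenticated session, never from the agent.
    return sql, {"tenant_id": tenant_id, "days": days}

sql, params = build_query(
    {"metric": "revenue", "group_by": "region", "last_days": 30}, tenant_id="t_1"
)
```

The explicit tenant filter and Postgres row-level security overlap on purpose. If one layer is misconfigured, the other still holds.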

12. Code review agent for a monorepo

Runs on every PR. It reads the diff, pulls related files into context, runs a checklist (security, test coverage on changed lines, public API changes, migration safety), and posts inline comments. The eval set is ~150 historical PRs with known issues. We re-run the eval on every prompt change. If precision on a category drops, the change doesn't ship.
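
The "doesn't ship" rule is simple to automate. A minimal sketch, assuming each eval result is a (category, flagged_by_agent, is_real_issue) tuple; that tuple shape and the zero tolerance default are illustrative choices.

```python
def precision_by_category(results):
    """results: iterable of (category, flagged_by_agent, is_real_issue)."""
    tp, fp = {}, {}
    for category, flagged, real in results:
        if not flagged:
            continue  # precision only looks at what the agent flagged
        bucket = tp if real else fp
        bucket[category] = bucket.get(category, 0) + 1
    categories = set(tp) | set(fp)
    return {c: tp.get(c, 0) / (tp.get(c, 0) + fp.get(c, 0)) for c in categories}

def regressions(baseline, candidate, tolerance=0.0):
    """Categories where the candidate prompt's precision fell below baseline."""
    return sorted(
        c for c, p in baseline.items() if candidate.get(c, 0.0) < p - tolerance
    )

baseline = {"security": 0.90, "migration": 0.80}
candidate = {"security": 0.92, "migration": 0.75}
blocked = regressions(baseline, candidate)
print(blocked)  # a non-empty list blocks the prompt change
```

With ~150 PRs, a single misjudged comment moves a category's precision noticeably, so a small tolerance may be worth setting per category.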

Local LLMs and Privacy-Constrained Builds

13. On-prem RAG for a regulated workload

Data could not leave the customer's network. We ran Ollama with a quantized 70B model on two GPU nodes, embeddings via a local model, pgvector for storage. Latency was worse than a frontier API, but accuracy on the domain was within 6 points of Claude on our eval set after some prompt tuning and a domain-specific reranker.

When local LLMs make sense: data residency is non-negotiable, query volume is high enough to amortize the GPU cost, and the domain is narrow enough that a smaller model can compete.

14. Voice transcription + structured extraction pipeline

Sales calls in, structured CRM updates out. Whisper for transcription (self-hosted), then a two-pass LLM extraction: pass one extracts entities and intents into JSON, pass two validates the JSON against a schema and re-prompts if invalid. Failures get queued for human review. About 94% of calls flow through without a human touching them.
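
The validate-and-re-prompt pass can be sketched without any model dependency by treating the LLM as a plain callable. The required fields, the prompt wording, and the two-attempt limit are assumptions for the example, and the hand-rolled validation stands in for a real JSON Schema check.

```python
import json

# Hypothetical CRM schema: field name -> accepted Python type(s).
REQUIRED = {"account": str, "next_step": str, "deal_size_usd": (int, float)}

def validate(raw):
    """Return (record, errors). Errors are plain strings to feed back to the model."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]
    errors = [
        f"missing or wrong type: {key}"
        for key, expected in REQUIRED.items()
        if not isinstance(data.get(key), expected)
    ]
    return (data if not errors else None), errors

def extract(transcript, llm, max_attempts=2):
    """llm is any callable taking a prompt string and returning a string."""
    prompt = f"Extract CRM fields as JSON from this call:\n{transcript}"
    for _ in range(max_attempts):
        record, errors = validate(llm(prompt))
        if record is not None:
            return {"status": "ok", "record": record}
        # Re-prompt with the specific validation errors appended.
        prompt += "\nYour previous output was rejected: " + "; ".join(errors)
    return {"status": "human_review", "errors": errors}

# Fake model for the demo: first reply is incomplete, second is valid.
replies = iter([
    '{"account": "Acme"}',
    '{"account": "Acme", "next_step": "demo", "deal_size_usd": 12000}',
])
result = extract("(transcript text)", lambda prompt: next(replies))
```

Capping the attempts matters as much as the validation. Anything still invalid after the retry goes to the human queue instead of looping.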

15. Internal coding assistant with repo-aware context

Not Cursor, an internal version for a team that couldn't send code to a third party. Indexed the repo with tree-sitter for symbol-level chunks, served retrieval over gRPC, and wired it into VS Code via a custom extension. The win wasn't generation quality, it was the symbol-aware retrieval. Asking "where do we handle webhook retries" returned the exact functions, not a vibes-based search.
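
To show what a symbol-level chunk is, here is a sketch that uses Python's built-in ast module as a stand-in for tree-sitter. That swap is mine: it only handles Python source, whereas tree-sitter covers many languages. The chunk fields are illustrative.

```python
import ast

def symbol_chunks(source, path):
    """One chunk per function or class, with its name and line span as metadata."""
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "path": path,
                "symbol": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "code": ast.get_source_segment(source, node),
            })
    return chunks

sample = (
    "def retry_webhook(event, attempts=3):\n"
    "    return attempts\n"
    "\n"
    "class Dispatcher:\n"
    "    def send(self):\n"
    "        pass\n"
)
chunks = symbol_chunks(sample, "webhooks/retry.py")  # hypothetical path
for c in chunks:
    print(c["symbol"], c["start_line"], c["end_line"])
```

Each chunk is a whole function or class, so a question about webhook retries can match on the symbol name as well as on the embedded body.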

Patterns That Show Up in Every Build

After enough of these, the same five patterns repeat:

| Pattern | Why it matters |
| --- | --- |
| Eval set before code | Without it, you cannot tell if a change improved anything |
| Structured outputs | JSON schemas + validation eliminate 80% of "the model went off the rails" bugs |
| Tool use over freeform | Constrained tools with strict schemas beat clever prompting |
| State outside the model | Postgres is the agent's memory, the context window is scratch space |
| Human in the loop on the 5% | Design the path for the cases the system shouldn't decide alone |

The other thing that's consistent: the LLM is rarely the bottleneck. Retrieval, data quality, tool design, and eval rigor decide whether the system ships. I've replaced model providers mid-project more than once with almost no quality change, because the rest of the system was doing the real work.

What I'd Do If I Were Starting Today

If you're a CTO or head of engineering planning a real AI build in 2026, here's the order I'd run it in:

  1. Pick one workflow with measurable pain. Hours saved per month, tickets per day, conversion lift, something concrete. Not "AI strategy."
  2. Write the eval set first. 50 to 200 examples with expected outputs. This is your spec. Without it, every prompt change is a guess.
  3. Start with the strongest API model. Prove the workflow works. Optimize cost only after it ships.
  4. Constrain everything. Structured outputs, tool schemas, allow-listed actions. The model should have a small surface area.
  5. Put state in a database. Never in the context window. Idempotency keys on every write.
  6. Ship to a small group first. 10 users, two weeks, real feedback. Then scale.
  7. Instrument from day one. Log every prompt, every tool call, every retrieval result. You'll need this.

The teams that get burned are the ones that skip the eval set and go straight to "let's make it autonomous." Autonomy without measurement is a Slack thread full of regret.

If you're thinking about a build like one of these, or you've got a PoC that needs to survive contact with production, I'm at lazar-milicevic.com/#contact. More long-form pieces on RAG eval, agent design, and serverless AI architecture on the blog.

Frequently asked questions

What is the best architecture for a support knowledge base RAG system over Zendesk and Confluence?

For support knowledge RAG, I use a hybrid search approach: pgvector for dense embeddings, Postgres full-text search for sparse retrieval, and Reciprocal Rank Fusion to merge results, followed by a small cross-encoder reranker. The full stack is Postgres with pgvector, OpenAI embeddings, Claude for generation, FastAPI as the service layer, and Redis for caching. The LLM is rarely the bottleneck; retrieval quality is. What earns leadership trust is an eval harness with around 200 golden questions you can run on every change.

How do I prevent hallucinations in an internal documentation Q&A system?

I force citations on every answer. The model must render inline markers like [1][2] that link back to the exact source paragraph, and if it can't cite, it doesn't answer. I enforce this with a structured output schema and a post-generation check that rejects any response where a claim doesn't have a matching span in the retrieved context. This single constraint does more for trust than any prompt engineering trick.

How do I chunk legal contracts for RAG without losing clauses that span pages?

Generic 512-token chunking destroys the only signal that matters in contracts, which is clause structure. I use clause-aware chunking that splits on clause headings while keeping parent context, stores the clause type as metadata, and retrieves by metadata filter first and semantic similarity second. When your domain has structure, use the structure. This lets queries like 'find every MSA where indemnity is capped below 2x fees' actually work.

How do I do programmatic SEO with AI without getting deindexed by Google?

The mistake everyone makes is using one template and swapping nouns across thousands of pages, which Google deindexes as thin content. The fix is to pull unique entity data per page, things like real local statistics, summarized real reviews, and FAQs mined from internal search logs, then constrain the LLM to only use that data. The model writes the prose, but the differentiator is the data itself. Without unique underlying data, no amount of AI rewriting will save the pages.

When should I use AI versus deterministic code in a workflow automation system?

My rule is simple: use AI only where deterministic code fails, and use plain code for everything else. In a typical four-system integration across procurement, CRM, billing, and project tools, I'll glue it together with EventBridge, Lambda, and a small Postgres state store, and only invoke an LLM at the two steps where rules break down, like vendor categorization or invoice line-item extraction. This keeps the system debuggable at 3am and avoids the cost and unpredictability of AI calls in the hot path. One ecosystem I built this way saved roughly 73 hours per month and hit 192% Year-1 ROI.

Lazar Milićević

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.
