
RAG in Production: A Senior Engineer's Architecture Guide

By Lazar Milicevic · July 5, 2026 · 10 min read


I shipped my first RAG pipeline three years ago and it was humbling. The demo worked beautifully on five documents. At five hundred, it returned garbage. Not because the math was wrong, but because I had built for retrieval and forgotten about ranking, chunk boundaries, and the hundred small decisions that separate a notebook from a system people rely on. This is the guide I wish someone had handed me then.

Retrieval-Augmented Generation, or RAG, is the architecture pattern where you search a knowledge base for relevant context, feed that context into an LLM's prompt, and let the model generate an answer grounded in your data instead of its training memory. It sounds simple. It is not simple. The engineering is in the joints.

Why RAG Beats Fine-Tuning for Knowledge Tasks

Fine-tuning teaches a model new behavior or style. RAG gives a model new facts at inference time. These solve different problems, and confusing them is the most expensive mistake I see teams make.

If your problem is "the model does not know our internal documentation," fine-tuning is the wrong tool. You are trying to encode facts into weights, which is slow, expensive, and brittle. Every document update means retraining or accepting staleness. Worse, fine-tuned models hallucinate with confidence because they cannot distinguish between what they learned in training and what you fed them last week.

RAG keeps your knowledge external. You update a database, the next query sees the new data. No retraining. The model grounds its answer in retrieved passages, and you can show the user exactly which sources were used. That traceability matters enormously in regulated industries, enterprise search, and any domain where "the AI said so" is not an acceptable answer.

The trade-off: RAG adds latency and complexity. You now own a retrieval pipeline, and its quality determines everything downstream. A great LLM with a mediocre retriever produces confident nonsense. I have seen this kill more projects than model choice ever did.

The Core Pipeline: Four Stages That Matter

Every production RAG system I have built breaks down into the same four stages. Get each one right and the system works. Cut corners on any one and the whole thing degrades.

Ingestion. You load source documents, split them into chunks, generate embeddings for each chunk, and store them in a vector database with metadata. This is where most teams optimize prematurely by fiddling with embedding models when their real problem is that their chunker split a table in half.

Retrieval. At query time, you embed the user's question, search the vector store for the nearest neighbors, and return the top candidates. This is your candidate pool, not your final context.

Reranking. You take those candidates and run them through a cross-encoder or a lightweight reranker model that scores query-document relevance more carefully than embedding similarity. This stage is the single highest-ROI improvement I have made to RAG pipelines. More on this below.

Generation. You assemble the reranked passages into a prompt with instructions to answer only from the provided context, send it to the LLM, and return the answer with citations.

Here is the skeleton of a generation prompt I use in production:

You are a technical assistant. Answer the user's question
using ONLY the provided context passages. If the context
does not contain the answer, say you do not know.

Do not speculate. Do not use prior knowledge.

Cite sources as [1], [2], etc. corresponding to passage IDs.

Context:
[1] {passage_text}
[2] {passage_text}
...

Question: {user_question}

That "say you do not know" line is doing more work than it looks like. Without it, the model will stretch thin context to produce a plausible-sounding wrong answer. With it, you get honest refusals, and honest refusals build trust.
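One way to assemble that prompt in code is sketched below. This is a minimal illustration, not the production implementation: the `Passage` class, its `source` field, and the helper names `build_prompt` and `citation_map` are all hypothetical. The point it shows is that passages are numbered from 1 at assembly time, so the [1], [2] citations in the answer can be mapped back to sources.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    source: str  # hypothetical field: where the passage came from


# Instruction text copied from the prompt skeleton above.
INSTRUCTIONS = (
    "You are a technical assistant. Answer the user's question\n"
    "using ONLY the provided context passages. If the context\n"
    "does not contain the answer, say you do not know.\n\n"
    "Do not speculate. Do not use prior knowledge.\n\n"
    "Cite sources as [1], [2], etc. corresponding to passage IDs."
)


def build_prompt(question: str, passages: list[Passage]) -> str:
    """Number passages from 1 so citations in the answer map back to them."""
    context = "\n".join(
        f"[{i}] {p.text}" for i, p in enumerate(passages, start=1)
    )
    return f"{INSTRUCTIONS}\n\nContext:\n{context}\n\nQuestion: {question}"


def citation_map(passages: list[Passage]) -> dict[int, str]:
    """Lookup from citation number to source, for rendering references."""
    return {i: p.source for i, p in enumerate(passages, start=1)}
```

Keeping the numbering in one place means the prompt and the reference list shown to the user cannot drift apart.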

Chunking Strategy: The Decision Everything Else Depends On

Chunking is the most under-discussed part of RAG. Teams spend days comparing embedding models and zero minutes thinking about chunk boundaries. I have fixed more broken RAG pipelines by changing the chunking strategy than by changing any other component.

The problem is simple. You retrieve at the chunk level. If a chunk is too small, the passage lacks context. If it is too large, the embedding captures a blurry average of too many topics and similarity scores drop. You also pay in tokens because large chunks fill your context window fast.

My defaults for production, after a lot of iteration:

Content Type	Chunk Size	Overlap	Strategy
Markdown docs	500-800 tokens	100 tokens	Split on headings, then by size
Code	By function/class	0	Never split mid-function
PDFs with tables	Extract tables separately	N/A	Store as structured text
Long-form articles	512 tokens	64 tokens	Sentence-boundary aware
Q&A / FAQ	One item per chunk	0	Natural boundaries

The overlap matters more than people expect. Without overlap, you get chunks that cut off mid-sentence, and the embedding for a half-sentence is meaningfully worse than one for a complete thought. For prose content, I use 10 to 20 percent overlap as a floor. Content with natural boundaries, such as functions or FAQ items, does not need it.
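A sentence-boundary-aware chunker with overlap can be sketched as follows. This is a rough illustration under stated assumptions: token counts are approximated by whitespace-separated words rather than a real tokenizer, the sentence splitter is a naive regex, and the function names are made up for the example. The defaults mirror the long-form article row of the table (512 tokens, 64 overlap).

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text: str, max_tokens: int = 512, overlap_tokens: int = 64) -> list[str]:
    """Pack whole sentences into chunks and repeat trailing sentences as overlap.

    Tokens are approximated by whitespace-separated words (assumption).
    A single sentence longer than max_tokens becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward while they fit the overlap budget.
            carried: list[str] = []
            carried_len = 0
            for prev in reversed(current):
                m = len(prev.split())
                if carried_len + m > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_len += m
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because overlap is carried in whole sentences, no chunk starts or ends mid-sentence, which is the failure the paragraph above describes.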

One technique that has saved me repeatedly: store parent-document IDs on each chunk. When you retrieve a small, focused chunk, you can pull its parent document (or a larger window around it) into the generation context. This is called small-to-big retrieval, and it gives you the best of both worlds. Precise retrieval on small chunks, rich context for generation.
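Small-to-big retrieval can be illustrated with a few lines. This is a minimal sketch: the `Chunk` fields (`parent_id`, `position`) and the `expand_to_window` helper are hypothetical names, and a plain list stands in for whatever store holds the chunks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    parent_id: str  # ID of the document this chunk was cut from
    position: int   # index of the chunk within its parent document
    text: str


def expand_to_window(hit: Chunk, all_chunks: list[Chunk], radius: int = 1) -> str:
    """Return the hit plus up to `radius` neighbouring chunks on each side,
    taken from the same parent document and kept in document order."""
    siblings = sorted(
        (c for c in all_chunks if c.parent_id == hit.parent_id),
        key=lambda c: c.position,
    )
    window = [c for c in siblings if abs(c.position - hit.position) <= radius]
    return "\n".join(c.text for c in window)
```

If the chunks were created with overlap, neighbouring chunks repeat some text, so a real implementation would probably trim the duplicated sentences or fetch the parent document directly.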

Vector Search Is Not Enough: Add Hybrid Search and Reranking

Pure vector search fails on three things that come up constantly in real applications: exact matches (product codes, error numbers, names), rare terms, and queries that mix keyword and semantic intent.

The fix is hybrid search. You run both a vector similarity search and a keyword search (BM25 or full-text), then merge the results using Reciprocal Rank Fusion (RRF). RRF is beautifully simple: for each document, you take its rank position in each result set, compute 1 / (k + rank) for each, sum those values into one score, and sort by that score. No training data needed, no tuning. I use k = 60, which is the standard default from the original paper.
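The fusion step described above fits in a few lines. In this sketch the function name and the use of plain document-ID strings are assumptions; ranks start at 1, and ties are broken by document ID so the output is deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking.

    Each document scores the sum of 1 / (k + rank) over the lists it appears in.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by doc ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["a", "b", "c"]    # IDs ordered by vector similarity
keyword_hits = ["b", "d", "a"]   # IDs ordered by BM25 / full-text score
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

A document that appears in both lists, like "a" and "b" here, outranks one that appears high in only a single list, which is what lets exact keyword matches surface alongside semantic ones.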

In PostgreSQL with pgvector, this looks like running a vector query and a tsvector full-text query in the same request, then fusing the ranks in application code or a SQL function. Supabase makes this particularly clean because you get both capabilities in one database.

Then comes reranking. Vector search and BM25 are both fast but shallow. A cross-encoder reranker takes the query and each candidate document together, runs them through a transformer, and produces a relevance score that is dramatically more accurate. The cost is speed: cross-encoders are slow compared to vector search.

The pattern that works: retrieve 50 candidates with hybrid search, rerank them down to the top 5 to 10, and feed those to the LLM. You get the speed of approximate search for candidate generation and the precision of a cross-encoder for final selection.
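The retrieve-then-rerank shape looks roughly like this. The sketch keeps the scorer pluggable: `score_fn` is a hypothetical callable that, in a real pipeline, would wrap a cross-encoder model. The `token_overlap_score` function below is only a toy stand-in so the example runs without a model; it is not a cross-encoder.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score every (query, candidate) pair and keep the top_n candidates."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    # Stable sort: candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def token_overlap_score(query: str, doc: str) -> float:
    """Toy scorer (assumption): fraction of query words found in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


candidates = [
    "how to reset a password",
    "refund policy for pro plans",
    "office opening hours",
]
top = rerank("refund policy", candidates, token_overlap_score, top_n=1)
```

Because the scorer sees the query and the document together, swapping the toy function for a real model changes nothing else in the pipeline.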

In my own pipelines, adding a reranker typically improved answer quality more than switching from a decent embedding model to a state-of-the-art one. It is the cheapest win in RAG, and I see teams skip it because they do not know it exists.

What Breaks When You Scale: Production Pitfalls

Moving from a demo to a production RAG system surfaces problems that notebooks never reveal. Here are the ones I have hit repeatedly.

Stale embeddings after document updates. You update a source document but forget to regenerate its embeddings. The vector store now returns old chunks for a document that says something different. This is a silent failure. The system returns an answer, the answer is wrong, and nobody notices until a user complains. Build idempotent ingestion with content hashing. If the hash has not changed, skip. If it has, delete old chunks and re-embed.
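A minimal sketch of hash-based idempotent ingestion follows. The `InMemoryIndex` class is a hypothetical stand-in for a vector store, and the `chunker` argument is any function that splits text into chunks; the embedding call is left as a comment because it depends on the model in use.

```python
import hashlib
from typing import Callable


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class InMemoryIndex:
    """Hypothetical stand-in for a vector store keyed by document ID."""

    def __init__(self) -> None:
        self.hashes: dict[str, str] = {}
        self.chunks: dict[str, list[str]] = {}

    def ingest(self, doc_id: str, text: str, chunker: Callable[[str], list[str]]) -> str:
        digest = content_hash(text)
        if self.hashes.get(doc_id) == digest:
            return "skipped"  # content unchanged: nothing to re-embed
        status = "updated" if doc_id in self.hashes else "created"
        self.chunks.pop(doc_id, None)        # delete stale chunks first
        self.chunks[doc_id] = chunker(text)  # a real system would embed here
        self.hashes[doc_id] = digest
        return status
```

Running the same ingestion job twice is then a no-op, and an edited document replaces its old chunks instead of sitting next to them.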

Metadata filtering that destroys recall. You add a metadata filter (for example, "only search documents from department X") and suddenly recall drops because the vector index was not designed for heavy pre-filtering. Some vector databases handle this well, others degrade badly. Test with realistic filter selectivity before shipping.

Context window bloat from greedy retrieval. You retrieve 20 chunks "just to be safe" and now your prompt is 8,000 tokens of context, most of it irrelevant. The LLM's attention spreads thin, answer quality drops, and your API bill spikes. Fewer, better-ranked chunks beat more chunks every time. I default to 5 reranked passages for most use cases.

No evaluation loop. This is the biggest one. Without evaluation, you are changing chunk sizes and embedding models based on vibes. You need a set of test questions with expected answers or relevant passages, and you need to run retrieval metrics (recall@k, precision@k) and generation metrics (faithfulness, answer relevance) on every change. I build a small eval set of 50 to 100 question-answer pairs before touching anything in production. If you cannot measure it, you cannot improve it.
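The two retrieval metrics named above are simple to compute once each test question has a set of relevant passage IDs. A minimal sketch, with hypothetical function names and passage IDs as plain strings:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant passages that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over an eval set of (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)
```

Tracking the mean over the whole eval set before and after each change turns "this chunk size feels better" into a number you can compare.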

Ignoring query complexity. A single-hop question ("What is the refund policy?") and a multi-hop question ("Compare the refund policies of our Pro and Enterprise plans and tell me which is more flexible?") need different retrieval strategies. For multi-hop questions, consider query decomposition: ask the LLM to break the question into sub-questions, retrieve for each, and merge. This adds a round-trip but dramatically improves answers on complex queries.
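Query decomposition can be sketched with the LLM call and the retriever injected as functions. Both `decompose` and `retrieve` are hypothetical callables here: the first would wrap a prompt that asks the model for sub-questions, the second would wrap the retrieval pipeline and return passage IDs in rank order.

```python
from typing import Callable


def decompose_and_retrieve(
    question: str,
    decompose: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
    per_query: int = 3,
) -> list[str]:
    """Retrieve for each sub-question and merge results without duplicates."""
    # Fall back to the original question if decomposition returns nothing.
    sub_questions = decompose(question) or [question]
    merged: list[str] = []
    seen: set[str] = set()
    for sub_question in sub_questions:
        for passage_id in retrieve(sub_question)[:per_query]:
            if passage_id not in seen:
                seen.add(passage_id)
                merged.append(passage_id)
    return merged
```

Capping results per sub-question keeps the merged context small, which matters given the context-bloat problem described earlier.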

A Note on Cost and Latency

A RAG pipeline is a chain of calls: embedding the query (fast, cheap), vector search (milliseconds), optional keyword search (milliseconds), reranking (100 to 500ms for 50 candidates), and LLM generation (1 to 5 seconds depending on context length and model). Your total latency is typically dominated by the LLM call.

Cost depends mostly on the generation step. A query with 5 passages of roughly 400 tokens each plus a 200-token question and a 300-token answer is about 2,500 tokens per query, before any instruction text in the prompt. With Claude or GPT-4-class models, that adds up fast at scale. For high-volume systems, I route simple queries to smaller, faster models and reserve the expensive models for complex queries. A lightweight classifier or even a prompt-based complexity check can make that routing decision.
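The token arithmetic is worth writing down explicitly. Summing the components listed above (5 × 400 context tokens, a 200-token question, a 300-token answer) gives 2,500 tokens, with any instruction text on top. In the sketch below the function names are hypothetical and the per-million-token prices are placeholder parameters, not real vendor pricing.

```python
def tokens_per_query(
    passages: int = 5,
    tokens_per_passage: int = 400,
    question_tokens: int = 200,
    answer_tokens: int = 300,
    instruction_tokens: int = 0,
) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) for one query."""
    input_tokens = passages * tokens_per_passage + question_tokens + instruction_tokens
    return input_tokens, answer_tokens


def monthly_cost(
    queries_per_month: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,   # placeholder: check your provider's pricing
    output_price_per_million: float,  # placeholder: check your provider's pricing
) -> float:
    """Estimated monthly generation cost in the same currency as the prices."""
    per_query = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return queries_per_month * per_query


input_tokens, output_tokens = tokens_per_query()
total_tokens = input_tokens + output_tokens  # 2,200 in + 300 out = 2,500
```

Splitting input from output tokens matters because most providers price them differently, and RAG prompts are heavily weighted toward input.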

For local and privacy-sensitive workloads, I run Ollama with a local embedding model and a quantized LLM. The architecture does not change. The tradeoffs shift toward latency and hardware cost, but the pipeline patterns are identical.

What I Would Do Differently Starting Today

If I were building a new RAG system from scratch tomorrow, here is the order I would tackle things.

First, I would spend most of my early time on data quality and chunking, not on model selection. Clean, well-structured source documents with sensible chunk boundaries beat any embedding model upgrade. I would also build the evaluation set before writing any retrieval code. Fifty hand-written questions with expected passages is worth more than any benchmark comparison.

Second, I would start with hybrid search from day one. It is not harder to build than pure vector search, and it prevents the exact-match failure mode that makes demos look bad in front of stakeholders.

Third, I would add a reranker as soon as I have retrieval working. Cross-encoder rerankers are cheap to run, easy to add as a stage, and consistently the largest quality jump I see in practice.

Fourth, I would instrument everything. Log the query, the retrieved passages, the reranker scores, the final prompt, and the answer. When something goes wrong in production (and it will), you need the full trace to diagnose it. I treat each RAG query as a mini pipeline with observability at every stage.
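A per-query trace record covering those fields might look like this. It is a minimal sketch: the `RagTrace` class and its field names are hypothetical, and serializing to a JSON line is just one convenient choice for shipping traces to whatever log store you use.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class RagTrace:
    """One record per query, filled in as each pipeline stage completes."""

    query: str
    retrieved_ids: list[str] = field(default_factory=list)
    reranker_scores: dict[str, float] = field(default_factory=dict)
    prompt: str = ""
    answer: str = ""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)

    def to_json(self) -> str:
        # One JSON object per line is easy to append to a log and grep later.
        return json.dumps(asdict(self), sort_keys=True)


trace = RagTrace(query="What is the refund policy?")
trace.retrieved_ids = ["policy-3", "faq-12"]
trace.reranker_scores = {"policy-3": 0.91, "faq-12": 0.42}
trace.answer = "Refunds are available within 30 days [1]."
line = trace.to_json()
```

With the retrieved IDs and reranker scores stored next to the answer, a bad response can be traced to the stage that caused it rather than blamed on the model by default.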

Fifth, I would accept that RAG is never "done." Documents change, users ask unexpected questions, and models get updated. The system needs continuous evaluation and a feedback loop. I build a simple "thumbs up / thumbs down" or "was this helpful?" signal into every RAG interface I ship. That signal feeds back into the eval set and drives iterative improvement.

RAG is one of the few AI patterns that delivers reliable value in production today. It is not glamorous. It is plumbing. But good plumbing is what makes a building habitable, and good RAG architecture is what makes an LLM useful for real work. If you are heading into a RAG project and want to talk through architecture, chunking strategy, or evaluation design, reach out at lazar-milicevic.com/#contact. I am always glad to compare notes with engineers building the same things.

Frequently asked questions

Should I use RAG or fine-tuning when my LLM doesn't know my company's internal documentation?

Use RAG, not fine-tuning. Fine-tuning encodes facts into model weights, which is slow, expensive, and brittle: every document update requires retraining. RAG keeps your knowledge external in a database, so updates are available at the next query with zero retraining. Fine-tuned models also hallucinate with confidence because they can't distinguish between training-time knowledge and new information, whereas RAG grounds answers in retrieved passages with full source traceability.

What are the main stages of a production RAG pipeline?

A production RAG pipeline has four stages: ingestion, retrieval, reranking, and generation. Ingestion involves loading documents, chunking them, generating embeddings, and storing them with metadata in a vector database. Retrieval searches the vector store for candidate passages, reranking applies a cross-encoder to score query-document relevance more carefully, and generation assembles the reranked passages into a prompt for the LLM. Each stage must work well: a great LLM paired with a mediocre retriever will still produce confident nonsense.

How should I chunk documents for a RAG system?

Chunking is the most impactful yet under-appreciated decision in RAG architecture. For markdown docs, I use 500 to 800 token chunks split on headings with 100 tokens of overlap; for code, I chunk by function or class and never split mid-function; for PDFs with tables, I extract tables separately as structured text. For prose, I use 10 to 20% overlap as a minimum to avoid cutting off mid-sentence, since a half-sentence produces a meaningfully worse embedding than a complete thought. I also recommend storing parent-document IDs on each chunk so you can use small-to-big retrieval: precise matching on small chunks, rich context from parent documents during generation.

Why does my RAG pipeline return bad results even with a good embedding model?

The most common cause is neglecting reranking and hybrid search. Pure vector search fails on exact matches like product codes or error numbers, rare terms, and queries that mix keyword and semantic intent. Adding a cross-encoder reranker is the single highest-ROI improvement I've made to RAG pipelines, because it scores query-document relevance far more accurately than embedding similarity alone. Hybrid search, combining vector similarity with keyword search, handles the cases where pure semantic matching breaks down.

How do I stop my RAG system from hallucinating answers?

The most effective technique is adding an explicit instruction in your generation prompt telling the model to say it does not know if the context doesn't contain the answer. In my production prompt, I include the lines 'If the context does not contain the answer, say you do not know. Do not speculate. Do not use prior knowledge.' Without this instruction, the model will stretch thin context to produce plausible-sounding wrong answers. With it, you get honest refusals, and honest refusals build user trust.

Lazar Milićević

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.
