How to Create an AI Agent: A Production Walkthrough

The first agent I shipped to production failed at 3am on a Sunday. It looped on a tool call, burned through $40 in tokens before my budget alarm fired, and left a half-written draft in the database with no way to resume. That night taught me more about agent design than any framework tutorial. Since then I have built a pattern I trust enough to leave running unattended for weeks at BizFlowAI, where agents research, write, optimize and publish content without me touching them.
This is that pattern, stripped down to what actually matters.
Start with the job spec, not the framework
Before you pick LangGraph, CrewAI, or roll your own, write the agent's job spec like you would for a junior engineer. One paragraph. What it owns, what it must never do, what "done" looks like, and which signals tell you it failed.
Here is the spec for one of my production agents:
The Topic Researcher owns generating a ranked list of 20 content topics per site per week. It reads from keyword_pool and search_console_perf, writes to topic_queue. It must never publish, never call paid APIs more than 8 times per run, and must finish in under 6 minutes. Done = 20 topics with score >= 0.6 and zero duplicates against the last 90 days. Failure signal = empty queue after a run, or any topic flagged by the dedupe check.
If you cannot write this paragraph, do not build the agent. You will end up with a "do everything" prompt that hallucinates its way through ambiguous tasks. The job spec becomes your evaluation rubric later, so write it carefully.
Rule of thumb I use: if the spec needs more than 5 tools or more than 3 decision branches, it is two agents, not one.
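One way to keep a spec like this honest is to mirror its numbers in code, so "done" is something a test can check. The sketch below is illustrative only: the JobSpec class, its field names, and the title-based duplicate check are assumptions, not the production implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSpec:
    """Machine-checkable mirror of a one-paragraph job spec (hypothetical)."""
    name: str
    required_topics: int = 20
    min_score: float = 0.6
    max_paid_api_calls: int = 8
    max_seconds: int = 360  # "under 6 minutes"

    def is_done(self, topics: list[dict], recent_titles: set[str]) -> bool:
        """Done = enough topics, all above min_score, zero duplicates."""
        titles = [t["title"].strip().lower() for t in topics]
        return (
            len(topics) >= self.required_topics
            and all(t["score"] >= self.min_score for t in topics)
            and len(set(titles)) == len(titles)      # no duplicates within the run
            and not (set(titles) & recent_titles)    # none against recent history
        )


spec = JobSpec(name="topic_researcher")
topics = [{"title": f"Topic {i}", "score": 0.7} for i in range(20)]
print(spec.is_done(topics, recent_titles=set()))        # True
print(spec.is_done(topics, recent_titles={"topic 3"}))  # False
```

The same object can later feed the evaluation rubric, which keeps the spec and the eval from drifting apart.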
Design the tools before you write the prompt
Most agent failures I have debugged were not prompt failures. They were tool failures. The model called a tool with wrong arguments, the tool returned a 4MB JSON blob, or two tools had overlapping responsibilities and the model picked the wrong one.
Treat tools like a public API you are shipping to a difficult customer. The customer is the LLM.
Three rules I follow:
- Each tool does one thing and returns a small, structured result. If search_database returns 200 rows, the model will choke or pick poorly. Return 10 with a has_more flag and a next_cursor.
- Tool names and parameter names are the prompt. A tool called fetch_recent_topics(days: int, min_score: float) is self-explanatory. A tool called get_data(query: str) is a coin flip.
- Every tool has idempotency keys and a dry-run mode. When the agent retries after a timeout, you do not want duplicate publishes.
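The first rule can be sketched as a paginated tool. This is a minimal illustration, assuming an offset-style cursor and an in-memory list standing in for the real query; the Page type and the search_database signature shown here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    rows: list[dict]
    has_more: bool
    next_cursor: Optional[int]  # offset to pass back in; None when exhausted


def search_database(all_rows: list[dict], cursor: int = 0, page_size: int = 10) -> Page:
    """Return one small page instead of the full result set."""
    rows = all_rows[cursor:cursor + page_size]
    end = cursor + len(rows)
    has_more = end < len(all_rows)
    return Page(rows=rows, has_more=has_more, next_cursor=end if has_more else None)


matches = [{"id": i} for i in range(25)]  # stand-in for a real query result
first = search_database(matches)
print(len(first.rows), first.has_more, first.next_cursor)  # 10 True 10
last = search_database(matches, cursor=20)
print(len(last.rows), last.has_more, last.next_cursor)     # 5 False None
```

The model only asks for the next page when has_more is true, so a broad query costs one small result instead of a 200-row blob.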
Here is the actual signature I use for a publishing tool:
from __future__ import annotations  # PublishResult is defined elsewhere; keep the annotation lazy

from datetime import datetime


def publish_post(
    site_id: str,
    draft_id: str,
    idempotency_key: str,  # hash of draft_id + content_hash
    scheduled_at: datetime | None = None,
    dry_run: bool = False,
) -> PublishResult:
    """Publishes a draft to the target CMS.

    Returns PublishResult with url, published_at, and cms_post_id.
    If idempotency_key was used in the last 24h, returns the original result.
    """
The idempotency key has saved me at least four times. EventBridge retries, Lambda cold-start timeouts, network blips: all of them caused duplicate execution attempts in production. Without the key I would have shipped duplicate content.
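To make the replay behaviour concrete, here is a minimal sketch of how such a key could be derived and honoured. It assumes SHA-256 for the hashes and an in-memory dict as the key store; a real deployment would presumably keep the keys in a database table. IdempotentPublisher and fake_cms_publish are hypothetical names for illustration.

```python
import hashlib
import time


def make_idempotency_key(draft_id: str, content: str) -> str:
    """Hash of draft_id + content hash, as described in the signature's comment."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return hashlib.sha256(f"{draft_id}:{content_hash}".encode("utf-8")).hexdigest()


class IdempotentPublisher:
    """Wraps a publish function; replays within the TTL return the first result."""

    def __init__(self, publish_fn, ttl_seconds: float = 24 * 3600, clock=time.time):
        self._publish_fn = publish_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: dict[str, tuple[float, dict]] = {}  # key -> (stored_at, result)

    def publish(self, draft_id: str, content: str) -> dict:
        key = make_idempotency_key(draft_id, content)
        hit = self._seen.get(key)
        if hit is not None and self._clock() - hit[0] < self._ttl:
            return hit[1]  # replay: no second side effect
        result = self._publish_fn(draft_id, content)
        self._seen[key] = (self._clock(), result)
        return result


calls = []

def fake_cms_publish(draft_id, content):  # hypothetical stand-in for the CMS call
    calls.append(draft_id)
    return {"cms_post_id": f"post-{len(calls)}"}

publisher = IdempotentPublisher(fake_cms_publish)
a = publisher.publish("draft-1", "hello")
b = publisher.publish("draft-1", "hello")  # simulated retry after a timeout
print(a == b, len(calls))                  # True 1
```

Because the content hash is part of the key, an edited draft gets a new key and publishes normally, while a retry of the same draft does not.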
Write a prompt that survives contact with reality
I no longer write monolithic system prompts. I write a system prompt that is mostly constraints and a runtime context block that gets rebuilt every turn. The split matters because the system prompt is the contract and the context is the working memory.
My template:
SYSTEM PROMPT (stable, ~600 tokens):
- Role and goal (3 sentences max)
- Hard constraints ("never call publish_post without dry_run first on first attempt")
- Tool inventory with one-line guidance per tool
- Output format for the final answer (JSON schema)
- Stop conditions ("when topic_queue has 20 entries, call finalize and stop")
RUNTIME CONTEXT (rebuilt per turn):
- Current task ID and attempt number
- Tool call history compressed: last 3 calls in full, older ones summarized
- Relevant memory entries pulled from pgvector (top 5 by relevance)
- Budget left: tokens, tool calls, seconds
Two specific things I have learned the hard way:
Tell the agent its budget. When I added "you have 8 tool calls remaining and 4 minutes" to the runtime context, my average run cost dropped roughly 30%. Models are surprisingly good at rationing when they know the limit.
Make stop conditions explicit and machine-checkable. "Stop when the task is complete" is not a stop condition. "Stop when topic_queue count returned by count_queue() is >= 20" is.
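A rough sketch of how the runtime context block and a machine-checkable stop condition might be assembled each turn. The dictionary shapes, the one-line summarizer, and the function names are assumptions for illustration; a real summarizer could be an LLM call rather than a format string.

```python
def summarize_call(call: dict) -> str:
    """One-line summary for older tool calls."""
    status = "error" if call.get("error") else "ok"
    return f"{call['tool']} -> {status}"


def build_context(task_id: str, attempt: int, history: list[dict],
                  memories: list[str], budget: dict, keep_full: int = 3) -> str:
    """Rebuild the runtime context: recent calls in full, older ones summarized."""
    split = max(len(history) - keep_full, 0)
    older, recent = history[:split], history[split:]
    lines = [f"Task: {task_id} (attempt {attempt})"]
    if older:
        lines.append("Earlier calls: " + "; ".join(summarize_call(c) for c in older))
    for call in recent:
        lines.append(f"Call: {call['tool']}({call.get('args')}) -> {call.get('result')}")
    lines += [f"Memory: {m}" for m in memories[:5]]  # cap retrieved memory at 5
    lines.append(
        f"Budget: {budget['tool_calls']} tool calls, "
        f"{budget['tokens']} tokens, {budget['seconds']} seconds remaining"
    )
    return "\n".join(lines)


def stop_condition_met(count_queue, target: int = 20) -> bool:
    """Checkable stop rule: the queue count itself decides, not the model's opinion."""
    return count_queue() >= target


history = [{"tool": name, "args": {}, "result": "ok"} for name in "abcde"]
print(build_context("task-1", 1, history, ["brand voice: plain"],
                    {"tool_calls": 8, "tokens": 50_000, "seconds": 240}))
```

Keeping the budget line last means it is always the freshest thing the model reads before choosing its next call.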
Memory: short, long, and the part everyone gets wrong
Agents need three kinds of memory and most tutorials only cover one.
| Memory type | What it stores | Where I put it | TTL |
|---|---|---|---|
| Scratchpad | Current turn's reasoning, tool results | In-context, compressed each turn | Single run |
| Episodic | What happened in past runs (decisions, outcomes) | Postgres table, summarized | 30-90 days |
| Semantic | Facts the agent should "know" (brand voice, prior topics) | pgvector + BM25 hybrid (RRF) | Indefinite |
The part everyone gets wrong is episodic memory. Without it, your agent makes the same mistake every Tuesday. With it, you can write rules like "before generating a topic, check if a similar topic failed evaluation in the last 60 days, and if so, vary the angle."
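A rule like that could be backed by a small lookup over past episodes. The sketch below is an assumption-heavy illustration: the Episode shape is hypothetical, and difflib string similarity stands in for whatever similarity measure a real system would use (embeddings are a likely candidate).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from difflib import SequenceMatcher


@dataclass
class Episode:
    topic: str
    passed_eval: bool
    finished_at: datetime


def similar_recent_failures(candidate: str, episodes: list[Episode], now: datetime,
                            days: int = 60, threshold: float = 0.8) -> list[Episode]:
    """Return past failed episodes on a similar topic within the lookback window."""
    cutoff = now - timedelta(days=days)
    return [
        e for e in episodes
        if not e.passed_eval
        and e.finished_at >= cutoff
        and SequenceMatcher(None, candidate.lower(), e.topic.lower()).ratio() >= threshold
    ]


now = datetime(2026, 1, 31)
episodes = [
    Episode("Q3 pricing guide", passed_eval=False, finished_at=now - timedelta(days=10)),
    Episode("Q3 pricing guide", passed_eval=False, finished_at=now - timedelta(days=100)),
    Episode("Q3 pricing guide", passed_eval=True, finished_at=now - timedelta(days=5)),
]
hits = similar_recent_failures("q3 pricing guide", episodes, now)
print(len(hits))  # 1: only the failure inside the 60-day window counts
```

If the list is non-empty, the agent (or the control loop) can be told to vary the angle before generating the topic again.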
For semantic memory I use Postgres with pgvector and a BM25 index, then combine results with Reciprocal Rank Fusion. Pure vector search consistently missed exact-match keywords ("Q3 pricing" returned posts about Q1). RRF is 30 lines of SQL and fixes it.
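-- Simplified RRF combining vector and BM25
-- (ts_rank is Postgres's built-in text rank, standing in for a BM25 score here)
WITH vec AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rnk
  FROM memory
  WHERE site_id = $2
  ORDER BY rnk   -- without this, LIMIT keeps an arbitrary 50 rows
  LIMIT 50
),
bm AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank(tsv, plainto_tsquery($3)) DESC) AS rnk
  FROM memory
  WHERE site_id = $2
    AND tsv @@ plainto_tsquery($3)   -- rank only rows that match the keywords
  ORDER BY rnk
  LIMIT 50
)
SELECT id, SUM(1.0 / (60 + rnk)) AS score   -- 60 is the RRF smoothing constant k
FROM (SELECT * FROM vec UNION ALL SELECT * FROM bm) x
GROUP BY id
ORDER BY score DESC
LIMIT 10;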
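The fusion step itself is small enough to check by hand. Here is the same scoring rule, 1 / (k + rank) summed across rankings with k = 60, as a standalone Python sketch; the document ids are made up for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) for each id across all rankings."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


vector_hits = ["q1-pricing", "q3-pricing", "pricing-faq"]  # semantic neighbours
keyword_hits = ["q3-pricing", "q3-roadmap"]                # exact-match hits
fused = rrf_fuse([vector_hits, keyword_hits])
print([doc_id for doc_id, _ in fused])
# ['q3-pricing', 'q1-pricing', 'q3-roadmap', 'pricing-faq']
```

The document that appears in both lists wins even though it was only second in the vector ranking, which is the exact-match rescue described above.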
The control loop: the part that decides if you sleep at night
A naive agent loop is while not done: llm_call(); execute_tool(). That is how you get a 3am token explosion. Mine looks like this:
# Helpers (load_or_init_state, Budget, build_context, llm, policy, execute_tool,
# persist_state, handoff) are defined elsewhere in the agent's codebase.
def run_agent(task, max_tool_calls=10, max_tokens=80_000, max_seconds=300):
    state = load_or_init_state(task.id)  # resumes a previous run if state exists
    budget = Budget(max_tool_calls, max_tokens, max_seconds)
    while not state.done:
        if budget.exhausted():
            return handoff(state, reason="budget")
        context = build_context(task, state, budget.remaining())
        response = llm.call(SYSTEM_PROMPT, context, tools=TOOLS)
        budget.charge_tokens(response.usage)
        if response.stop:
            state.done = True
            persist_state(state)  # record completion so a replay does not redo the run
            break
        for call in response.tool_calls:
            if not policy.allows(call, state):
                state.append_tool_result(call, error="policy_denied")
                continue
            result = execute_tool(call, idempotency=state.run_id)
            state.append_tool_result(call, result)
            budget.charge_call()
        persist_state(state)  # so we can resume on crash
    return state.final_output
Five things in there that matter more than they look:
- persist_state every turn. When Lambda times out at 15 minutes, the next invocation resumes from the last persisted turn. No work lost.
- policy.allows is a separate function, not a prompt instruction. Hard rules ("never publish before 9am UTC", "never call paid APIs if budget < $0.20") live in code, not in English.
- handoff is a real state, not a failure. When the budget runs out, the agent writes its current state and the reason to a queue. A human or another agent picks it up. Agents that cannot ask for help are agents that fail silently.
- Idempotency tied to run_id. Replays do not double-execute side effects.
- build_context charges into the budget too. Big context windows are deceptively expensive. I cap retrieved memory at 5 entries and summarize old tool calls aggressively.
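The loop leans on two helpers that are not shown: the budget and the policy check. A minimal sketch of what they could look like follows. It is an assumption, not the production code: this Budget takes a plain token count, and policy_allows is reduced to the single "no paid APIs below $0.20" rule with hypothetical tool names.

```python
import time
from typing import Callable


class Budget:
    """Tracks tool calls, tokens, and wall time against hard limits."""

    def __init__(self, max_tool_calls: int, max_tokens: int, max_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_tool_calls = max_tool_calls
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self._clock = clock
        self._started = clock()
        self.tool_calls_used = 0
        self.tokens_used = 0

    def charge_call(self) -> None:
        self.tool_calls_used += 1

    def charge_tokens(self, tokens: int) -> None:
        self.tokens_used += tokens

    def remaining(self) -> dict:
        elapsed = self._clock() - self._started
        return {
            "tool_calls": max(self.max_tool_calls - self.tool_calls_used, 0),
            "tokens": max(self.max_tokens - self.tokens_used, 0),
            "seconds": max(self.max_seconds - elapsed, 0.0),
        }

    def exhausted(self) -> bool:
        # Any single dimension hitting zero ends the run.
        return any(value <= 0 for value in self.remaining().values())


def policy_allows(tool_name: str, remaining_usd: float, paid_tools: set[str],
                  min_usd: float = 0.20) -> bool:
    """Hard rule in code: no paid API calls once the dollar budget is below min_usd."""
    return not (tool_name in paid_tools and remaining_usd < min_usd)


budget = Budget(max_tool_calls=2, max_tokens=80_000, max_seconds=300)
budget.charge_call()
budget.charge_call()
print(budget.exhausted())                               # True
print(policy_allows("serp_api", 0.10, {"serp_api"}))    # False
```

Injecting the clock keeps the time limit testable without sleeping, which matters once these limits are gating real runs.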
Evaluation: the part that turns a demo into a product
If you cannot tell me your agent's success rate on a fixed eval set, you do not have a production agent. You have a demo that has not failed yet.
I run three evaluation layers:
Layer 1: Unit-level. Each tool has tests with golden inputs and outputs. Boring, fast, runs on every commit.
Layer 2: Trajectory eval. A frozen set of 30-50 tasks with expected outcomes. Run the full agent, score the trajectory and the final output. For scoring I use a mix: deterministic checks where possible (did the topic queue have 20 entries? was the schema valid?) and an LLM judge for subjective parts (was the topic on brand?). The LLM judge runs with a calibrated rubric and I spot-check 10% of its scores against my own judgment monthly.
Layer 3: Production telemetry. Every run logs: tool calls made, tokens used, wall time, budget exhausted, handoff reason, and a sample of final outputs. I look at the dashboard every Monday. Drift shows up in tool call counts before it shows up in output quality.
A real number from my own systems: when I added trajectory eval to the content agent and started gating deploys on it, my "weird output" rate dropped from roughly 1 in 25 runs to under 1 in 200. Not zero. Never zero. But low enough to leave running unattended.
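The deterministic half of a trajectory eval, plus a deploy gate, can be sketched in a few lines. Everything named here is illustrative: the check names, the 0.95 threshold, and fake_agent are assumptions, and the LLM-judge layer is left out.

```python
from typing import Callable

Check = Callable[[dict], bool]

CHECKS: dict[str, Check] = {
    "has_20_topics": lambda out: len(out["topics"]) >= 20,
    "scores_ok": lambda out: all(t["score"] >= 0.6 for t in out["topics"]),
}


def evaluate(agent_fn, cases: list[dict], checks: dict[str, Check]) -> dict:
    """Run every frozen case through the agent and record which checks failed."""
    failures = []
    for case in cases:
        output = agent_fn(case["task"])
        failed = [name for name, check in checks.items() if not check(output)]
        if failed:
            failures.append((case["id"], failed))
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases) if cases else 0.0, "failures": failures}


def deploy_allowed(report: dict, min_pass_rate: float = 0.95) -> bool:
    return report["pass_rate"] >= min_pass_rate


def fake_agent(task: str) -> dict:  # stand-in for a full agent run
    count = 12 if task == "thin-site" else 20
    return {"topics": [{"title": f"{task}-{i}", "score": 0.7} for i in range(count)]}


cases = [{"id": "c1", "task": "site-a"}, {"id": "c2", "task": "thin-site"}]
report = evaluate(fake_agent, cases, CHECKS)
print(report["pass_rate"], report["failures"])  # 0.5 [('c2', ['has_20_topics'])]
print(deploy_allowed(report))                   # False
```

Recording which check failed, not just that a case failed, is what makes the Monday dashboard review quick.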
Deployment: serverless, scheduled, and boring
I deploy almost every agent as a Lambda triggered by EventBridge on a schedule, with state in Postgres (Supabase), secrets in AWS Secrets Manager, and observability through CloudWatch + a thin custom dashboard. Nothing exotic.
A few opinions:
- Scale to zero or stay home. If your agent runs 4 times a day, a 24/7 container is a waste. Lambda + EventBridge costs me cents per agent per month.
- One agent per Lambda. Co-locating agents to "save cold starts" creates dependency hell. Cold starts are 2 seconds. You will live.
- Dead-letter queue for everything. Failed runs go to SQS with the full state. I review the DLQ weekly.
- Feature flags for tools. I gate every new tool behind a flag. New tool ships disabled, gets enabled for one site, evaluated, then rolled out. This has caught two tool-design mistakes before they hit all sites.
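The tool flag rollout can be as simple as filtering the tool inventory before it is handed to the model. A minimal sketch, assuming flags live in a dict (in practice they might sit in a Postgres table); analyze_serp and rewrite_intro are hypothetical tool names.

```python
TOOL_FLAGS = {
    "fetch_recent_topics": {"enabled": True, "sites": None},   # None = all sites
    "publish_post": {"enabled": True, "sites": None},
    "analyze_serp": {"enabled": True, "sites": {"site-a"}},    # pilot on one site
    "rewrite_intro": {"enabled": False, "sites": None},        # shipped disabled
}


def tools_for_site(site_id: str, flags: dict = TOOL_FLAGS) -> list[str]:
    """Return the tool names the agent may see for this site."""
    return sorted(
        name for name, flag in flags.items()
        if flag["enabled"] and (flag["sites"] is None or site_id in flag["sites"])
    )


print(tools_for_site("site-a"))  # ['analyze_serp', 'fetch_recent_topics', 'publish_post']
print(tools_for_site("site-b"))  # ['fetch_recent_topics', 'publish_post']
```

A disabled tool never reaches the prompt, so the model cannot call it by accident while it is still being evaluated.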
The full production stack for a typical agent in my BizFlowAI ContentStudio:
EventBridge (cron)
-> Lambda (agent runner, max 15 min)
-> Postgres (state, memory, queues)
-> Claude API or local Ollama (LLM)
-> Tool Lambdas (publish, fetch, analyze)
-> CloudWatch (logs, metrics, alarms)
-> SQS DLQ (failed runs)
-> Dashboard (Next.js, reads Postgres)
What I would do if I were starting today
If you are building your first production agent in 2026, my opinionated shortlist:
- Pick one workflow that already exists as a human task. Document it step by step. That document becomes your job spec and your eval set.
- Build the tools first, with tests. Stub the agent loop. You can test 80% of the system without an LLM in the mix.
- Use Claude or GPT for the main loop, not a local model. Local LLMs are great for embeddings and classification side-tasks; for the agent's reasoning you want the strongest model you can afford. The cost difference is smaller than the reliability difference.
- Hard budgets in code, not in prompts. Always.
- Evaluation before scale. Do not deploy to 10 use cases until you have 50 eval cases passing on 1.
- Plan the handoff path on day one. What happens when the agent gets stuck? Who or what picks it up? If the answer is "nothing", you will find out at 3am.
The agents I trust to run while I sleep are not the smartest ones. They are the ones with the tightest tool contracts, the most boring control loop, and the eval set I actually run.
If you are working on an agent that needs to leave the demo stage and survive in production, or you want a second pair of eyes on an architecture before you commit to it, I am happy to talk. You can reach me at lazar-milicevic.com/#contact, or browse more posts on the blog where I write about RAG, evaluation, and the unglamorous parts of shipping AI systems.
Frequently asked questions
How do I design an AI agent that's reliable enough for production?
I start with a written job spec, not a framework choice. The spec defines what the agent owns, what it must never do, the exact 'done' criteria, and the signals that indicate failure. For example: 'generate 20 ranked topics per week, never publish, max 8 paid API calls per run, finish in under 6 minutes.' If you can't write that paragraph clearly, you'll end up with a vague 'do everything' prompt that hallucinates through ambiguity. As a rule of thumb, if the spec needs more than 5 tools or more than 3 decision branches, split it into two agents. This spec also becomes your evaluation rubric later, so it's worth writing carefully.
Why do AI agents fail in production and how do I prevent it?
In my experience debugging production agents, most failures are tool failures, not prompt failures: the model calls a tool with wrong arguments, gets back a massive JSON blob it can't parse, or picks the wrong tool because two have overlapping responsibilities. I fix this by treating tools like a public API for a difficult customer (the LLM). Each tool does one thing and returns small structured results (e.g., 10 rows plus a has_more flag instead of 200 rows), tool and parameter names are self-documenting like fetch_recent_topics(days, min_score), and every tool has an idempotency key plus a dry-run mode. The idempotency key alone has saved me at least four times from duplicate publishes caused by EventBridge retries, Lambda timeouts, and network blips.
How should I structure the system prompt for an AI agent?
I split the prompt into a stable system prompt (~600 tokens) and a runtime context block that gets rebuilt every turn. The system prompt holds the contract: role and goal in 3 sentences, hard constraints, a tool inventory with one-line guidance, the output JSON schema, and explicit stop conditions. The runtime context holds working memory: current task ID, compressed tool call history (last 3 in full, older summarized), top 5 relevant memory entries from vector search, and remaining budget. Two things made a measurable difference for me: telling the agent its budget ('8 tool calls remaining, 4 minutes') cut average run cost by roughly 30%, and making stop conditions machine-checkable like 'stop when count_queue() >= 20' instead of 'stop when complete.'
What types of memory does an AI agent need?
Agents need three kinds of memory, and most tutorials only cover the first. Scratchpad memory holds the current turn's reasoning and tool results, lives in-context, and is compressed each turn. Episodic memory stores what happened in past runs (decisions and outcomes) in a Postgres table with a 30 to 90 day TTL, and it's the one everyone skips; without it, your agent repeats the same mistakes weekly. Semantic memory stores durable facts like brand voice and prior topics in pgvector with BM25, kept indefinitely. Episodic memory is what lets you write rules like 'before generating a topic, check if a similar one failed evaluation in the last 60 days and vary the angle.'
Should I use vector search or keyword search for AI agent memory?
I use both, combined with Reciprocal Rank Fusion (RRF). Pure vector search consistently missed exact-match keywords in my production agents: a query for 'Q3 pricing' returned posts about Q1 because the embeddings considered them semantically similar. Pure BM25 misses paraphrases and conceptual matches. RRF combines the rankings from both a pgvector similarity search and a BM25 index in about 30 lines of SQL, and it fixed the exact-match problem without sacrificing semantic recall. For any agent that retrieves from a knowledge base, hybrid retrieval with RRF is my default.