LLM Evals: Hamel's Method vs a Lean In-House Setup

Every team I talk to about LLM evals is stuck in the same place: they know Hamel Husain's writing is the gold standard, they've bookmarked his "Your AI Product Needs Evals" post, and yet three months later they still have zero evals in production. The gap isn't intellectual. It's operational. Hamel's method is thorough because it has to be, and thorough is expensive.
I've run both paths. On BizFlowAI ContentStudio I started with a stripped-down in-house harness in week one, then grew into something closer to Hamel's approach as the surface area got bigger. Here's the honest side-by-side, and when each one is the right call.
Verdict in the first 200 words
If you have one LLM feature, one PM, and you need to ship this quarter: build a lean in-house eval framework in a week. Twenty to fifty hand-labeled traces, a code-based assertion pass, and one LLM-as-judge rubric per failure mode. Total cost: roughly $200 in API spend, five engineering days, no new hires. You'll catch roughly 70% of regressions (my estimate) and you'll know which 30% you're missing.
If you have three or more LLM features touching real customers, a compliance surface, or an agent loop that can take actions: invest in Hamel's full methodology. Error analysis, open coding, axial coding, custom judges per failure mode, held-out validation sets, and a review app your PMs will actually use. Budget six to ten engineering weeks and one dedicated eval owner. You'll pay $3k to $15k in judge API costs per month depending on volume, but you'll have a system that survives model swaps, prompt refactors, and the inevitable "we need to change the base model next quarter" conversation.
Neither approach is wrong. They're for different stages. Skipping the lean version because Hamel's is "the right way" is how teams end up with zero evals for a year.
The comparison table
| Dimension | Hamel's full methodology | Lean in-house framework |
|---|---|---|
| Time to first eval running | 3-6 weeks | 3-5 days |
| Engineering effort | 6-10 weeks, 1 dedicated owner | 1 engineer, part-time |
| Ongoing cost (judge API) | $3k-$15k/month | $100-$500/month |
| Trace volume needed | 500+ labeled | 20-50 labeled |
| Coverage | Multi-turn, multi-failure-mode, calibrated judges | Single-turn assertions + 1-2 judges |
| Failure mode discovery | Systematic (open + axial coding) | Ad hoc from support tickets and logs |
| Judge calibration | Explicit, with agreement metrics vs humans | Spot-checked, not measured |
| Team skill required | Senior engineer + product analyst mindset | Any mid-level engineer |
| Regression catch rate (my estimate) | 90-95% | 65-75% |
| Model swap resilience | High | Medium |
| Best for | Agents, RAG at scale, compliance-adjacent | Single-feature LLM apps, MVPs, PoCs |
The numbers on regression catch rate are from my own systems, not a study. Take them as directional. What I can say with certainty: the lean version catches the regressions that would embarrass you publicly. Hamel's version catches the ones that would embarrass you six months later when a customer with logs of their own shows you a pattern you missed.
What Hamel's approach actually is (the parts that matter)
If you've read Hamel's eval field guide, you know the outline. What people miss on first read is that the method is really three things stacked:
- Error analysis, done by looking at traces. Not vibes. You sit down with 100+ real traces, you open-code failure modes (call them whatever you want on the first pass), then you axial-code (group them into a taxonomy). This is the step everyone skips. It's also the step that makes everything else work.
- Custom judges per failure mode. Not one giant "is this good?" judge. One judge per axis (hallucination, tone, refusal, tool-call correctness, citation accuracy) with a rubric written from real examples, calibrated against human labels until you hit reasonable agreement (Hamel talks about Cohen's kappa; I aim for 0.6+ on the failure modes that matter).
- A review app your non-engineers will actually use. Streamlit, Gradio, whatever. The point is that a PM or SME can label 50 traces in an afternoon without opening a Jupyter notebook.
The reason this works is that it turns "evals" from a vague quality problem into a concrete measurement problem. You know your top three failure modes, you have a judge for each, you have a held-out set, and you can compare v1 to v2 with a number.
The reason it's expensive is that step one alone (labeling 100+ traces) is a week of a senior person's time, and you have to repeat it every time your product surface changes.
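The calibration in the second bullet comes down to comparing judge verdicts with human labels on the same traces. Here is a minimal sketch of Cohen's kappa for that comparison in plain Python. The two label lists are invented for illustration, and if you already depend on scikit-learn, its `cohen_kappa_score` computes the same statistic.

```python
def cohens_kappa(human, judge):
    """Agreement between two label lists, corrected for chance agreement."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Fraction of traces where the human and the judge gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected if both labeled at random with their own label rates.
    categories = set(human) | set(judge)
    expected = sum((human.count(c) / n) * (judge.count(c) / n) for c in categories)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for ten traces, in the same trace order.
human_labels = ["PASS", "PASS", "PASS", "FAIL", "FAIL", "PASS", "FAIL", "PASS", "PASS", "FAIL"]
judge_labels = ["PASS", "PASS", "FAIL", "FAIL", "FAIL", "PASS", "PASS", "PASS", "PASS", "FAIL"]

kappa = cohens_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.2f}")
```

On these invented labels the judge agrees with the human on 8 of 10 traces, yet kappa comes out at about 0.58, just under the 0.6 bar, because a good share of that agreement would happen by chance.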
What "lean in-house" actually means
Here's the version I build in a week when someone asks me to bootstrap evals for a single LLM feature. This is the honest minimum.
Step 1: Freeze 20-50 real traces
Pull them from logs. Real inputs, real outputs, real context. If you don't have logs yet, you have a bigger problem than evals. Save them as JSONL with input, output, retrieved_context (if RAG), and any tool calls.
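A minimal sketch of what that frozen file can look like, using only the standard library. The field names follow the list above; `trace_id` and the example values are my own additions for illustration, not a required schema.

```python
import json
from pathlib import Path


def save_traces(traces, path):
    """Write one JSON object per line (JSONL)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for trace in traces:
            f.write(json.dumps(trace, ensure_ascii=False) + "\n")


def load_traces(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical trace; every value here is invented for illustration.
example_trace = {
    "trace_id": "t-0001",
    "input": "How do I reset my password?",
    "output": "Go to Settings > Security and choose Reset password.",
    "retrieved_context": ["Passwords can be reset under Settings > Security."],
    "tool_calls": [],
}
```

One object per line keeps the file diffable in git and lets you append new traces without rewriting the rest.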
Step 2: Label them yourself
Not your team. You. One afternoon. Two columns: pass (boolean) and failure_mode (free text). By trace 30 you'll see the same three or four failure modes over and over. Those are your real problems.
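Once the labels exist, a quick tally shows which failure modes repeat. This is a small sketch that assumes each label is a dict with `pass` and `failure_mode` keys matching the two columns above; the sample rows are invented.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (pass_rate, failure modes sorted from most to least common)."""
    if not labels:
        raise ValueError("no labels to summarize")
    passed = sum(1 for row in labels if row["pass"])
    # Normalize the free text a little so "Invented URL" and "invented url" merge.
    modes = Counter(
        row["failure_mode"].strip().lower()
        for row in labels
        if not row["pass"] and row.get("failure_mode")
    )
    return passed / len(labels), modes.most_common()


# Invented labels for illustration.
labels = [
    {"trace_id": "t-0001", "pass": True, "failure_mode": ""},
    {"trace_id": "t-0002", "pass": False, "failure_mode": "Invented URL"},
    {"trace_id": "t-0003", "pass": False, "failure_mode": "too long"},
    {"trace_id": "t-0004", "pass": False, "failure_mode": "invented url"},
]
pass_rate, top_modes = summarize_labels(labels)
```

Free text drifts, so expect to merge near-duplicate labels by hand before trusting the counts.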
Step 3: Write code-based assertions for anything deterministic
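```python
# extract_citations and count_tokens are helpers you supply for your own
# citation format and tokenizer.

def assert_no_hallucinated_citations(output, context):
    """Fail if the output cites anything that does not appear in the context."""
    cited = extract_citations(output)
    for c in cited:
        if c not in context:
            return False, f"cited {c} not in context"
    return True, None


def assert_response_length(output, max_tokens=800):
    """Fail if the output is longer than the UI can display."""
    if count_tokens(output) > max_tokens:
        return False, "too long"
    return True, None
```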
You'd be surprised how many "LLM quality" problems are just "the output is 3000 tokens when the UI expects 500" or "it invented a source URL." Code assertions catch these for free, run in milliseconds, and never drift.
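To make checks like these runnable end to end, the two helpers need real bodies and something has to loop over them. The sketch below uses deliberately crude stand-ins that are my assumptions, not a recommendation: citations are treated as URLs found by a regex, and tokens are approximated by whitespace-separated words. Swap in your own citation format and your model's tokenizer.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def extract_citations(output):
    """Stand-in: treat every URL in the output as a citation."""
    return [url.rstrip(".,;") for url in URL_PATTERN.findall(output)]


def count_tokens(text):
    """Stand-in: whitespace word count, only a rough proxy for model tokens."""
    return len(text.split())


def check_citations(trace):
    context = "\n".join(trace.get("retrieved_context", []))
    missing = [c for c in extract_citations(trace["output"]) if c not in context]
    if missing:
        return False, f"cited but not in context: {missing}"
    return True, None


def check_length(trace, max_tokens=800):
    n = count_tokens(trace["output"])
    if n > max_tokens:
        return False, f"{n} tokens, limit {max_tokens}"
    return True, None


CHECKS = [("citations", check_citations), ("length", check_length)]


def run_checks(trace, checks=CHECKS):
    """Return a list of (check_name, reason) for every failed check."""
    failures = []
    for name, check in checks:
        ok, reason = check(trace)
        if not ok:
            failures.append((name, reason))
    return failures


# Invented trace: the second URL does not appear in the retrieved context.
trace = {
    "output": "See https://example.com/docs and https://made-up.example/x.",
    "retrieved_context": ["Docs live at https://example.com/docs"],
}
failures = run_checks(trace)
```

Every check takes a trace and returns `(ok, reason)`, so adding a new one means appending to `CHECKS` and nothing else.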
Step 4: One LLM-as-judge per non-deterministic failure mode
Pick the two failure modes your code can't check. Write one judge each. Use a strong model (Claude Sonnet or GPT-4-class) as the judge. Rubric in the prompt, few-shot examples from your labeled set, output a JSON verdict with a reason.
```python
JUDGE_PROMPT = """
You are evaluating whether a customer-support answer stays on-topic.
Rubric:
- PASS if the answer addresses the customer's actual question
- FAIL if the answer answers a different question or dodges
Examples:
[3 concrete examples from your labeled traces]
Return JSON: {"verdict": "PASS"|"FAIL", "reason": "..."}
"""
```
Step 5: Run it in CI on every prompt change
Cache judge outputs on unchanged (input, output) pairs so you're not burning API money re-judging identical traces. When most traces are unchanged between runs, this one detail can save around 80% of your judge cost.
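A minimal sketch of such a cache, backed by a single JSON file. The class and function names are hypothetical. I also fold the judge name and judge prompt into the key, which is my own choice and goes beyond the (input, output) pair described above: it means editing a rubric invalidates that judge's old verdicts.

```python
import hashlib
import json
from pathlib import Path


def cache_key(judge_name, judge_prompt, trace):
    """Stable hash of everything that should change the judge's verdict."""
    payload = json.dumps(
        {
            "judge": judge_name,
            "prompt": judge_prompt,
            "input": trace["input"],
            "output": trace["output"],
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JudgeCache:
    """Verdicts keyed by cache_key, persisted to one JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self.entries = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.entries = {}

    def get_or_compute(self, key, compute):
        """Return the cached verdict, calling compute() only on a miss."""
        if key not in self.entries:
            self.entries[key] = compute()
            self.path.write_text(json.dumps(self.entries, indent=2), encoding="utf-8")
        return self.entries[key]
```

In a runner this wraps the judge call: `cache.get_or_compute(cache_key(name, prompt, trace), lambda: judge(trace))`, where `judge` is whatever function produces your verdict dict.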
That's the whole framework. A JSONL file, a runner script, code assertions, one or two judges, a CI hook. Total code: maybe 300 lines. Total cost: a couple hundred dollars in judge calls to get started.
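The one piece of glue the steps leave implicit is turning a judge's reply into a verdict you can count. Here is a sketch under stated assumptions: `call_model` is a placeholder for any function that takes a prompt string and returns the model's text, so no provider SDK appears here, and `fake_model` is a canned stand-in that lets the example run offline.

```python
import json


def parse_verdict(raw):
    """Extract {"verdict", "reason"} from a judge reply; raise on anything else."""
    # Judges sometimes wrap the JSON in extra prose, so cut out the object first.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError(f"no JSON object in judge reply: {raw!r}")
    data = json.loads(raw[start:end + 1])
    verdict = str(data.get("verdict", "")).upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"unexpected verdict: {data.get('verdict')!r}")
    return {"verdict": verdict, "reason": str(data.get("reason", ""))}


def run_judge(judge_prompt, trace, call_model):
    """call_model: any function str -> str that wraps your provider's SDK."""
    message = (
        f"{judge_prompt}\n"
        f"Customer question:\n{trace['input']}\n\n"
        f"Answer to evaluate:\n{trace['output']}\n"
    )
    return parse_verdict(call_model(message))


def fake_model(_message):
    """Canned reply for illustration; a real judge call goes here."""
    return 'My verdict: {"verdict": "fail", "reason": "Answers a billing question instead."}'


result = run_judge(
    "Rubric: PASS if on-topic, FAIL otherwise.",
    {"input": "How do I reset my password?", "output": "Your invoice is attached."},
    fake_model,
)
```

Raising on a malformed reply is deliberate: a judge that silently defaults to PASS hides exactly the regressions you built it to catch.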
Where the lean version breaks
I want to be fair about this because I've hit each of these walls.
Multi-turn agent traces. A single-turn assertion tells you nothing about whether an agent recovered from a bad tool call three turns ago. If you're building agentic workflows, you need trajectory-level evals, and that's where Hamel's methodology earns its keep.
Judge drift when you change models. Your Claude Sonnet judge and your GPT-4o judge will disagree on maybe 15-20% of edge cases. If you never measure agreement with humans, you don't know if your judge is calibrated or just consistent. Hamel's approach forces you to measure this. The lean version assumes it and hopes.
Failure modes you didn't think of. The lean version optimizes against the failure modes you saw in 30 traces. There are failure modes hiding in the 3000 traces you didn't label. Systematic error analysis surfaces them; ad hoc labeling doesn't.
PM/SME involvement. If your evals live in a Python script only you can run, you are the eval system. That's fine for a PoC. It's a liability for a product.
How to grow from lean to full without a rewrite
This is the part almost nobody writes about. The good news: if you build the lean version with the right data model, growing into Hamel's full methodology is additive, not a rewrite.
Three decisions to get right on day one:
- Store traces as structured JSONL from the start. input, output, retrieved_context, tool_calls, metadata (model, prompt_version, timestamp, trace_id). If you skimp here, you'll re-instrument in month three.
- Version your prompts. Every prompt gets a hash and a git commit. Every eval run records which prompt version it ran against. Without this, comparing v1 to v2 is guesswork.
- Separate the judge from the runner. Your assertion runner and your LLM judges should be pluggable. When you add three more judges next quarter, you don't touch the runner.
With those three in place, growing to Hamel's methodology looks like: label more traces, add more judges, add a Streamlit review app, start measuring judge-human agreement. Not a rewrite. An expansion.
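The second and third decisions can be sketched in a few lines. This assumes a content hash is an acceptable prompt version id and that every check, code assertion or LLM judge, shares one signature: trace in, `(ok, reason)` out. `EvalRunner` and `prompt_version` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass, field


def prompt_version(prompt_text):
    """Short content hash, so any edit to the prompt yields a new version id."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


@dataclass
class EvalRunner:
    """Runs every registered check against every trace; knows nothing about judges."""

    checks: dict = field(default_factory=dict)

    def register(self, name, check):
        # check is any callable: trace -> (ok, reason)
        self.checks[name] = check

    def run(self, traces, prompt_text):
        version = prompt_version(prompt_text)
        results = []
        for trace in traces:
            for name, check in self.checks.items():
                ok, reason = check(trace)
                results.append({
                    "trace_id": trace["trace_id"],
                    "check": name,
                    "prompt_version": version,
                    "pass": ok,
                    "reason": reason,
                })
        return results


runner = EvalRunner()
runner.register("non_empty", lambda t: (bool(t["output"].strip()), None))
results = runner.run(
    [{"trace_id": "t-0001", "output": "Hello"}, {"trace_id": "t-0002", "output": " "}],
    prompt_text="You are a helpful support agent.",
)
```

Because each result row carries `prompt_version`, comparing v1 to v2 later is a filter on stored results instead of a rerun from memory.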
What I'd do
If I were bootstrapping evals for a client tomorrow, this is the sequence, in order:
- Day 1-2: Instrument logging so every LLM call produces a structured trace. Version the prompts. This is non-negotiable and it's the highest-ROI thing in the whole list.
- Day 3: Pull 30 real traces. Label them myself. Identify the top three failure modes.
- Day 4: Write code assertions for anything deterministic. Write one LLM judge for the worst non-deterministic failure mode.
- Day 5: Wire it into CI. Set a threshold (say, 85% pass on the labeled set) and fail the build if a prompt change drops below it.
- Week 2-4: Ship. Watch production. Add traces from real user complaints to the eval set. Add a second judge if a second failure mode becomes prominent.
- Month 2-3, only if the product justifies it: Do the full Hamel-style error analysis. Label 200+ traces. Build the review app. Measure judge-human agreement. Add held-out validation.
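The Day 5 gate is small enough to show whole. This sketch assumes eval results arrive as a list of dicts with `trace_id`, `pass`, and `reason` keys, which is my own shape for illustration, and it uses the 85% threshold from the list as its default.

```python
def pass_rate(results):
    """Fraction of results that passed."""
    if not results:
        raise ValueError("no eval results; refusing to report a pass rate")
    return sum(1 for r in results if r["pass"]) / len(results)


def gate(results, threshold=0.85):
    """Return a process exit code: 0 if the pass rate meets the threshold, else 1."""
    rate = pass_rate(results)
    print(f"pass rate {rate:.1%} (threshold {threshold:.0%})")
    for r in results:
        if not r["pass"]:
            print(f"  FAIL {r['trace_id']}: {r.get('reason')}")
    return 0 if rate >= threshold else 1


# Invented results: 9 of 10 traces pass.
results = [{"trace_id": f"t-{i:04d}", "pass": i != 3, "reason": None} for i in range(10)]
exit_code = gate(results)

# In a CI script, finish with: sys.exit(gate(results))
```

An empty result set raises instead of passing, so a broken trace loader fails the build too.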
The mistake I see most often is teams jumping to step 6 before step 1. They read Hamel's post, decide they need the full setup, and six weeks later they have a beautiful methodology document and zero evals running. Meanwhile, a competitor with a scrappy 300-line script is shipping prompt changes with confidence.
Hamel's method is right. It's also a destination, not a starting line. Start lean. Grow into it when the product tells you to.
If you're stuck between "we have no evals" and "we don't have three months to build them properly," or if you want a second pair of eyes on an existing eval setup that isn't catching what it should, I do this kind of work. Reach out at lazar-milicevic.com/#contact, or read more on evals and production LLM patterns on the blog.
Frequently asked questions
Should I use Hamel Husain's full eval methodology or build a lean in-house eval framework?
It depends on your stage and surface area, not on which is "more correct." If you have one LLM feature and need to ship this quarter, build a lean in-house framework in a week: 20-50 hand-labeled traces, code assertions, and one LLM-as-judge rubric per failure mode, for roughly $200 in API spend and five engineering days. If you have three or more LLM features, a compliance surface, or an agent loop that takes actions, invest in Hamel's full methodology with six to ten engineering weeks and a dedicated eval owner. Skipping the lean version because Hamel's is "the right way" is how teams end up with zero evals for a year.
How much does it actually cost to run LLM evals in production?
For a lean in-house setup on a single LLM feature, expect around $100-$500 per month in judge API costs, plus about five engineering days upfront. For Hamel's full methodology at scale, budget $3k-$15k per month in judge API costs depending on trace volume, six to ten engineering weeks to build, and one dedicated eval owner ongoing. The biggest hidden cost in the full methodology is the human labeling time: 100+ traces per error-analysis cycle is roughly a week of a senior person's time, and you repeat it whenever the product surface changes.
What are the three core parts of Hamel Husain's LLM eval methodology?
Hamel's method is really three stacked components. First, error analysis done by reading 100+ real traces, open-coding failure modes, then axial-coding them into a taxonomy. This is the step most teams skip, and it's what makes the rest work. Second, custom LLM-as-judge prompts per failure mode (hallucination, tone, tool-call correctness, citation accuracy, etc.), each calibrated against human labels; I aim for Cohen's kappa of 0.6+ on the modes that matter. Third, a review app in Streamlit or Gradio so PMs and SMEs can label traces without touching a notebook. Together these turn "evals" from a vague quality problem into a concrete measurement problem.
What's the minimum viable LLM eval setup I can build in a week?
Start with three steps. Pull 20-50 real traces from your logs as JSONL with input, output, retrieved context, and any tool calls. Label them yourself in one afternoon with two columns, pass/fail and free-text failure mode. By trace 30 you'll see the same three or four problems repeat, and those are your real failure modes. Then write code-based assertions for anything deterministic (response length, citation validity, JSON schema) plus one LLM-as-judge rubric per recurring failure mode. This catches roughly 65-75% of regressions for about $200 in API spend and five engineering days.
Why should I use code-based assertions instead of only LLM-as-judge evals?
Because many "LLM quality" problems aren't quality problems at all; they're deterministic failures that don't need a judge. A lot of production issues are things like "the output is 3000 tokens when the UI expects 500" or "the model invented a source URL that isn't in the retrieved context." Code assertions catch these, run in milliseconds, cost nothing, and never disagree with themselves. Reserve LLM-as-judge for the genuinely subjective axes like tone, helpfulness, or nuanced correctness where a rule can't capture the intent.