
Lazar Milicevic vs Hamel Husain: LLM Eval Approaches

By Lazar Milicevic · June 25, 2026 · 9 min read


I keep getting asked which LLM eval playbook to follow. Most engineers I talk to have read Hamel Husain's writing on evals (rightly so: it's the clearest public material on the topic) and want to know how my approach differs when I'm hired to build production AI systems. The honest answer is that we agree on more than we disagree, and the differences come down to context. Hamel writes as an independent consultant teaching teams to build their own eval muscle. I get hired to ship the autonomous system, eval harness included, and hand it over running.

This is the side-by-side I wish existed when I was figuring out my own stack.

The short version: where we overlap, where we diverge

Hamel's core thesis, laid out in his widely cited post "Your AI Product Needs Evals" and the follow-up "A Field Guide to Rapidly Improving AI Products", is that most teams skip the boring data work and jump to fancy frameworks. He pushes error analysis, looking at data, and building domain-specific assertions before reaching for LLM-as-judge. I agree with every word of that.

Where I diverge is downstream. Hamel's deliverable is usually a team that knows how to do this work. Mine is a running system: scheduled eval runs, CI gates, a dashboard the founder actually checks, and a feedback loop that updates the eval set automatically from production traces. Different jobs, different artifacts.

A 2025 survey from Arize AI's State of AI Engineering report found that ~60% of teams shipping LLM features still rely primarily on vibe-checks rather than systematic evals. That gap is exactly where both of us spend our time — we just plug into it at different points.

Methodology: the eval funnel I actually run

Hamel's funnel, simplified, is: look at traces → write assertions → measure → iterate. Mine is the same shape, with two additions I've learned the hard way running content and RAG systems unattended for months at a time.

Here's the pipeline I deploy:

1. Trace logging from day one. Every LLM call, every tool call, every retrieved chunk gets written to Postgres with a trace_id. Not optional. If you can't replay it, you can't eval it.
2. Manual error analysis on the first 100-200 traces. I sit down with the founder or domain expert and we label failures into categories. This is exactly Hamel's step and it's non-negotiable.
3. Code-based assertions first. Regex, JSON schema validation, length bounds, forbidden-phrase checks, citation presence. Cheap, deterministic, runs in CI.
4. LLM-as-judge only for the fuzzy stuff. Tone, faithfulness to source, instruction following. Always with a rubric, always calibrated against human labels.
5. A golden set that grows from production. Every confirmed failure in prod becomes a permanent test case. This is the part I obsess over.
6. Scheduled regression runs. EventBridge fires the eval suite nightly against the latest model + prompt. Slack alert on regression. I sleep.

Step 6 is where my background in autonomous serverless systems shows up. Hamel teaches you how to evaluate; I wire the eval into the same machinery that runs the product.
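As a rough illustration of the scheduled regression step, here is a minimal sketch of a Lambda-style handler that compares tonight's pass rate with the previous run and notifies on a drop. Everything here is an assumption for illustration: `make_handler`, the injected `run_suite`, `load_previous`, and `notify` callables, and the 0.05 default drop are hypothetical names and values, not the production code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RegressionReport:
    current: float    # pass rate from this run, 0.0 to 1.0
    previous: float   # pass rate from the last stored run
    regressed: bool


def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed; 0.0 for an empty run."""
    return sum(results) / len(results) if results else 0.0


def check_regression(results: list[bool], previous: float,
                     max_drop: float = 0.05) -> RegressionReport:
    """Flag a regression when the pass rate falls by more than max_drop."""
    current = pass_rate(results)
    return RegressionReport(current, previous, (previous - current) > max_drop)


def make_handler(run_suite: Callable[[], list[bool]],
                 load_previous: Callable[[], float],
                 notify: Callable[[str], None]):
    """Build a handler(event, context) from injected, hypothetical pieces."""
    def handler(event, context):
        report = check_regression(run_suite(), load_previous())
        if report.regressed:
            notify(f"Eval regression: {report.previous:.2f} -> {report.current:.2f}")
        return {"pass_rate": report.current, "regressed": report.regressed}
    return handler
```

Injecting the suite runner, the score lookup, and the notifier keeps the handler testable without AWS or Slack; in a real deployment those three callables would wrap the eval harness, the score store, and a webhook call.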

Side-by-side comparison

Dimension	Hamel Husain	Lazar Milicevic
Primary engagement	Consulting + courses, teams learn to build evals themselves	Build-and-handover: I ship the eval system running in your infra
Strongest signature	Error analysis, domain-specific assertions, anti-framework stance	Autonomous eval loops on AWS Lambda + EventBridge, RAG-specific eval patterns
Stack opinions	Tool-agnostic, often plain Python + spreadsheets early on	Postgres + pgvector for trace store, Python eval harness, Next.js dashboard
LLM-as-judge stance	Use sparingly, only after calibration with human labels	Same — calibrated judge with human-labeled spot checks every sprint
RAG evals	Covered in writing, generally framework-agnostic	I run hybrid search (RRF) and eval retrieval + generation as separate stages
CI integration	Recommended, varies by client	Always — eval thresholds gate the deploy, no exceptions
Deliverable	Trained team, eval playbook, dashboards	Running system, dashboard, alerts, eval set that self-updates from prod
Where each is stronger	Hamel's teaching artifacts and public writing are the gold standard	I move faster when the goal is "ship the autonomous system this quarter"

To be honest about this: if your goal is to build internal eval capability across a 20-person AI team, hire Hamel or take his course. If your goal is "we have a RAG pipeline shipping bad answers and we need it fixed and instrumented in six weeks," that's my lane.

Code patterns: the eval harness I reuse

Hamel's public examples lean toward notebook-driven exploration, which is right for the discovery phase. Mine are shaped by needing to run forever without me. Here's the skeleton I drop into nearly every project:

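# evals/run.py (requires Python 3.10+ for the `dict | None` annotation)
from dataclasses import dataclass
from typing import Callable
import time

@dataclass
class EvalCase:
    id: str
    input: dict
    expected: dict | None  # None when there is no reference answer
    tags: list[str]

@dataclass
class Check:
    # What a single assertion returns for one case
    passed: bool
    score: float
    reason: str

@dataclass
class EvalResult:
    case_id: str
    passed: bool
    score: float
    latency_ms: int
    reason: str

def run_suite(cases: list[EvalCase],
              system: Callable[[dict], dict],
              assertions: list[Callable[[EvalCase, dict], Check]]) -> list[EvalResult]:
    results = []
    for case in cases:
        # Call the system once per case, timed with a monotonic clock
        t0 = time.perf_counter()
        output = system(case.input)
        latency = int((time.perf_counter() - t0) * 1000)
        # One EvalResult per (case, assertion) pair
        for check in assertions:
            r = check(case, output)
            results.append(EvalResult(case.id, r.passed, r.score,
                                      latency, r.reason))
    return results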

The assertions are the interesting part. For a RAG system I typically run something like:

# Each assertion returns a Check(passed, score, reason)
def assert_citations_present(case, output):
    # At least one citation, and every citation carries a non-empty doc_id
    cites = output.get("citations", [])
    passed = len(cites) >= 1 and all(c.get("doc_id") for c in cites)
    return Check(passed, 1.0 if passed else 0.0,
                 f"found {len(cites)} citations")

def assert_faithfulness(case, output):
    # LLM-as-judge, calibrated to 0.85 agreement w/ human labels
    # load_rubric and judge are project-specific helpers (not shown here)
    rubric = load_rubric("faithfulness_v3")
    score = judge(output["answer"], output["retrieved_chunks"], rubric)
    return Check(score >= 0.7, score, f"faithfulness={score:.2f}")

The key detail Hamel hammers and I second: never trust an LLM judge you haven't calibrated against human labels. I keep a 50-sample human-labeled set per assertion and re-check inter-rater agreement every time I touch the rubric. If judge-vs-human Cohen's kappa drops below 0.7, the rubric goes back on the bench.
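For readers who want to reproduce the calibration check, here is a minimal sketch of Cohen's kappa for judge-vs-human labels, written in plain Python. The function name and the label values are illustrative assumptions; only the 0.7 cut-off comes from the text above.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: share of items where both raters gave the same label
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement by chance, from each rater's label frequencies
    human_freq, judge_freq = Counter(human), Counter(judge)
    expected = sum(human_freq[label] * judge_freq[label]
                   for label in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical 10-sample calibration set
human_labels = ["pass"] * 5 + ["fail"] * 5
judge_labels = ["pass"] * 4 + ["fail"] + ["fail"] * 4 + ["pass"]
rubric_needs_rework = cohens_kappa(human_labels, judge_labels) < 0.7
```

Kappa corrects raw agreement for chance, which matters when one label dominates: a judge that always says "pass" can show high raw agreement on a mostly-passing set while carrying no signal.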

RAG evaluation: where I go deeper than the generic playbook

Most general eval advice treats the LLM as a black box. With RAG you have to split the eval into retrieval and generation, because a wrong answer with perfect retrieval and a wrong answer with broken retrieval need different fixes.

My standard RAG eval stack:

Retrieval metrics (per query):

Hit@k: did the right chunk make it into the top-k?
MRR: where in the ranking did it land?
Coverage: of the facts needed to answer, how many were retrievable at all?

Generation metrics (given retrieved context):

Faithfulness: does every claim trace back to a chunk?
Answer relevance: does the answer address the question?
Citation accuracy: do the cited chunks actually support the claim?

I run these on a frozen eval set of 150-300 query/answer pairs, hand-built with the domain expert. The frozen set never changes. A second set — the "live" set — grows from production failures and gets re-labeled monthly.
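The two ranking metrics above are simple enough to compute by hand. A minimal sketch, assuming each query comes with a ranked list of retrieved chunk ids and a set of ids labeled relevant (the function names and data shapes are illustrative, and coverage is left out because it needs a fact-to-chunk mapping):

```python
def hit_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """1.0 if any relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in relevant_ids for doc in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant chunk; 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(runs: list[tuple[list[str], set[str]]],
                       k: int = 5) -> dict[str, float]:
    """Average Hit@k and MRR over (ranked_ids, relevant_ids) pairs."""
    n = len(runs)
    if n == 0:
        return {"hit_at_k": 0.0, "mrr": 0.0}
    return {
        "hit_at_k": sum(hit_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }
```

Reporting both is useful because they fail differently: Hit@k says whether the right chunk was reachable at all, while MRR shows whether it sat near the top or was buried at position k.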

A concrete number from one of my pipelines: switching from pure vector search to hybrid (BM25 + dense, fused with Reciprocal Rank Fusion) lifted Hit@5 from 0.71 to 0.89 on a 240-query eval set. Without the eval set I would have just had a feeling that it was better. With it I had a number to defend the architecture change.
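For context on the fusion step, here is a minimal sketch of Reciprocal Rank Fusion over any number of ranked lists. The constant k = 60 is the commonly used default, not necessarily the value used in the pipeline described above, and the function name is illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a lexical and a dense retriever
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

RRF only looks at rank positions, so it sidesteps the problem of BM25 scores and cosine similarities living on incomparable scales.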

CI gates and the unsexy plumbing

Here's the part nobody writes about. Evals that don't gate deploys are decoration. My setup:

GitHub Actions runs the fast eval suite (code-based assertions, ~30s) on every PR.
A nightly job on AWS Lambda runs the full suite including LLM-judge assertions (~8 min, costs ~$0.40 in API calls).
Scores get written to Postgres. A small Next.js dashboard plots them over time.
Slack alert if any P0 assertion drops more than 5% week-over-week.
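A deploy gate can be as small as a function that compares scores with minimums and returns an exit code. This is a sketch under stated assumptions: the assertion names and threshold values are hypothetical, and a real setup would load scores from wherever the suite writes them.

```python
import sys


def gate(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one readable failure per assertion below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        score = scores.get(name)
        if score is None:
            # A missing score blocks the deploy rather than passing silently
            failures.append(f"{name}: no score reported")
        elif score < minimum:
            failures.append(f"{name}: {score:.2f} < {minimum:.2f}")
    return failures


def main(scores: dict[str, float], thresholds: dict[str, float]) -> int:
    """Exit code for CI: 0 lets the deploy through, 1 blocks it."""
    failures = gate(scores, thresholds)
    for line in failures:
        print(f"EVAL GATE FAILED - {line}", file=sys.stderr)
    return 1 if failures else 0
```

In a CI step this would typically end with `sys.exit(main(scores, thresholds))`, since a non-zero exit code is what fails the job.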

The cost matters. A team I worked with last year was running their full eval suite on every PR and burning ~$200/day in OpenAI calls. We split it into fast (deterministic) and slow (judged) tiers, cut the cost by 90%, and caught regressions just as fast.

Anthropic's own guidance on evals makes the same point: start with code-graded tests, add model-graded only where you must. That matches Hamel's framing and mine.

Where we genuinely disagree

Two places, and I'll be specific.

1. Framework adoption timing. Hamel is famously skeptical of eval frameworks early on, and rightly so — they hide the data from you. I'm slightly more pragmatic: once a team has done the manual error analysis and knows what they're measuring, I'm fine reaching for something like Braintrust or a lightweight in-house harness sooner, because the alternative is the eval suite living in someone's notebook and dying when they leave. The key is after the error analysis, not before.

2. Dashboard priority. Hamel emphasizes traces and spreadsheets; dashboards come later. I build the dashboard early, because the founders and PMs I work with won't open a notebook but they will check a URL. It's the same data — just rendered for the person paying the bill.

Neither of these is a real fight. They're tuning knobs based on who's signing the check.

What I'd do if I were starting a new LLM project tomorrow

Read Hamel's posts before you write a line of code. Seriously. The error-analysis discipline is the highest-leverage thing in this entire field.
Log every trace from commit one to Postgres with a trace_id. You will thank yourself.
Do not write an LLM-as-judge until you've hand-labeled 100 traces. You don't yet know what "good" means.
Build the boring code-assertion layer first. It catches 60-70% of regressions for almost no cost.
Freeze a golden set early and protect it like production data. It's your North Star.
Wire evals into CI before you ship to a single user. Retrofitting this later is miserable.
Budget for evals. Plan ~10-15% of LLM API spend on the eval suite. If it's lower, you probably aren't running them enough.

That's it. The whole game is taking these unglamorous steps seriously while everyone else is chasing the next model release.

Close

If you're building an LLM product and the eval story is "we look at outputs and feel okay," you're one model update away from a bad week. Hamel's writing is the best free education on this topic — start there. If you'd rather have someone come in, build the eval harness, wire it into your CI, and hand it over running on your AWS account, that's the work I do. Reach out at lazar-milicevic.com/#contact or browse the rest of the blog for how I think about autonomous AI systems in production.

Frequently asked questions

What's the difference between Hamel Husain's and Lazar Milicevic's approach to LLM evals?

Hamel and I agree on the fundamentals: error analysis, looking at data, and domain-specific assertions before reaching for LLM-as-judge or fancy frameworks. The difference is the deliverable. Hamel works as an independent consultant teaching teams to build their own eval muscle, so his output is a trained team and a playbook. I get hired to ship the autonomous system end-to-end, so my output is a running eval harness with CI gates, scheduled regression runs, a dashboard, and a golden set that self-updates from production traces.

What does a production-ready LLM eval pipeline actually look like?

The pipeline I deploy has six stages: (1) trace logging from day one — every LLM call, tool call, and retrieved chunk written to Postgres with a trace_id; (2) manual error analysis on the first 100-200 traces with a domain expert; (3) code-based assertions first (regex, JSON schema, length bounds, citation checks) because they're cheap and deterministic; (4) LLM-as-judge only for fuzzy criteria like tone or faithfulness, always with a calibrated rubric; (5) a golden set that grows automatically from confirmed production failures; (6) scheduled regression runs via EventBridge that Slack-alert on drops. Step 6 is what turns evals from a one-off project into an autonomous safety net.

When should I use LLM-as-judge versus code-based assertions for evals?

Code-based assertions should always be your first choice — regex, JSON schema validation, length bounds, forbidden-phrase checks, and citation presence are cheap, deterministic, and run fast in CI. Use LLM-as-judge only for genuinely fuzzy criteria like tone, faithfulness to source, or instruction following where deterministic checks can't capture the requirement. When you do use a judge, always pair it with a written rubric and calibrate it against human labels every sprint, otherwise you're just outsourcing your vibe-check to another model. Most teams reach for LLM-as-judge too early and skip the deterministic layer that would have caught 70% of their failures for free.
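To make the deterministic layer concrete, here is a minimal sketch of three such checks using only the standard library. The forbidden phrases, required keys, and function names are illustrative assumptions, and the JSON check only verifies required keys rather than a full JSON Schema.

```python
import json
import re

# Hypothetical forbidden phrases; a real list comes out of error analysis
FORBIDDEN = re.compile(r"\b(as an ai language model|i cannot help)\b",
                       re.IGNORECASE)


def check_valid_json(raw: str, required_keys: set[str]) -> tuple[bool, str]:
    """Output must parse as a JSON object containing every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top-level JSON value is not an object"
    missing = required_keys - set(data)
    return (not missing, f"missing keys: {sorted(missing)}" if missing else "ok")


def check_length(text: str, min_chars: int, max_chars: int) -> tuple[bool, str]:
    """Answer length must fall inside the allowed character bounds."""
    n = len(text)
    return (min_chars <= n <= max_chars, f"length={n}")


def check_no_forbidden(text: str) -> tuple[bool, str]:
    """Fail when the answer contains any forbidden phrase."""
    match = FORBIDDEN.search(text)
    return (match is None,
            f"forbidden phrase: {match.group(0)!r}" if match else "ok")
```

Each check returns a pass flag plus a reason string, so a failing CI run says what broke instead of just that something did.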

Should I hire a consultant to teach my team evals or someone to build the eval system for me?

It depends on your goal and timeline. If you're building internal AI capability across a larger team and want long-term ownership, hire a consultant like Hamel Husain or take his course — his teaching artifacts on error analysis and assertion-driven evals are the public gold standard. If you have a production system shipping bad answers and need it fixed, instrumented, and running autonomously within a quarter, you want a build-and-handover engagement that delivers the running infrastructure, not a playbook. These are different jobs with different artifacts, not competing methodologies.

How should I evaluate a RAG pipeline specifically?

Evaluate retrieval and generation as two separate stages, because conflating them hides where failures actually originate. For retrieval, I measure Hit@k, MRR, and coverage against a labeled query-chunk set, and I use hybrid search with reciprocal rank fusion (RRF) combining vector and lexical results. For generation, I run faithfulness checks (does every claim trace back to a retrieved chunk), answer-relevance checks, and citation-accuracy checks. Store every trace in Postgres with a trace_id so you can replay any failure, and turn every confirmed production failure into a permanent golden-set test case so the same bug never ships twice.
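The trace store and the failure-to-golden-set promotion can be sketched in a few functions. To keep the example runnable without a database server it uses the standard-library sqlite3 module as a stand-in for Postgres; the table names, column names, and function names are all hypothetical.

```python
import json
import sqlite3
import uuid


def init_store(conn: sqlite3.Connection) -> None:
    """Create the trace table and the golden-set table if missing."""
    conn.execute("""CREATE TABLE IF NOT EXISTS traces (
        trace_id TEXT PRIMARY KEY,
        input TEXT NOT NULL,
        output TEXT NOT NULL,
        retrieved_chunks TEXT NOT NULL)""")
    conn.execute("""CREATE TABLE IF NOT EXISTS golden_cases (
        trace_id TEXT PRIMARY KEY REFERENCES traces(trace_id),
        failure_category TEXT NOT NULL)""")


def log_trace(conn: sqlite3.Connection, input_: dict, output: dict,
              chunks: list[str]) -> str:
    """Persist one call with everything needed to replay it later."""
    trace_id = str(uuid.uuid4())
    conn.execute("INSERT INTO traces VALUES (?, ?, ?, ?)",
                 (trace_id, json.dumps(input_), json.dumps(output),
                  json.dumps(chunks)))
    return trace_id


def promote_to_golden(conn: sqlite3.Connection, trace_id: str,
                      failure_category: str) -> None:
    """Turn a confirmed production failure into a permanent test case."""
    conn.execute("INSERT OR IGNORE INTO golden_cases VALUES (?, ?)",
                 (trace_id, failure_category))


def load_golden_inputs(conn: sqlite3.Connection) -> list[dict]:
    """Inputs of every golden case, ready to feed back into the eval suite."""
    rows = conn.execute("""SELECT t.input FROM golden_cases g
                           JOIN traces t ON t.trace_id = g.trace_id""").fetchall()
    return [json.loads(row[0]) for row in rows]
```

Keying the golden set on trace_id means a promoted case always points back to the exact inputs, outputs, and retrieved chunks that failed, and promoting the same trace twice is a no-op.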

Lazar Milicevic

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.
