LLM-as-a-Judge: Build, Calibrate, Trust It

By Lazar Milicevic · June 25, 2026 · 10 min read

The first time I shipped an LLM-as-a-judge, it gave me a 4.6/5 average on outputs that, when I actually read them, were maybe a 3. The judge loved verbose, hedged answers. My users hated them. That gap — between what the model scores and what a human would score — is the whole problem with this technique, and the reason most teams quietly stop trusting their evals after a month.

This post is the recipe I now use to build judges I can actually defend in a review. It includes Cohen's kappa numbers from a 500-sample run we did on a RAG support agent earlier this year, the calibration steps that moved kappa from 0.31 to 0.74, and the code you need to reproduce it.

What LLM-as-a-judge is, and the three places it fails

LLM-as-a-judge is using a strong model (Claude Opus, GPT-4o, etc.) to score outputs from another model against a rubric, instead of paying humans to label every sample. It's how teams scale evaluation past a few hundred examples. Done right, a judge correlates well with human ratings and gives you a cheap, fast signal for regression testing, A/B prompt changes, and gating deploys.

It fails in three specific ways, and you should assume all three until you measure otherwise:

  1. Position bias. Given two answers A and B in a pairwise comparison, the judge prefers whichever you showed first. A 2023 paper from Zheng et al. ("Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena") documented this at roughly 60% preference for position 1 on weaker judges. I see it consistently with GPT-4-class models too.
  2. Verbosity bias. Longer, more confident answers score higher even when they're wrong. This is the one that burned me.
  3. Self-preference. A judge model rates outputs from its own family higher. Don't let GPT-4 judge GPT-4 outputs if you can avoid it.

The fix is not a better prompt alone. It's a measurable calibration loop with a human-labeled gold set, and that's what the rest of this post is about.
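One quick way to check whether verbosity bias is present is to take the answers humans failed and compare the length of the ones the judge let through against the ones it caught. The sketch below assumes each sample is a dict with hypothetical keys `answer`, `human`, and `judge`; the field names and the toy data are illustrative, not taken from the run described in this post.

```python
from statistics import mean

def verbosity_gap(samples):
    """Mean word count of human-FAIL answers the judge passed, minus the
    mean word count of human-FAIL answers the judge also failed.
    A large positive gap hints that the judge is rewarding length."""
    missed = [len(s["answer"].split()) for s in samples
              if s["human"] == "FAIL" and s["judge"] == "PASS"]
    caught = [len(s["answer"].split()) for s in samples
              if s["human"] == "FAIL" and s["judge"] == "FAIL"]
    if not missed or not caught:
        return None  # one of the two groups is empty, nothing to compare
    return mean(missed) - mean(caught)

# Toy data (made up): long wrong answers slip through, short ones get caught.
toy = [
    {"answer": "word " * 120, "human": "FAIL", "judge": "PASS"},
    {"answer": "word " * 100, "human": "FAIL", "judge": "PASS"},
    {"answer": "word " * 30, "human": "FAIL", "judge": "FAIL"},
    {"answer": "word " * 50, "human": "FAIL", "judge": "FAIL"},
    {"answer": "word " * 40, "human": "PASS", "judge": "PASS"},
]
gap = verbosity_gap(toy)
print(gap)  # 70 words longer on average for the failures the judge missed
```

A gap near zero does not prove the judge is unbiased, but a large one is a cheap early warning.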

Step 1: Build a 100-200 sample gold set before you touch a judge prompt

You cannot calibrate without ground truth. The first time, skip the judge entirely and label 100-200 outputs by hand. I use 150 as my default. Smaller than 100 and your kappa confidence intervals are too wide to be useful; larger than 200 is diminishing returns for the first pass.

Rules I follow when building the gold set:

  • Stratified sampling: pull from real production traffic, not synthetic prompts. Bucket by intent (e.g. for a support agent: billing, technical, account, returns) and sample proportionally.
  • Two independent labelers minimum. I label half, a domain expert labels half, we both label a 30-sample overlap. The overlap is what tells me whether the humans even agree on the rubric.
  • Binary or 3-point scales, not 5-point. Humans can't reliably distinguish a 3 from a 4 on a Likert scale. Inter-rater agreement collapses. Use Pass/Fail, or Pass/Borderline/Fail. You can always aggregate later.
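The proportional sampling rule can be sketched roughly as follows. The `intent` field, the bucket sizes, and the `stratified_sample` helper are all assumptions for illustration; the only idea taken from the rules above is that each intent gets a share of the gold set matching its share of traffic.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(traces, n, key="intent", seed=0):
    """Sample roughly n traces, giving each bucket a number of slots
    proportional to its share of traffic (at least one per bucket)."""
    rng = random.Random(seed)  # fixed seed so the gold set is reproducible
    buckets = defaultdict(list)
    for trace in traces:
        buckets[trace[key]].append(trace)
    total = len(traces)
    sample = []
    for items in buckets.values():
        k = max(1, round(n * len(items) / total))
        k = min(k, len(items))  # never ask for more than the bucket holds
        sample.extend(rng.sample(items, k))
    return sample

# Hypothetical traffic mix: 50% billing, 30% technical, 15% account, 5% returns.
traffic = (
    [{"id": i, "intent": "billing"} for i in range(500)]
    + [{"id": i, "intent": "technical"} for i in range(500, 800)]
    + [{"id": i, "intent": "account"} for i in range(800, 950)]
    + [{"id": i, "intent": "returns"} for i in range(950, 1000)]
)
gold_candidates = stratified_sample(traffic, n=150)
counts = Counter(t["intent"] for t in gold_candidates)
print(counts)
```

Because of rounding, the total can land a sample or two away from `n`; for a gold set that is harmless.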

On my 30-sample human overlap for the support agent eval, our initial Cohen's kappa between the two human labelers was 0.68. I treat that as the rough ceiling: you should not expect a judge to agree with humans meaningfully more than the humans agree with each other on the same rubric, and with only 30 samples the estimate itself is noisy. If your humans only agree at kappa 0.5, your rubric is the problem, not your judge.
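To get a feel for how much a kappa computed on a 30-sample overlap can move, a percentile bootstrap is one option. This is a minimal sketch in plain Python: `cohen_kappa` and `bootstrap_kappa_ci` are helper names I made up, and the two label lists are toy data built to give a kappa of 0.70, not the actual labels from the overlap described above.

```python
import random
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label lists.
    Returns None when chance agreement is 1 (kappa is undefined)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[label] * cb[label] for label in ca) / (n * n)  # chance agreement
    if p_e == 1:
        return None
    return (p_o - p_e) / (1 - p_e)

def bootstrap_kappa_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for kappa."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs with replacement
        k = cohen_kappa([a[i] for i in idx], [b[i] for i in idx])
        if k is not None:  # skip degenerate resamples
            stats.append(k)
    stats.sort()
    low = stats[int(alpha / 2 * len(stats))]
    high = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return low, high

# Toy overlap: 30 samples, the labelers agree on 26 of them.
labeler_a = ["PASS"] * 20 + ["FAIL"] * 10
labeler_b = ["PASS"] * 18 + ["FAIL"] * 2 + ["FAIL"] * 8 + ["PASS"] * 2

point = cohen_kappa(labeler_a, labeler_b)
lo, hi = bootstrap_kappa_ci(labeler_a, labeler_b)
print(f"kappa={point:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

On samples this small the interval is wide, which is worth keeping in mind before reading too much into a difference of a few hundredths between two kappa values.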

Step 2: Write the judge prompt around failure modes, not virtues

Most judge prompts I see in the wild ask vague things like "rate this answer for helpfulness from 1-5". That's how you get the 4.6/5 average. Helpful is undefined, so the model picks the easiest proxy: length and tone.

The judge prompt that finally worked for me does three things:

  1. Defines failure first. I list, in order of severity, what a failed answer looks like. Hallucinated citation, wrong tool call, missing escalation, ignored constraint.
  2. Forces chain-of-thought before the verdict. The judge must write its reasoning before emitting the score. Inverting this order — score first, justify after — drops kappa by ~0.15 in my tests because the model anchors on the score.
  3. Returns structured JSON with per-criterion scores, not one overall number. Then I aggregate in code, not in the prompt.

Here's the skeleton:

JUDGE_PROMPT = """You are evaluating a customer support agent's answer.

GOLD ANSWER (from a senior agent):
{gold}

AGENT ANSWER (under evaluation):
{candidate}

Evaluate against these failure modes, in order:
1. FACTUAL_ERROR: Does the answer contradict the gold answer on any verifiable fact?
2. MISSING_INFO: Does it omit a critical step or condition present in the gold?
3. UNSAFE_ACTION: Does it recommend an action the gold answer warns against?
4. STYLE: Tone, length, clarity. (Lowest priority - never let style override 1-3.)

For each, output PASS or FAIL with one sentence of evidence quoting the answer.
Then output an OVERALL verdict: PASS only if 1, 2, and 3 are all PASS.

Reason step by step before the JSON. Return:
{{
  "reasoning": "...",
  "factual_error": "PASS|FAIL",
  "factual_error_evidence": "...",
  "missing_info": "PASS|FAIL",
  "missing_info_evidence": "...",
  "unsafe_action": "PASS|FAIL",
  "unsafe_action_evidence": "...",
  "style": "PASS|FAIL",
  "style_evidence": "...",
  "overall": "PASS|FAIL"
}}
"""

Notice what's not in there: no 1-5 scale, no "helpfulness", no "is this a good answer". The judge's job is to detect specific failures against a reference, not to render aesthetic judgment.
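Aggregating in code means the judge's own "overall" field is treated as advisory and recomputed from the three blocking criteria. A minimal sketch of that, assuming the judge replies with free-text reasoning followed by the JSON object from the skeleton above (`parse_verdict` and the sample reply are hypothetical):

```python
import json

# Criteria that decide the overall verdict; style never overrides them.
BLOCKING = ("factual_error", "missing_info", "unsafe_action")

def parse_verdict(raw: str) -> dict:
    """Find the verdict JSON inside the judge's reply and recompute
    'overall' in code instead of trusting the model's own aggregation."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(raw):
        if ch != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(raw, i)  # try to parse JSON starting here
        except json.JSONDecodeError:
            continue  # a stray brace in the reasoning text, keep scanning
        if isinstance(obj, dict) and all(k in obj for k in BLOCKING):
            passed = all(obj[k] == "PASS" for k in BLOCKING)
            obj["overall"] = "PASS" if passed else "FAIL"
            return obj
    raise ValueError("no verdict JSON found in judge output")

# Made-up reply where the model's own 'overall' contradicts its criteria.
raw = '''The answer quotes a {30-day} refund window, which matches the gold answer,
but it never mentions the receipt requirement.
{"reasoning": "Facts match; receipt step is missing.",
 "factual_error": "PASS", "missing_info": "FAIL",
 "unsafe_action": "PASS", "style": "PASS", "overall": "PASS"}'''

verdict = parse_verdict(raw)
print(verdict["overall"])  # FAIL, because missing_info failed
```

Raising on unparseable output, rather than defaulting to PASS, keeps a malformed reply from silently counting as a good answer.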

Step 3: Measure agreement with Cohen's kappa, not accuracy

Raw accuracy lies. If 80% of your gold set is PASS and the judge says PASS to everything, you get 80% "accuracy" and zero signal. Cohen's kappa corrects for chance agreement and is the right metric here.
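Worked through with the usual definition of kappa, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance from each rater's label frequencies, the always-PASS judge from the example above comes out as:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad
p_o = 0.80, \qquad
p_e = \underbrace{0.80 \times 1.00}_{\text{both PASS}}
    + \underbrace{0.20 \times 0.00}_{\text{both FAIL}} = 0.80
\quad\Longrightarrow\quad
\kappa = \frac{0.80 - 0.80}{1 - 0.80} = 0
```

The judge matches the humans on 80% of samples, but all of that agreement is what chance alone would produce, so kappa is zero.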

from sklearn.metrics import cohen_kappa_score, confusion_matrix

human_labels  = [...]  # 150 PASS/FAIL from your gold set
judge_labels  = [...]  # 150 PASS/FAIL from the judge on same samples

kappa = cohen_kappa_score(human_labels, judge_labels)
cm = confusion_matrix(human_labels, judge_labels, labels=["PASS", "FAIL"])
print(f"Kappa: {kappa:.3f}")
print(cm)

Reading kappa (Landis & Koch, 1977 — still the standard reference):

| Kappa | Interpretation | Ship it? |
| --- | --- | --- |
| ≤ 0.20 | Slight | No, your judge is noise |
| 0.21-0.40 | Fair | No, but you can debug |
| 0.41-0.60 | Moderate | Maybe, for low-stakes signals |
| 0.61-0.80 | Substantial | Yes, for most production use |
| 0.81-1.00 | Almost perfect | Rare, suspect overfitting |

For reference, on the 500-sample run for the support agent — 150 gold + 350 production samples spot-checked — my numbers across iterations:

| Iteration | Change | Kappa | False PASS rate |
| --- | --- | --- | --- |
| v1 | Likert 1-5, single "helpfulness" score | 0.31 | 38% |
| v2 | Switched to PASS/FAIL with failure modes | 0.52 | 19% |
| v3 | Added required reasoning before verdict | 0.61 | 14% |
| v4 | Two-judge ensemble (Claude + GPT-4o), tiebreak by stricter | 0.74 | 6% |

That v1 to v4 jump is what calibration buys you. None of it required a better base model — it required a better protocol.
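The post does not spell out how the False PASS rate column is computed. The sketch below assumes one common reading: of the samples humans labeled FAIL, the share the judge marked PASS. If you define it over all judge PASS verdicts instead, the denominator changes. The labels are toy data.

```python
def false_pass_rate(human, judge):
    """Share of human-labeled FAIL samples that the judge marked PASS
    (an assumed definition, see the note above)."""
    judge_on_fails = [j for h, j in zip(human, judge) if h == "FAIL"]
    if not judge_on_fails:
        return 0.0  # no human FAILs, so nothing could be falsely passed
    return sum(j == "PASS" for j in judge_on_fails) / len(judge_on_fails)

# Toy labels: 4 human FAILs, the judge lets 1 of them through.
human = ["PASS"] * 6 + ["FAIL"] * 4
judge = ["PASS"] * 6 + ["PASS", "FAIL", "FAIL", "FAIL"]
rate = false_pass_rate(human, judge)
print(rate)  # 0.25
```

Tracking this next to kappa is useful because the two errors are not equally costly: a false PASS ships a bad answer, while a false FAIL only costs a human review.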

Step 4: Use a two-judge ensemble for anything that gates deploys

The last big jump in my numbers, from kappa 0.61 to 0.74, came from running two judges from different model families and treating disagreements as automatic FAILs (or routing them to a human queue).

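def ensemble_judge(gold, candidate):
    """Return (verdict, reason). Any disagreement between the judges is a FAIL."""
    # claude_judge / openai_judge are assumed to be thin wrappers that call
    # each provider with the judge prompt and return the parsed verdict dict.
    v1 = claude_judge(gold, candidate)   # Anthropic
    v2 = openai_judge(gold, candidate)   # OpenAI
    if v1["overall"] == v2["overall"]:
        return v1["overall"], "agreement"
    return "FAIL", "disagreement_routed_to_human"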

This kills self-preference bias because no single model family controls the verdict. It also gives you a cheap "uncertainty" signal — the disagreement rate. On my run, disagreement rate was 11% of samples, and those 11% were exactly the borderline cases worth a human's time. The other 89% the judges agreed on, and they agreed with humans 78% of the time on FAIL and 96% on PASS.

The cost: roughly 2x inference. For me that's a few dollars per eval run, which is nothing compared to the cost of shipping a regression.
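The disagreement rate falls out of the same loop that produces the verdicts. Here is a rough sketch with two stub judges standing in for the real model calls; `run_ensemble`, `strict_stub`, and `lenient_stub` are hypothetical names, and the stubs only mimic the shape of a verdict dict.

```python
def run_ensemble(samples, judge_a, judge_b):
    """Run two judges over (gold, candidate) pairs. A disagreement counts
    as FAIL and its index goes to a human review queue.
    Returns (verdicts, review_queue, disagreement_rate)."""
    verdicts, review_queue = [], []
    for i, (gold, candidate) in enumerate(samples):
        a = judge_a(gold, candidate)["overall"]
        b = judge_b(gold, candidate)["overall"]
        if a == b:
            verdicts.append(a)
        else:
            verdicts.append("FAIL")   # the stricter verdict wins
            review_queue.append(i)
    rate = len(review_queue) / len(samples) if samples else 0.0
    return verdicts, review_queue, rate

# Stub judges (stand-ins for real API calls): one demands an exact match,
# the other passes anything that contains the gold text.
def strict_stub(gold, candidate):
    return {"overall": "PASS" if candidate == gold else "FAIL"}

def lenient_stub(gold, candidate):
    return {"overall": "PASS" if gold.lower() in candidate.lower() else "FAIL"}

samples = [
    ("Refunds take 5 days", "Refunds take 5 days"),
    ("Refunds take 5 days", "Sure! Refunds take 5 days, usually."),
    ("Reset via settings", "Call support"),
    ("Reset via settings", "Reset via settings"),
]
verdicts, review_queue, rate = run_ensemble(samples, strict_stub, lenient_stub)
print(verdicts, review_queue, rate)
```

Only the second sample splits the judges, so it is the one a human would look at.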

Step 5: Re-calibrate on a schedule, not when something breaks

Judges drift. The underlying model gets updated, your product changes, edge cases shift. I re-label a 50-sample slice every month and recompute kappa. If kappa drops more than 0.10 from baseline, I rebuild the gold set and re-tune the prompt.
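The monthly check itself is only a few lines. A small sketch, assuming you store one kappa per monthly re-label; the month keys and kappa values below are invented for illustration.

```python
def kappa_drift(baseline, monthly, max_drop=0.10):
    """Return the months whose kappa fell more than max_drop below baseline."""
    return [month for month, k in monthly.items() if baseline - k > max_drop]

# Hypothetical history from three monthly 50-sample re-labels.
history = {"2026-03": 0.72, "2026-04": 0.70, "2026-05": 0.61}
flagged = kappa_drift(0.74, history)
print(flagged)  # ['2026-05']
```

On a 50-sample slice kappa is noisy, so a single flagged month is a prompt to look closer, while two in a row is a stronger signal.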

I also log every judge verdict with its reasoning to a Postgres table so I can audit later. When a customer complaint comes in, the first thing I check is what the judge said on similar samples that week. About a third of the time, the judge had already flagged the failure mode and the alert threshold was wrong — not the judge.

CREATE TABLE judge_verdicts (
  id BIGSERIAL PRIMARY KEY,
  trace_id TEXT NOT NULL,
  candidate_model TEXT NOT NULL,
  judge_model TEXT NOT NULL,
  overall TEXT NOT NULL,
  per_criterion JSONB NOT NULL,
  reasoning TEXT NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX ON judge_verdicts (trace_id);
CREATE INDEX ON judge_verdicts (created_at);

That table has saved me twice already this year. Once for a real regression, once for a customer who was wrong about what the agent had said.
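Writing to that table from Python could look roughly like this. The `%s` placeholders follow the style used by psycopg, which is my assumption about the driver; `verdict_row` is a hypothetical helper, and the trace ID and model names in the example are placeholders.

```python
import json

# Parameterized insert matching the judge_verdicts columns above.
# The ::jsonb cast lets the per-criterion dict travel as a JSON string.
INSERT_SQL = (
    "INSERT INTO judge_verdicts "
    "(trace_id, candidate_model, judge_model, overall, per_criterion, reasoning) "
    "VALUES (%s, %s, %s, %s, %s::jsonb, %s)"
)

def verdict_row(trace_id, candidate_model, judge_model, verdict):
    """Turn a parsed verdict dict into the parameter tuple for INSERT_SQL."""
    per_criterion = {k: v for k, v in verdict.items()
                     if k not in ("overall", "reasoning")}
    return (
        trace_id,
        candidate_model,
        judge_model,
        verdict["overall"],
        json.dumps(per_criterion),
        verdict["reasoning"],
    )

row = verdict_row(
    "trace-123", "candidate-model", "judge-model",
    {"reasoning": "Receipt step is missing.", "factual_error": "PASS",
     "missing_info": "FAIL", "unsafe_action": "PASS", "style": "PASS",
     "overall": "FAIL"},
)
print(row)
# With an open DB-API cursor, the write would be: cur.execute(INSERT_SQL, row)
```

Keeping the reasoning in its own column, separate from the JSONB scores, makes it easy to search the text when auditing a complaint.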

The three mistakes I see most often

  • Trusting the judge before measuring kappa. I've reviewed eval pipelines where teams have been shipping based on judge scores for six months and have never once labeled a gold set. The judge could be a random number generator. They wouldn't know.
  • Pairwise comparison without position-swapping. If you do A-vs-B judging, run every pair twice with positions swapped, and only count a "win" if it wins both orderings. This eliminates position bias at the cost of 2x inference. Worth it.
  • Optimizing the judge to agree with itself across runs. I've seen teams reduce judge temperature to 0 and call it "consistent". It is consistent — consistently biased in the same direction. Lower temperature reduces variance, not bias. The fix for bias is the rubric and the gold set, not the sampling parameter.
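The position-swapping rule from the second mistake can be sketched as below. The judge interface here (a callable returning "first" or "second") is a simplification I am assuming, and both stub judges are made up to show the two interesting cases.

```python
def swapped_pairwise(judge, question, answer_a, answer_b):
    """Judge the pair in both orders and count a win only if the same
    answer wins both times. Anything else is a TIE."""
    first_run = judge(question, answer_a, answer_b)    # A shown first
    second_run = judge(question, answer_b, answer_a)   # B shown first
    if first_run == "first" and second_run == "second":
        return "A"
    if first_run == "second" and second_run == "first":
        return "B"
    return "TIE"  # the verdict flipped with position, so it is not trusted

# Stub judge with pure position bias: always picks whatever came first.
def always_first(question, first, second):
    return "first"

# Stub judge with a real (if crude) preference: picks the longer answer.
def prefers_longer(question, first, second):
    return "first" if len(first) >= len(second) else "second"

print(swapped_pairwise(always_first, "q", "short", "a much longer answer"))    # TIE
print(swapped_pairwise(prefers_longer, "q", "short", "a much longer answer"))  # B
```

The share of pairs that come back as TIE is itself a rough measure of how position-sensitive the judge is.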

What I'd do if I were starting today

If you're standing up evals for an LLM app this week, here's the order I'd go in:

  1. Day 1-2: Label 150 production samples by hand with one other person. Compute human-human kappa. If it's below 0.6, fix the rubric before anything else.
  2. Day 3: Write a PASS/FAIL judge prompt structured around failure modes, not virtues. Run it on the 150 samples. Compute kappa.
  3. Day 4: Iterate the prompt until kappa hits 0.6+. Don't ship below that.
  4. Day 5: Add a second judge from a different family. Route disagreements to a human queue.
  5. Ongoing: log every verdict, re-calibrate monthly, treat the judge as a system you maintain — not a one-off prompt.

The thing I want to push back on hardest: a judge with kappa 0.3 is worse than no judge, because it gives you a false sense of coverage. Measure it. If it's not above 0.6, fix it or throw it out.

This is exactly the kind of evaluation infrastructure I build into the autonomous content and RAG systems I ship through BizFlowAI: judges that run on every output before publish, with the kappa numbers tracked over time so we know when the eval itself is drifting. If you're putting an LLM app into production and want a second pair of eyes on the eval setup, find me at lazar-milicevic.com/#contact. And if you want the longer version of the calibration playbook, there are more posts on the blog.

Frequently asked questions

What is LLM-as-a-judge and when should I use it?

LLM-as-a-judge is the practice of using a strong model (like Claude Opus or GPT-4o) to score outputs from another model against a rubric, instead of paying humans to label every sample. I use it to scale evaluation past a few hundred examples — for regression testing, A/B prompt comparisons, and gating deploys. Done right, the judge correlates well with human ratings and gives you a cheap, fast quality signal. Done wrong, it produces inflated scores that don't match user perception, which is why calibration against a human gold set is non-negotiable.

What are the main failure modes of LLM judges?

From my experience, there are three failures you should assume are present until you measure otherwise. First, position bias: in pairwise comparisons, judges prefer whichever answer was shown first (Zheng et al. measured ~60% preference for position 1 on weaker judges). Second, verbosity bias: longer, more confident answers score higher even when wrong. Third, self-preference: a judge rates outputs from its own model family higher, so don't let GPT-4 judge GPT-4 outputs if you can avoid it. Better prompts alone won't fix these — you need a measurable calibration loop against human-labeled data.

How big should my human-labeled gold set be for calibrating an LLM judge?

I default to 150 samples, with 100-200 as the practical range. Below 100 your Cohen's kappa confidence intervals get too wide to be useful, and above 200 you hit diminishing returns for a first calibration pass. Use stratified sampling from real production traffic (not synthetic prompts), bucketed by intent and sampled proportionally. Use at least two independent labelers with a 30-sample overlap so you can measure human-human agreement; that overlap kappa is roughly the ceiling you should expect your judge to reach.

Why should I use Cohen's kappa instead of accuracy to evaluate an LLM judge?

Raw accuracy is misleading because it doesn't correct for class imbalance. If 80% of your gold set is PASS and your judge says PASS to everything, you get 80% accuracy and zero actual signal. Cohen's kappa corrects for chance agreement, so it tells you whether the judge is really tracking the rubric or just guessing the majority class. I treat kappa above 0.6 as trustworthy for most production use, 0.4-0.6 as usable for low-stakes signals only, and below 0.4 as a sign the judge prompt or rubric needs rework.

How should I structure an LLM judge prompt to get reliable scores?

Skip vague rubrics like 'rate helpfulness 1-5' — those produce inflated averages because the model defaults to scoring length and tone. Instead, do three things: define specific failure modes in order of severity (factual error, missing info, unsafe action, then style), force chain-of-thought reasoning before the verdict (scoring first then justifying drops my kappa by ~0.15), and return structured JSON with per-criterion PASS/FAIL plus evidence quotes. Aggregate the overall verdict in code, not in the prompt. Use binary or 3-point scales because humans can't reliably distinguish a 3 from a 4 on a 5-point Likert.

Lazar Milicevic

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.
