RAG Evaluation Metrics: A 12-Point Checklist (2026)

By Lazar Milicevic · June 25, 2026 · 9 min read

Last month I spent two weeks debugging a RAG pipeline that scored 0.91 on answer relevancy and still hallucinated client account numbers in production. The metric was lying to me, or more accurately, I was asking it the wrong question. That experience pushed me to rebuild how I evaluate retrieval-augmented systems from scratch, and to benchmark the popular eval stacks against each other on the same corpus.

This is the checklist I now use before any RAG system ships, plus the numbers I got running five evaluation frameworks against an identical 500-query test set.

The 12 metrics that actually matter

A useful RAG evaluation covers three layers: retrieval, generation, and the seam between them. Most teams measure two and skip the third, which is where production failures hide. Here is the full checklist I run before a RAG goes live.

Retrieval layer

Context precision — of the top-k chunks returned, what share are actually relevant to the question.
Context recall — of all the relevant chunks in the corpus, what share appear in top-k.
Context entity recall — for entity-heavy queries (names, IDs, SKUs), how many required entities show up in retrieved context.
Retrieval MRR@10 — mean reciprocal rank of the first relevant chunk. Cheap to compute, catches ranking regressions fast.
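The rank-based retrieval metrics above are simple enough to compute without a framework. Here is a minimal sketch, assuming you have labeled relevant chunk IDs for each query; the function names are my own for illustration, and framework versions of context precision may be rank-weighted or LLM-judged rather than this plain share.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k retrieved chunks that are labeled relevant."""
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for cid in top_k if cid in relevant) / len(top_k)


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of all labeled-relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant chunk in the top-k, or 0.0 if none."""
    for rank, cid in enumerate(ranked_ids[:k], start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def mrr_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 10) -> float:
    """Mean reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ids, rel, k) for ids, rel in runs) / len(runs)
```

Because these need only chunk IDs and labels, they cost nothing in judge tokens and can run on every commit.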

Generation layer

Faithfulness / groundedness — share of claims in the answer that are supported by the retrieved context. This is the hallucination metric.
Answer relevancy — does the answer actually address the question asked, not a tangent.
Answer correctness — agreement with a gold answer, when one exists.
Citation accuracy — when the model cites chunk IDs, do the citations actually contain the cited claim.
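Faithfulness and citation accuracy both reduce to "what share of claims are supported by some evidence", with the judge as the expensive part. The sketch below makes the judge a pluggable callable; `Supports`, `CitedClaim`, and `naive_supports` are hypothetical names, and the substring matcher is only a toy stand-in for an LLM or NLI judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical judge signature: True when `evidence` supports `claim`.
Supports = Callable[[str, str], bool]


@dataclass
class CitedClaim:
    text: str
    chunk_ids: List[str]  # chunk IDs the model cited for this claim


def faithfulness_score(claims: List[str], contexts: List[str], supports: Supports) -> float:
    """Share of claims supported by at least one retrieved chunk."""
    if not claims:
        return 1.0  # an answer with zero claims is trivially faithful
    supported = sum(1 for c in claims if any(supports(c, ctx) for ctx in contexts))
    return supported / len(claims)


def citation_accuracy(claims: List[CitedClaim], chunks: Dict[str, str], supports: Supports) -> float:
    """Share of cited claims where a cited chunk actually supports the claim."""
    cited = [c for c in claims if c.chunk_ids]
    if not cited:
        return 0.0
    ok = sum(
        1
        for c in cited
        if any(cid in chunks and supports(c.text, chunks[cid]) for cid in c.chunk_ids)
    )
    return ok / len(cited)


def naive_supports(claim: str, evidence: str) -> bool:
    # Toy judge for illustration only: case-insensitive substring match.
    return claim.lower() in evidence.lower()
```

In a real pipeline the claim list would come from a decomposition step and `supports` would call a judge model, which is where the cost and the disagreement between frameworks come from.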

System layer (the one most teams skip)

Refusal rate on out-of-scope queries — does the system say "I don't know" when the corpus has no answer.
P95 end-to-end latency under realistic concurrency.
Cost per answered query in dollars, including embedding, retrieval, rerank, and generation tokens.
Drift over time — same eval set re-run weekly. If faithfulness drops 5 points week-over-week, something upstream changed.
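The system-layer numbers come from request logs rather than from a judge. A rough sketch, assuming each logged query records latency, total cost, whether the system refused, and whether the query was labeled out-of-scope; the `QueryLog` shape is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryLog:
    latency_s: float    # end-to-end latency in seconds
    cost_usd: float     # embedding + retrieval + rerank + generation cost
    refused: bool       # system answered "I don't know" or equivalent
    out_of_scope: bool  # query labeled as unanswerable from the corpus


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def oos_refusal_rate(logs: List[QueryLog]) -> float:
    """Share of out-of-scope queries that were correctly refused."""
    oos = [log for log in logs if log.out_of_scope]
    if not oos:
        return 0.0
    return sum(1 for log in oos if log.refused) / len(oos)


def cost_per_answered_query(logs: List[QueryLog]) -> float:
    """Total spend divided by the number of queries that got an answer."""
    answered = sum(1 for log in logs if not log.refused)
    if answered == 0:
        return float("inf")
    return sum(log.cost_usd for log in logs) / answered
```

Dividing total spend by answered queries, rather than all queries, keeps refusals from making the system look cheaper than it is.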

If you only have time for four, run faithfulness, context precision, refusal rate on out-of-scope, and citation accuracy. Those four catch about 80% of what bites you in production.

Benchmarking five RAG eval stacks on the same corpus

I wanted real numbers, not vendor claims. So I built a 500-query test set against a 12,000-document technical knowledge base (mixed PDFs, Markdown, HTML), held the retrieval pipeline constant (BGE-large embeddings, pgvector, RRF hybrid with BM25, cross-encoder rerank), held the generator constant (Claude Sonnet), and varied only the evaluation framework.

Each framework scored the same 500 (query, retrieved_context, answer) tuples. I then compared each framework's judgments to a human-labeled gold set on 100 of those 500 queries.

Eval stack	Faithfulness agreement w/ human	Context precision agreement	Cost per 500-query run	P95 eval latency / query	Notes
Ragas 0.2	0.87	0.82	$4.10	3.1s	Best default metrics, weak on long contexts
TruLens	0.84	0.79	$3.60	2.7s	Strong feedback functions, observability built in
DeepEval	0.81	0.85	$5.40	4.2s	Pytest-native, best DX for CI
Phoenix (Arize)	0.83	0.80	$3.20	2.4s	Great for trace-level debugging
Custom LLM-as-judge (Claude Sonnet + rubric)	0.91	0.88	$6.80	5.0s	Highest agreement, highest cost, most work

A few honest observations from this run.

No framework hits 0.95 agreement with humans on faithfulness. The best scores in this run were 0.87 (Ragas) and 0.91 (custom judge), and the other three landed between 0.81 and 0.84. If a vendor claims 99% accurate hallucination detection, ask which dataset.

Ragas and DeepEval disagree more than you'd expect. On about 14% of queries, one called the answer faithful and the other called it unfaithful. That's not noise — that's the prompt rubric. If you switch frameworks mid-project, recalibrate your thresholds. According to the Ragas 0.2 release notes, their faithfulness metric was reworked in 2025 to use a claim-decomposition approach, which is closer to how a careful human reviewer thinks.
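If you want to reproduce this kind of comparison, the arithmetic is small. The sketch below assumes each framework's score has already been thresholded into a binary faithful/unfaithful verdict per query; it shows raw agreement plus Cohen's kappa as a chance-corrected alternative. This is one reasonable way to define "agreement", not necessarily the exact calculation behind the table.

```python
from typing import Sequence


def agreement_rate(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Fraction of queries where two raters give the same verdict."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty verdict lists of equal length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Agreement corrected for what two raters would match on by chance."""
    observed = agreement_rate(a, b)
    p_a = sum(a) / len(a)  # share of "faithful" verdicts from rater a
    p_b = sum(b) / len(b)
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both raters gave one identical constant verdict
    return (observed - expected) / (1 - expected)


def disagreement_rate(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Fraction of queries where two frameworks give opposite verdicts."""
    return 1.0 - agreement_rate(a, b)
```

Running `disagreement_rate` on two frameworks before switching between them gives a quick read on how much your thresholds need to move.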

Phoenix is underrated for debugging. It won't beat Ragas on raw scoring, but the per-trace view is where I actually find why a metric dropped.

The benchmark numbers I'd publish under my name

These are the production-ready targets I aim for on a B2B knowledge-base RAG, gathered from across the systems I've shipped and validated against the public Stanford HELM benchmarks for grounding tasks.

Metric	Floor (don't ship below)	Target	Stretch
Faithfulness	0.85	0.92	0.96
Context precision @ 10	0.70	0.82	0.90
Context recall	0.75	0.88	0.94
Answer relevancy	0.80	0.90	0.95
Citation accuracy	0.85	0.93	0.97
Refusal rate on OOS queries	0.80	0.92	0.97
P95 latency (end-to-end, upper bound)	4.0s	2.5s	1.5s
Cost per query (upper bound)	$0.04	$0.015	$0.006

The two numbers most teams overlook are refusal rate on out-of-scope queries and citation accuracy. A model that always answers, even when the corpus doesn't contain the answer, will look brilliant on relevancy and be lethal in production. I won't ship below an 80% refusal rate on a held-out OOS set, and I build that set deliberately by writing queries the corpus cannot answer.
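The floor column of the table above translates directly into a deploy gate. A minimal sketch, with hypothetical metric key names; note that latency and cost are limits you must stay under, so they are checked as ceilings.

```python
from typing import Dict, List

# Floor values from the table above: do not ship below these.
FLOORS = {
    "faithfulness": 0.85,
    "context_precision": 0.70,
    "context_recall": 0.75,
    "answer_relevancy": 0.80,
    "citation_accuracy": 0.85,
    "oos_refusal_rate": 0.80,
}

# Latency and cost are "lower is better": do not ship above these.
CEILINGS = {
    "p95_latency_s": 4.0,
    "cost_per_query_usd": 0.04,
}


def gate_violations(metrics: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or out of bounds."""
    problems = []
    for name, floor in FLOORS.items():
        value = metrics.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif value < floor:
            problems.append(f"{name}: {value} is below floor {floor}")
    for name, ceiling in CEILINGS.items():
        value = metrics.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif value > ceiling:
            problems.append(f"{name}: {value} is above ceiling {ceiling}")
    return problems
```

Treating a missing metric as a violation is a deliberate choice here: a gate that silently passes when a metric was never computed is worse than no gate.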

How I actually run an evaluation in CI

The eval shouldn't live in a notebook. It lives in the pipeline. Here's the structure I use, framework-agnostic.

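# evals/test_rag_faithfulness.py
import json
from statistics import mean

import pytest
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision, answer_relevancy

from my_rag import answer  # project-specific RAG entry point


def load_gold_set(path):
    # Gold set is JSONL: one object per line with "question" and "gold_answer".
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


GOLD = load_gold_set("evals/gold_500.jsonl")


def mean_score(results, metric):
    # Depending on the Ragas version, results[metric] is either an aggregate
    # float or a list of per-row scores, so normalize to a single mean.
    value = results[metric]
    if isinstance(value, (int, float)):
        return float(value)
    return mean(v for v in value if v is not None)


@pytest.fixture(scope="session")
def results():
    # Run the pipeline once per session and score every gold query.
    # Column names follow the older Ragas schema; newer releases may expect
    # user_input / response / retrieved_contexts / reference instead.
    rows = []
    for q in GOLD:
        out = answer(q["question"])
        rows.append({
            "question": q["question"],
            "answer": out.text,
            "contexts": out.retrieved_chunks,  # list of chunk strings
            "ground_truth": q["gold_answer"],
        })
    return evaluate(
        Dataset.from_list(rows),
        metrics=[faithfulness, context_precision, answer_relevancy],
    )


def test_faithfulness_floor(results):
    score = mean_score(results, "faithfulness")
    assert score >= 0.85, f"faithfulness {score:.3f} is below the 0.85 floor"


def test_context_precision_floor(results):
    score = mean_score(results, "context_precision")
    assert score >= 0.70, f"context_precision {score:.3f} is below the 0.70 floor"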

That file runs on every PR that touches the retrieval or prompt code. The CI bill is real — about $4-6 per run on the 500-query set — so I gate the full suite to PRs labeled rag-change and run a 50-query smoke set on every commit. The smoke set is stratified: 10 easy lookups, 10 multi-hop, 10 entity-heavy, 10 numerical, 10 out-of-scope. It catches roughly 70% of regressions for 10% of the cost.
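Building the stratified smoke set is a few lines once each gold row carries a category label. A sketch, assuming a `category` field on every gold row and the five category names below; both are my own illustrative choices.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence

SMOKE_CATEGORIES = (
    "easy_lookup",
    "multi_hop",
    "entity_heavy",
    "numerical",
    "out_of_scope",
)


def build_smoke_set(
    gold: List[Dict],
    per_category: int = 10,
    seed: int = 0,
    categories: Sequence[str] = SMOKE_CATEGORIES,
) -> List[Dict]:
    """Sample a fixed number of gold queries from each category."""
    by_category = defaultdict(list)
    for row in gold:
        by_category[row["category"]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the smoke set stable across runs
    smoke = []
    for category in categories:
        pool = by_category.get(category, [])
        if len(pool) < per_category:
            raise ValueError(
                f"category {category!r} has {len(pool)} queries, need {per_category}"
            )
        smoke.extend(rng.sample(pool, per_category))
    return smoke
```

Fixing the seed matters more than it looks: if the smoke set changes between commits, a score change can come from the sample instead of the code.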

For trend monitoring, the same eval runs nightly against production traffic samples and pushes to a Postgres table. A Grafana panel watches for week-over-week drops greater than 3 points on faithfulness. According to the 2025 Anthropic evals cookbook, this kind of "shadow eval" against real traffic catches drift that synthetic test sets miss, particularly when your users start asking new categories of questions your gold set never covered.
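The alert rule itself is simple enough to test outside Grafana. A sketch of the week-over-week check, assuming you can pull per-query faithfulness scores for two consecutive weeks from wherever the nightly run stores them; the function names and the 3-point default are illustrative.

```python
from statistics import mean
from typing import Sequence


def week_over_week_drop(last_week: Sequence[float], this_week: Sequence[float]) -> float:
    """Drop in mean score, in points (a 0.91 to 0.87 move is about 4 points)."""
    # statistics.mean raises StatisticsError on an empty week, which is the
    # behavior I want: a week with no data should not pass silently.
    return (mean(last_week) - mean(this_week)) * 100


def should_alert(
    last_week: Sequence[float],
    this_week: Sequence[float],
    threshold_points: float = 3.0,
) -> bool:
    """True when the mean dropped by more than the threshold."""
    return week_over_week_drop(last_week, this_week) > threshold_points
```

An improvement produces a negative drop and never alerts, so the rule only fires on regressions.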

The traps that quietly inflate your scores

I've watched perfectly good metrics tell a story that wasn't true. Three failure modes show up again and again.

The judge is the same model as the generator. If Claude writes the answer and Claude grades the answer, your faithfulness score has a built-in bias toward Claude's prose style. I now use a different model family as judge — typically GPT-4o-class judging Claude output, or vice versa. Cross-family judging dropped my reported faithfulness by an average of 4 points across three projects, and that lower number was the true one.

The gold set leaks into the corpus. If you wrote the gold answers by reading the corpus, your test set is biased toward what's easy to retrieve. Always include a slice of gold queries written by domain users who have not seen the corpus. Those queries reveal the gap between what the system knows and what users actually ask.

Faithfulness on a non-answer is meaningless. "I don't have enough information" is trivially faithful because it makes zero claims. If your refusal rate is creeping up, faithfulness will rise along with it and look like an improvement. Always pair faithfulness with refusal rate and answer relevancy. A 0.96 faithfulness with 0.40 relevancy usually means the system is refusing a large share of queries rather than answering them well.
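One way to keep refusals from inflating the score is to report faithfulness over answered queries only, next to the refusal rate. A sketch, assuming each scored row carries a refusal flag; the `ScoredRow` shape and the report keys are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class ScoredRow:
    faithfulness: float
    relevancy: float
    refused: bool  # the system declined to answer


def paired_report(rows: List[ScoredRow]) -> Dict[str, Optional[float]]:
    """Faithfulness with and without refusals, plus the refusal rate."""
    if not rows:
        raise ValueError("need at least one scored row")
    answered = [r for r in rows if not r.refused]
    return {
        "refusal_rate": 1 - len(answered) / len(rows),
        "faithfulness_all": mean(r.faithfulness for r in rows),
        # None when everything was refused: there is nothing to be faithful about.
        "faithfulness_answered": (
            mean(r.faithfulness for r in answered) if answered else None
        ),
        "relevancy_all": mean(r.relevancy for r in rows),
    }
```

When `faithfulness_all` rises while `faithfulness_answered` stays flat, the gain came from refusing more, not from grounding better.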

A useful sanity check: the recent Microsoft Research paper on RAG evaluation (mid-2025) showed that LLM-as-judge scores correlate ~0.7-0.85 with human ratings on faithfulness, but drop to 0.5-0.6 on subjective qualities like "helpfulness." Treat objective metrics as signal, subjective ones as direction.

What changes when you move to agentic RAG

The 12-point checklist still applies, but two things shift when the system is an agent doing multi-step retrieval rather than a single-shot RAG.

First, you measure trajectory faithfulness, not just final-answer faithfulness. The agent might retrieve five times. Each retrieval-and-reasoning step needs to be faithful to what it found. A correct final answer built on a hallucinated intermediate step is a ticking bomb — it works until the question is one shade harder.

Second, tool-call precision becomes a metric. Did the agent call the right tool with the right arguments? On the BizFlowAI ContentStudio pipeline, this is what I monitor most closely: the content agent has maybe a dozen tools (search, fetch, embed, classify, draft, critique). When tool-call precision drops below 0.85, output quality collapses two days later. It's the leading indicator.

For agentic systems, I extend the checklist with: tool-call precision, trajectory length (shorter is usually better), recovery rate after a failed tool call, and cost-per-completed-task rather than cost-per-query.
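These four agentic metrics can be computed from labeled trajectories. A rough sketch, assuming each tool call has been labeled correct or incorrect against a reference and flagged when it errored; the `ToolCall` and `Trajectory` shapes are hypothetical, and "recovery" is defined here as a trajectory that hit a failed call and still completed the task.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToolCall:
    correct: bool         # right tool with the right arguments, per a labeled reference
    failed: bool = False  # the call raised or returned an error


@dataclass
class Trajectory:
    calls: List[ToolCall]
    completed: bool  # the task finished successfully
    cost_usd: float


def tool_call_precision(trajs: List[Trajectory]) -> float:
    """Share of all tool calls that were labeled correct."""
    calls = [c for t in trajs for c in t.calls]
    if not calls:
        return 0.0
    return sum(1 for c in calls if c.correct) / len(calls)


def mean_trajectory_length(trajs: List[Trajectory]) -> float:
    """Average number of tool calls per trajectory."""
    if not trajs:
        return 0.0
    return sum(len(t.calls) for t in trajs) / len(trajs)


def recovery_rate(trajs: List[Trajectory]) -> Optional[float]:
    """Share of trajectories with a failed call that still completed."""
    with_failure = [t for t in trajs if any(c.failed for c in t.calls)]
    if not with_failure:
        return None  # undefined when no call failed
    return sum(1 for t in with_failure if t.completed) / len(with_failure)


def cost_per_completed_task(trajs: List[Trajectory]) -> float:
    """Total spend, including failed runs, divided by completed tasks."""
    completed = sum(1 for t in trajs if t.completed)
    if completed == 0:
        return float("inf")
    return sum(t.cost_usd for t in trajs) / completed
```

Charging failed runs to the completed ones is what makes cost-per-completed-task more honest than cost-per-query for an agent.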

What I'd do if I were starting today

Pick Ragas for scoring and Phoenix for tracing, in that order. Add a custom Claude-or-GPT-judged rubric for your two or three highest-stakes metrics, where the extra cost is worth the extra agreement. Build a 50-query smoke set this week and a 500-query gold set in the first month. Gate deploys on the floor numbers, not the targets — floors prevent disasters, targets are aspirations.

Spend more time on the gold set than on the framework. The framework is mostly fungible; the eval data is the moat. A great test set is the single most underrated artifact in an LLM project, and it's the thing that lets you swap models, change prompts, or migrate vector stores without flying blind.

Run the eval in CI from day one, not after the first incident. Once a system is in production without an automated eval, every change becomes a guess, and guesses compound.

If you're building or auditing a RAG system and want a second pair of eyes — or a benchmark run against your own corpus — I take on a small number of engagements each quarter. You can reach me at lazar-milicevic.com/#contact, or browse more posts on the blog on agent evals, production LLMs, and serverless AI architecture.

Frequently asked questions

What RAG evaluation metrics actually matter before shipping to production?

I run a 12-metric checklist across three layers: retrieval (context precision, context recall, entity recall, MRR@10), generation (faithfulness, answer relevancy, answer correctness, citation accuracy), and system (out-of-scope refusal rate, P95 latency, cost per query, and drift over time). Most teams measure retrieval and generation but skip the system layer, which is exactly where production failures hide. If you only have time for four, run faithfulness, context precision, out-of-scope refusal rate, and citation accuracy — those catch roughly 80% of real-world failures. A high answer relevancy score alone is misleading; I've seen pipelines hit 0.91 relevancy while still hallucinating client account numbers.

Which RAG evaluation framework is most accurate: Ragas, TruLens, DeepEval, or Phoenix?

I benchmarked all four plus a custom Claude Sonnet judge on 500 queries over a 12,000-document corpus, then compared each framework's judgments with human labels on 100 of those queries. Among the off-the-shelf frameworks, Ragas 0.2 had the highest faithfulness agreement with humans at 0.87, followed by TruLens (0.84), Phoenix (0.83), and DeepEval (0.81); a custom LLM-as-judge with a rubric hit 0.91 but was the most expensive at $6.80 per run, against $3.20 to $5.40 for the others. DeepEval actually beat Ragas on context precision agreement (0.85 vs 0.82) and has the best CI developer experience since it's pytest-native. Phoenix is underrated for trace-level debugging even though its raw scores are middle of the pack. No framework I tested hits 0.95 human agreement on faithfulness for a real corpus, so be skeptical of vendor claims above that.

What are realistic production targets for RAG faithfulness and citation accuracy?

For a B2B knowledge-base RAG, I won't ship below 0.85 faithfulness, and I target 0.92 with a 0.96 stretch goal. Citation accuracy has the same 0.85 floor and a target of 0.93. Context precision @ 10 should clear 0.70 minimum with 0.82 as the realistic target, and answer relevancy has a 0.80 floor and a 0.90 target. These numbers are calibrated against systems I've shipped and cross-checked with the Stanford HELM grounding benchmarks. Anything significantly higher than these targets on a real production corpus usually means the eval set is too easy, not that the system is exceptional.

Why do RAG systems hallucinate even when answer relevancy scores are high?

Answer relevancy only measures whether the response addresses the question — it says nothing about whether the claims are actually supported by the retrieved context. That's why I once had a pipeline at 0.91 relevancy fabricating client account numbers: the answers were on-topic but ungrounded. Faithfulness (also called groundedness) is the metric that catches hallucination, because it decomposes the answer into individual claims and checks each one against retrieved chunks. You also need citation accuracy to verify that cited chunk IDs actually contain the cited claim, and a high out-of-scope refusal rate so the model says "I don't know" instead of inventing an answer when the corpus is silent.

How do I test that a RAG system properly refuses out-of-scope questions?

I build a deliberate out-of-scope test set by writing queries the corpus provably cannot answer, then measure what share the system correctly refuses with "I don't know" or equivalent. My production floor is a 0.80 refusal rate on this held-out OOS set, with 0.92 as the target and 0.97 as a stretch goal. This is the single most-skipped metric I see in the field, and it's the one that separates demos from production-ready systems. A model that always answers will look brilliant on relevancy scores and ship hallucinations to real users. You have to construct the OOS set manually because automated evaluations won't generate adversarial absences for you.

Lazar Milicevic

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.