I wrote this after realizing I could not answer a basic question: “how often is my retrieval system wrong?” I was running this in an isolated personal environment and validating access patterns through Tailscale, but I still had no quantitative error baseline.
My first “evaluation” was a thumbs-up/thumbs-down widget. That gave weak, biased signal and no way to debug specific failures. I also noticed confident hallucinations of file paths during testing. So I built a repeatable evaluation loop and treated this post as my notes from that process.
This is what I learned: which metrics were useful, where LLM-as-judge helped, and where it failed.
What I Was Trying to Build
The overarching goal of this series is a technical documentation repository that AI coding agents can query safely for code-level guidance.
In practice, that requires three connected layers:
- A reliable ingestion/indexing layer.
- A high-precision retrieval layer.
- A measurable evaluation layer.
This post is Part 3: evaluation and quality governance.
On terminology. A RAG retrieval pipeline is not an agent. An agent, in the technical sense, implies autonomous tool use, multi-step planning, and environment interaction: running code, writing files, calling external services. What we built is a high-precision knowledge lookup service that agents invoke as a tool. This distinction matters because the evaluation methodology differs. We are not evaluating whether a system takes correct actions; we are evaluating whether it retrieves correct context and reasons faithfully over it. Those are tractable, measurable problems.
Why Retrieval System Evaluation Is Hard
Evaluating a knowledge-grounded retrieval system is fundamentally harder than evaluating most ML tasks because of what I’ll call the double indirection problem. When you evaluate a classifier, ground truth is clear: the label matches or it doesn’t. With a RAG system, “did it respond correctly?” depends on three separate questions:
- Were the right file chunks retrieved? (Retrieval quality)
- Did the LLM use those chunks faithfully? (Generation quality)
- Was the final answer actually correct? (End-to-end quality)
These fail independently. The system can retrieve the right code file and then hallucinate a function signature. It can generate a technically correct answer from the wrong file entirely. Or it can cite the wrong source and the LLM happens to get it right anyway, making your retrieval look better than it is.
You need to measure all three. They require different approaches.
Why these evaluation tools
I am explicit about tool choices because evaluation stacks can become needlessly complex, especially for independent projects.
- RAGAS-style metrics: useful because they separate retrieval and generation failure modes.
- Ray: useful only because I was evaluating hundreds of queries per run and needed parallelism.
- MLflow: useful because I needed run history and metric trends, not one-off screenshots.
If your dataset is small, a simple notebook + CSV logging is enough to start.
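For that starting point, a minimal CSV logger is genuinely enough. This is a sketch under my own assumptions; the file path and the exact column names are placeholders, not part of the pipeline described later:

```python
import csv
import os
from datetime import datetime, timezone

def log_eval_row(path: str, query: str, scores: dict) -> None:
    """Append one evaluation result to a CSV file, writing the header on first use."""
    fieldnames = ["timestamp", "query", "faithfulness", "answer_relevance", "context_precision"]
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if needs_header:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "query": query,
            **{k: scores.get(k) for k in fieldnames[2:]},
        })
```

A file like this, loaded into a dataframe later, already supports trend plots and regression diffs; graduating to MLflow is an optimization, not a prerequisite.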
The Evaluation Triad
The RAG evaluation community has converged on three core metrics that together tell you most of what you need to know about system health. These came largely from the RAGAS paper and subsequent work.
flowchart TD
subgraph Triad["The RAG Evaluation Triad"]
A["Query + Context + Answer"]
A --> B["Faithfulness\nDoes the answer\nstay within the retrieved chunks?"]
A --> C["Answer Relevance\nDoes the answer\naddress the question?"]
A --> D["Context Precision\nAre the retrieved chunks\nactually relevant?"]
B --> E{Overall System Quality}
C --> E
D --> E
end
subgraph Failures["What Each Metric Catches"]
B --> F["Hallucination:\nLLM invents facts\nnot present in any file"]
C --> G["Off-topic Response:\nLLM answers a different\nquestion than asked"]
D --> H["Retrieval Failure:\nRight file chunks\nweren't returned at all"]
end
style Triad fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb
style Failures fill:#3b0764,stroke:#a855f7,color:#f9fafb
Faithfulness: Is the answer grounded in the retrieved files?
Faithfulness measures whether the claims in the system’s answer can be attributed to the retrieved chunks. An unfaithful answer is one where the LLM has gone beyond — or contradicted — what the source files say.
We compute this by decomposing the answer into atomic claims, then verifying each against the retrieved context:
import json
from dataclasses import dataclass
from openai import AzureOpenAI
@dataclass
class FaithfulnessResult:
score: float # 0.0 to 1.0
claims: list[str]
verdicts: list[dict]
def compute_faithfulness(
query: str,
context: list[str], # retrieved file chunks
answer: str,
client: AzureOpenAI,
) -> FaithfulnessResult:
context_str = "\n\n".join(
f"[File chunk {i+1}]: {c}" for i, c in enumerate(context)
)
# Step 1: Extract atomic claims from the answer
claims_resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": (
"Break this answer into a list of atomic factual claims. "
"Each claim should be a single, independently verifiable sentence.\n\n"
f"Answer: {answer}\n\n"
'Return JSON: {"claims": [str, ...]}'
),
}],
response_format={"type": "json_object"},
)
claims = json.loads(claims_resp.choices[0].message.content)["claims"]
# Step 2: Verify each claim against the retrieved file chunks
verdicts_resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": (
"For each claim, determine if it is supported, contradicted, "
"or not mentioned in the file chunks below.\n\n"
f"File chunks:\n{context_str}\n\n"
f"Claims:\n{json.dumps(claims, indent=2)}\n\n"
'Return JSON: {"verdicts": [{"claim": str, '
'"verdict": "supported|contradicted|not_mentioned", "reasoning": str}]}'
),
}],
response_format={"type": "json_object"},
)
verdicts = json.loads(verdicts_resp.choices[0].message.content)["verdicts"]
supported = sum(1 for v in verdicts if v["verdict"] == "supported")
score = supported / len(verdicts) if verdicts else 0.0
return FaithfulnessResult(score=score, claims=claims, verdicts=verdicts)
When I first ran this on production-like queries, faithfulness was 0.63. That gave me a concrete baseline instead of guesswork.
Answer Relevance: Does the answer address the question?
Answer relevance is often confused with faithfulness. Faithful means “the answer stays within the retrieved files.” Relevant means “it answers the question that was actually asked.” You can be perfectly faithful and completely irrelevant.
The RAGAS approach is elegant: ask the LLM to reverse-engineer the question from the answer, then measure similarity between the generated question and the original.
import json

import numpy as np
from sentence_transformers import SentenceTransformer

sim_model = SentenceTransformer("all-MiniLM-L6-v2")
def compute_answer_relevance(
query: str,
answer: str,
client: AzureOpenAI,
n_reverse: int = 3,
) -> float:
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": (
f"Generate {n_reverse} different questions that this answer would "
"be a good response to. Vary the phrasing.\n\n"
f"Answer: {answer}\n\n"
'Return JSON: {"questions": [str, ...]}'
),
}],
response_format={"type": "json_object"},
)
reverse_qs = json.loads(resp.choices[0].message.content)["questions"]
query_emb = sim_model.encode([query])[0]
reverse_embs = sim_model.encode(reverse_qs)
sims = [
np.dot(query_emb, rev) / (np.linalg.norm(query_emb) * np.linalg.norm(rev))
for rev in reverse_embs
]
return float(np.mean(sims))
Context Precision: Were the retrieved chunks actually useful?
Context precision measures whether the file chunks you retrieved were genuinely relevant to the query. High faithfulness and relevance with low context precision usually means the system got lucky — or retrieved ten chunks but only one mattered.
Irrelevant context dilutes the signal for the LLM and wastes tokens. Surface this metric and it tells you when your retrieval is sloppy even if the final answers look OK.
def compute_context_precision(
query: str,
contexts: list[str], # retrieved file chunks
answer: str,
client: AzureOpenAI,
) -> float:
verdicts = []
for ctx in contexts:
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": (
'Was this file chunk useful for answering the query? '
'Answer with only "yes" or "no".\n\n'
f"Query: {query}\n"
f"System answer: {answer}\n"
f"File chunk: {ctx}"
),
}],
)
        # startswith tolerates variants like "Yes." or "yes,"
        verdicts.append(resp.choices[0].message.content.strip().lower().startswith("yes"))
# Calculate average precision (rewards relevant items ranked higher)
precision_at_k = []
relevant_so_far = 0
for k, is_relevant in enumerate(verdicts, 1):
if is_relevant:
relevant_so_far += 1
precision_at_k.append(relevant_so_far / k)
return float(np.mean(precision_at_k)) if precision_at_k else 0.0
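The average-precision step is worth seeing in isolation. This toy version runs the same ranking arithmetic on hand-made verdicts, with no LLM calls:

```python
def average_precision(verdicts: list[bool]) -> float:
    """Average precision over a ranked list of relevance verdicts."""
    precisions, relevant = [], 0
    for k, is_relevant in enumerate(verdicts, 1):
        if is_relevant:
            relevant += 1
            precisions.append(relevant / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant chunks ranked first score higher than the same chunks ranked last:
# [True, True, False, False] -> (1/1 + 2/2) / 2 = 1.0
# [False, False, True, True] -> (1/3 + 2/4) / 2 ~= 0.417
```

Ranking the relevant chunks first scores 1.0; burying the same chunks at the bottom of the list scores roughly 0.42, which is exactly the "sloppy retrieval" signal this metric exists to surface.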
Automating the Evaluation Pipeline
Running these metrics manually works for spot-checking. What you need is an automated pipeline that evaluates a representative sample of queries regularly — so you catch regressions before users notice.
flowchart TD
A[Golden queries + sampled prod queries] --> B[Batch eval]
B --> C[Faithfulness, relevance, precision]
C --> D[MLflow metrics]
C --> E[Failure bucket summary]
E --> F[Human review sample]
F --> G[Retrieval or prompt fix]
G --> A
style A fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb
style B fill:#14532d,stroke:#22c55e,color:#f9fafb
style D fill:#3b0764,stroke:#a855f7,color:#f9fafb
style F fill:#7c2d12,stroke:#f97316,color:#f9fafb
Golden question set: ~500 questions with human-verified answers, written by developers who know the codebase. These run on every deployment and act as the regression baseline.
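One way to turn the golden set into a regression gate is to diff each run's mean metrics against a stored baseline. A sketch; the baseline file format and the 0.05 tolerance are my assumptions, not anything prescribed by the pipeline above:

```python
import json

def check_regression(results: list[dict], baseline_path: str, max_drop: float = 0.05) -> list[str]:
    """Compare this run's mean metrics against a stored baseline.

    Returns the names of metrics that dropped more than max_drop,
    so the deployment script can fail the build on a non-empty list.
    """
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"faithfulness": 0.81, "answer_relevance": 0.85}
    regressed = []
    for metric, base in baseline.items():
        current = sum(r[metric] for r in results) / len(results)
        if current < base - max_drop:
            regressed.append(metric)
    return regressed
```

The tolerance matters: LLM-judged metrics are noisy run to run, so gating on any drop at all produces false alarms, while a fixed slack like 0.05 catches real regressions without flapping.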
Production sampling: Every day, we sample 200 real queries, run them through the retrieval system, and evaluate with the triad. This catches distribution shift — when what users actually ask starts diverging from what we tested.
Ray for batch evaluation: Evaluating 700 query-answer-context triples sequentially would take two hours. With Ray fanning out, it takes 12 minutes.
import os

import ray
from dataclasses import dataclass
from typing import Optional
from openai import AzureOpenAI
from sentence_transformers import SentenceTransformer

# Read credentials from the environment rather than hardcoding them
AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
AZURE_OPENAI_KEY = os.environ["AZURE_OPENAI_KEY"]
@dataclass
class EvalSample:
query: str
context: list[str] # retrieved file chunks
answer: str
expected_answer: Optional[str] = None
@ray.remote
class RAGEvaluator:
def __init__(self):
self.client = AzureOpenAI(
azure_endpoint=AZURE_OPENAI_ENDPOINT,
api_key=AZURE_OPENAI_KEY,
api_version="2024-02-01",
)
self.sim_model = SentenceTransformer("all-MiniLM-L6-v2")
def evaluate(self, sample: EvalSample) -> dict:
faithfulness = compute_faithfulness(
sample.query, sample.context, sample.answer, self.client
)
relevance = compute_answer_relevance(
sample.query, sample.answer, self.client
)
precision = compute_context_precision(
sample.query, sample.context, sample.answer, self.client
)
return {
"query": sample.query,
"faithfulness": faithfulness.score,
"answer_relevance": relevance,
"context_precision": precision,
"composite": (faithfulness.score + relevance + precision) / 3,
"faithfulness_details": faithfulness.verdicts,
}
# Fan out across 8 actor workers
evaluators = [RAGEvaluator.remote() for _ in range(8)]
futures = [
evaluators[i % len(evaluators)].evaluate.remote(sample)
for i, sample in enumerate(eval_samples)
]
results = ray.get(futures)
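Once the futures resolve, the per-query results need rolling up before they are logged. A minimal aggregation sketch; the MLflow calls in the comment follow the standard `log_metrics` pattern and are my assumption about the wiring, not a prescribed setup:

```python
def aggregate_metrics(results: list[dict]) -> dict:
    """Mean of each triad metric across the run, keyed for MLflow."""
    keys = ["faithfulness", "answer_relevance", "context_precision", "composite"]
    return {f"mean_{k}": sum(r[k] for r in results) / len(results) for k in keys}

# Logged once per evaluation run so MLflow can chart trends across deployments:
#   import mlflow
#   with mlflow.start_run(run_name="rag-eval"):
#       mlflow.log_metrics(aggregate_metrics(results))
```

Keeping aggregation as a pure function makes it trivial to test, and means the same rollup feeds both the MLflow run and the Slack alert below.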
When the composite score drops below threshold, I send a compact Slack summary so I can triage quickly:
import requests
def send_eval_alert(results: list[dict], webhook_url: str) -> None:
failed = [r for r in results if r["composite"] < 0.7]
if not failed:
return
low_faith = [r for r in failed if r["faithfulness"] < 0.6]
low_rel = [r for r in failed if r["answer_relevance"] < 0.6]
low_prec = [r for r in failed if r["context_precision"] < 0.6]
message = (
f":warning: *RAG Eval Alert* — {len(failed)}/{len(results)} queries below threshold\n"
f"• Low faithfulness: {len(low_faith)} queries\n"
f"• Low relevance: {len(low_rel)} queries\n"
f"• Low precision: {len(low_prec)} queries\n"
f"Review: <https://your-mlflow-url/experiments/rag-eval|MLflow Dashboard>"
)
requests.post(webhook_url, json={"text": message})
LLM-as-Judge: The Calibration Problem
Using an LLM to evaluate an LLM is inherently circular, and the criticism is valid. The judge can be wrong. It has biases — toward longer answers, toward confident language, toward its own training distribution.
We handle this three ways:
1. Use a different model family as judge. Our retrieval system uses GPT-4o for generation; we use Claude Sonnet as the judge (the snippets above show gpt-4o-mini for brevity; the principle is to keep the judge outside the generator's model family). Cross-family agreement is a stronger signal than same-family self-evaluation.
2. Calibrate against humans. Take 200 samples, have developers label them for faithfulness (binary: faithful or not), and compare to your LLM judge. Agreement of around 88% is reliable enough for automated use; below 80%, retune the judge prompt before trusting the numbers.
3. Surface disagreements, not just averages. When faithfulness is 0.72, you want to know which queries failed, not just that 28% did. The most valuable output from the eval pipeline is a curated set of failure cases for human review.
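The calibration step in point 2 reduces to a simple agreement rate. A sketch, assuming you have aligned lists of binary human and judge verdicts for the same samples:

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of samples where the LLM judge's binary faithfulness
    verdict matches the human label."""
    assert len(human) == len(judge), "verdict lists must be aligned"
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)
```

One caveat: if 90% of your samples are faithful, a judge that always says "faithful" scores 0.9 on this measure, so it is worth also checking agreement restricted to the human-labeled unfaithful subset.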
flowchart LR
A[Eval Pipeline Run] --> B[Compute Metrics]
B --> C{Composite Score >= 0.8?}
C -->|Yes| D[Log to MLflow\nall metrics]
C -->|No| E[Identify Low-Score Samples\nscore < 0.6]
E --> F[Cluster by Failure Type\nvia embeddings]
F --> G[Sample Representatives\nfrom each cluster]
G --> H[Human Review Queue\n~20 samples/day]
H --> I[Add to Labeled Dataset\nif systematic failure]
I --> J[Fix Retrieval or Prompt]
J --> K[Re-evaluate on Full Set]
K --> A
style A fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb
style H fill:#7c2d12,stroke:#f97316,color:#f9fafb
style J fill:#14532d,stroke:#22c55e,color:#f9fafb
The human review queue is deliberately small — 20 samples per day. The goal isn’t to label everything; it’s to find categories of failure. When a reviewer notices a pattern — “the system always fabricates line numbers when citing source files” — that’s the signal to fix the prompt, not review 200 individual cases.
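The "cluster by failure type via embeddings" step can be as simple as k-means over the low-scoring samples' embeddings, taking the sample nearest each centroid as that cluster's representative. A numpy-only sketch that assumes the embeddings are already computed (for instance with the same MiniLM model used for relevance); the cluster count and iteration budget are arbitrary:

```python
import numpy as np

def cluster_representatives(embs: np.ndarray, k: int, iters: int = 20) -> list[int]:
    """Toy k-means over failure-case embeddings. Returns the index of the
    sample closest to each centroid: one representative per failure cluster."""
    # Farthest-point initialization: deterministic, spreads the seeds apart.
    centroids = [embs[0]]
    while len(centroids) < k:
        d = np.min([np.linalg.norm(embs - c, axis=1) for c in centroids], axis=0)
        centroids.append(embs[int(d.argmax())])
    centroids = np.array(centroids, dtype=float)
    for _ in range(iters):
        # Assign each sample to its nearest centroid, then recompute means.
        dists = np.linalg.norm(embs[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = embs[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    dists = np.linalg.norm(embs[:, None, :] - centroids[None, :, :], axis=-1)
    return [int(dists[:, c].argmin()) for c in range(k)]
```

Reviewing one representative per cluster is what keeps the queue at ~20 samples a day instead of 200.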
What I Learned from Monitoring
After running this for a while, the biggest lesson was that metrics without baselines are meaningless.
My 0.63 faithfulness score sounded alarming, but it was useful precisely because it gave me a starting point to improve from.
Second lesson: the bottleneck metric changes. I initially had relevance problems, then precision problems, then faithfulness problems.
Third lesson: some failures are invisible to the triad. If retrieval returns mostly irrelevant chunks, the model can still answer confidently from priors. I had to add a coverage metric to catch this case.
To put a number on this: in one week's production sample, we identified 34 queries (out of 200) where every retrieved chunk scored below 0.3 context precision. The system still answered 31 of them. A coverage metric ("fraction of queries where at least one retrieved chunk has precision above a threshold") would have caught this class of failure directly.
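Coverage is a few lines once per-chunk precision scores exist. A sketch, with the 0.3 threshold carried over from the example above:

```python
def coverage(per_query_chunk_scores: list[list[float]], threshold: float = 0.3) -> float:
    """Fraction of queries where at least one retrieved chunk clears the
    relevance threshold, i.e. where the system had real evidence to work with."""
    if not per_query_chunk_scores:
        return 0.0
    covered = sum(
        any(score > threshold for score in chunk_scores)
        for chunk_scores in per_query_chunk_scores
    )
    return covered / len(per_query_chunk_scores)
```

Unlike the triad, coverage flags the answered-from-priors case directly: a query with no chunk above the threshold counts against the system even when the final answer happens to look fine.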
This post concludes the ML Engineering series for now. The infrastructure that runs this evaluation pipeline is described in Part 1; the retrieval system being evaluated is in Part 2.