Building a Production RAG Eval Harness from Scratch

Every RAG tutorial ends at the same point: you ask the system a question, it returns something plausible, and the author declares success. This is the stage where most production RAG projects also get declared complete, which is exactly when the problems start.

“Something plausible” is not a metric. Plausible-but-wrong answers erode user trust faster than no answer at all. Plausible-but-hallucinated answers are a liability. And “it worked in my dev environment” is not evidence that it will keep working after you change your chunking strategy, swap your embedding model, or upgrade the underlying LLM.

A production RAG system needs an evaluation harness: a reproducible, automated way to measure quality across the full pipeline and detect regressions before they reach users. This article builds one from scratch.

The three things you are measuring

RAG quality has three distinct components, and most teams conflate them — which is why they cannot diagnose quality problems when they occur.

Retrieval quality — did the system find the right chunks? This is measurable entirely independently of the LLM. You can evaluate your retrieval pipeline against a golden dataset without calling a language model at all, which makes it fast and cheap to run in CI.

Generation faithfulness — given the retrieved context, did the LLM produce a response that is accurate and grounded in that context? This is separate from retrieval: even with perfect retrieval, a model can hallucinate or add information not present in the chunks.

System-level quality — end-to-end: is the final answer useful and correct from the user’s perspective? This is the hardest to measure objectively but often what users actually care about. It sits on top of the other two; if retrieval or generation quality is poor, system-level quality will be poor regardless.

Evaluate all three separately. When quality degrades, you need to know whether the retrieval pipeline got worse, the generation got worse, or both. Teams that only evaluate at the system level spend hours debugging the wrong layer.
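As a rough illustration of why separate metrics help, the sketch below compares two eval runs and names the layer that regressed. The metric names match the ones the harness reports later in this article; the tolerances and the sample numbers are illustrative assumptions, not recommendations.

```python
from typing import Dict, List

# Hypothetical tolerances: how far a metric may drop before it counts as a regression.
RETRIEVAL_TOLERANCE = 0.03
FAITHFULNESS_TOLERANCE = 0.20


def diagnose_regression(
    baseline: Dict[str, float],
    current: Dict[str, float],
) -> List[str]:
    """Return the pipeline layers whose headline metric dropped past its tolerance."""
    layers = []
    if baseline["recall@5"] - current["recall@5"] > RETRIEVAL_TOLERANCE:
        layers.append("retrieval")
    if baseline["faithfulness_mean"] - current["faithfulness_mean"] > FAITHFULNESS_TOLERANCE:
        layers.append("generation")
    return layers


# Illustrative numbers only: recall dropped sharply, faithfulness barely moved.
baseline = {"recall@5": 0.81, "faithfulness_mean": 4.30}
current = {"recall@5": 0.70, "faithfulness_mean": 4.25}
print(diagnose_regression(baseline, current))  # ['retrieval']
```

With only a system-level score, both runs would simply look "worse", with nothing to say where to start debugging.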

Step 1: Build your golden dataset

No evaluation harness works without ground truth. This is the part people skip because it requires human judgment, but it is the most important investment you will make in your eval infrastructure.

A golden dataset for RAG consists of three things for each example:

A question representative of real user queries
A reference answer (or at minimum, the correct facts the answer should contain)
The set of document chunks that should be retrieved to answer the question

For a starting harness, 50 to 100 examples is sufficient to detect meaningful regressions. You do not need a thousand. Start with 50 and grow it as you encounter real failure modes.

The question/answer pairs should come from actual usage data if you have it, or be written by a domain expert if you do not. The relevant chunks require annotation — either manual, or LLM-generated and then verified. Do not skip the verification step; an incorrect golden set will produce misleading metrics.

Store your golden dataset in a versioned file, not in a database. YAML or JSON works well. The version history matters because your baseline metrics are only meaningful relative to the specific dataset version that produced them.

# data/golden_dataset.yaml
questions:
  - id: q001
    query: "What is the difference between RAG and fine-tuning?"
    reference_answer: "RAG retrieves relevant context at inference time and does not modify model weights. Fine-tuning bakes knowledge into model weights at training time and does not require retrieval at inference. They solve different problems and can be combined."
    relevant_chunk_ids:
      - doc_42_chunk_3
      - doc_17_chunk_1

  - id: q002
    query: "When should I use hybrid search instead of dense-only retrieval?"
    reference_answer: "Hybrid search (dense + sparse) is preferable when your domain has precise terminology, product codes, or proper nouns that embedding models may not handle well. Dense-only retrieval excels at semantic similarity but can miss exact-match requirements."
    relevant_chunk_ids:
      - doc_08_chunk_2
      - doc_08_chunk_5
      - doc_31_chunk_1
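Because a broken golden set silently corrupts every metric computed from it, it may be worth validating the file before each run. The sketch below assumes the YAML has already been parsed into a list of dictionaries (for example, the `questions` key after loading the file). The function name and the optional `known_chunk_ids` argument are hypothetical; the latter is meant to catch chunk IDs that went stale after a re-chunking.

```python
from typing import Dict, List, Optional, Set

REQUIRED_FIELDS = ("id", "query", "reference_answer", "relevant_chunk_ids")


def validate_golden_dataset(
    questions: List[Dict],
    known_chunk_ids: Optional[Set[str]] = None,
) -> List[str]:
    """Return human-readable problems; an empty list means the dataset looks usable."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(questions):
        label = item.get("id") or f"<item {index}>"
        # Every example needs all four fields, and none of them may be empty.
        for field in REQUIRED_FIELDS:
            if not item.get(field):
                problems.append(f"{label}: missing or empty '{field}'")
        # IDs must be unique so per-question results can be traced back.
        item_id = item.get("id")
        if item_id is not None:
            if item_id in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(item_id)
        # Optionally check that every referenced chunk still exists in the index.
        if known_chunk_ids is not None:
            for chunk_id in item.get("relevant_chunk_ids") or []:
                if chunk_id not in known_chunk_ids:
                    problems.append(f"{label}: unknown chunk id '{chunk_id}'")
    return problems
```

Running this as the first step of the eval script, and aborting when the list is non-empty, keeps a typo in the dataset from showing up later as a mysterious recall drop.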

Step 2: Measure retrieval quality

With your golden dataset in place, you can evaluate your retrieval pipeline in isolation. Three metrics matter most:

Recall@k: of the chunks that should be retrieved, what fraction actually appear in the top-k results? This is your most important metric. If your golden dataset says chunks A, B, and C are relevant to question Q, and your retriever returns 10 chunks that include A and C but miss B, your Recall@10 for that question is 2/3 ≈ 0.67. A low Recall@5 is the single clearest signal that no amount of generation-layer work will fix your system’s quality problems.

Precision@k: of the top-k chunks retrieved, what fraction are actually relevant? High recall with low precision means you are returning a lot of noise alongside the signal. The LLM has to work harder to ignore the irrelevant context, and you are paying for tokens that contribute nothing.

Mean Reciprocal Rank (MRR): where does the first relevant chunk appear in the ranked list? Each query scores 1/rank of its first relevant chunk (0 if none is retrieved), and MRR is the mean of those scores over all queries. If a relevant chunk is consistently at position 1, MRR approaches 1.0. If the first relevant chunk is often buried at position 5 or 6, MRR falls to roughly 0.2, and your ranking needs work regardless of recall.

from typing import List, Dict
import numpy as np


def recall_at_k(
    retrieved_ids: List[str],
    relevant_ids: List[str],
    k: int,
) -> float:
    if not relevant_ids:
        return 1.0
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    return len(top_k & relevant) / len(relevant)


def precision_at_k(
    retrieved_ids: List[str],
    relevant_ids: List[str],
    k: int,
) -> float:
    if k == 0:
        return 0.0
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


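def mean_reciprocal_rank(
    retrieved_ids: List[str],
    relevant_ids: List[str],
) -> float:
    # Reciprocal rank for a single query: 1/rank of the first relevant chunk,
    # or 0.0 if none was retrieved. evaluate_retrieval averages these into MRR.
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0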


def evaluate_retrieval(
    retriever,
    golden_dataset: List[Dict],
    k: int = 5,
) -> Dict[str, float]:
    recall_scores, precision_scores, mrr_scores = [], [], []

    for item in golden_dataset:
        retrieved = retriever.retrieve(item["query"], top_k=k)
        retrieved_ids = [chunk.id for chunk in retrieved]
        relevant_ids = item["relevant_chunk_ids"]

        recall_scores.append(recall_at_k(retrieved_ids, relevant_ids, k))
        precision_scores.append(precision_at_k(retrieved_ids, relevant_ids, k))
        mrr_scores.append(mean_reciprocal_rank(retrieved_ids, relevant_ids))

    return {
        f"recall@{k}":    round(float(np.mean(recall_scores)), 4),
        f"precision@{k}": round(float(np.mean(precision_scores)), 4),
        "mrr":            round(float(np.mean(mrr_scores)), 4),
    }

Run this before you invest time in the generation layer. Know your retrieval baseline. If Recall@5 is 0.45, then on average more than half of the relevant chunks for a question never reach the LLM, and no amount of prompt engineering will compensate for context that was never retrieved.

A reasonable baseline to aim for before moving on: Recall@5 ≥ 0.75 on your golden set. If you are below that, fix retrieval first — try different chunk sizes, add metadata filters, or introduce re-ranking.
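An aggregate Recall@5 tells you that retrieval needs work but not where. One way to find out, sketched below, is to rank golden questions by their individual recall and list the chunk IDs each one missed. The function name and the `retrieved_by_id` mapping (question ID to ranked chunk IDs) are assumptions made for this example.

```python
from typing import Dict, List


def worst_questions(
    golden_dataset: List[Dict],
    retrieved_by_id: Dict[str, List[str]],
    k: int = 5,
    limit: int = 10,
) -> List[Dict]:
    """List the questions with the lowest Recall@k and the chunk IDs each one missed."""
    rows = []
    for item in golden_dataset:
        relevant = set(item["relevant_chunk_ids"])
        if not relevant:
            continue  # nothing to recall for this question
        top_k = set(retrieved_by_id.get(item["id"], [])[:k])
        missing = sorted(relevant - top_k)
        recall = 1 - len(missing) / len(relevant)
        rows.append({"id": item["id"], "recall": round(recall, 4), "missing": missing})
    # Lowest recall first, so the worst offenders lead the report.
    rows.sort(key=lambda row: row["recall"])
    return rows[:limit]
```

If the same documents keep appearing in the `missing` lists, the problem is likely how those documents are chunked or embedded rather than the ranking as a whole.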

Step 3: Measure generation faithfulness

Once retrieval is measured and acceptable, you evaluate generation. The core question: does the generated answer stay grounded in the retrieved context, or does the model add information that was not there?

The most scalable approach is LLM-as-judge: use a language model to score the faithfulness of each response relative to the retrieved chunks. This is imperfect — the judge can be wrong — but it is automatable and orders of magnitude cheaper than human review at scale.

The judge prompt matters significantly. Vague prompts produce noisy, unreliable scores. Be explicit about the scoring rubric:

FAITHFULNESS_PROMPT = """\
You are evaluating whether an AI-generated answer is faithful to the
provided source documents.

Source documents:
{context}

Question: {question}

Generated answer: {answer}

A faithful answer:
- Contains only information present in or directly inferable from the sources
- Does not add facts, statistics, or claims not supported by the sources
- Acknowledges uncertainty when the sources are ambiguous or incomplete

Rate faithfulness on a scale of 1–5:
  5 = Fully faithful: every claim is grounded in the sources
  4 = Mostly faithful: one minor unsupported inference
  3 = Partially faithful: some claims supported, some not
  2 = Mostly unfaithful: significant unsupported claims
  1 = Unfaithful: answer largely ignores or contradicts the sources

Respond with JSON only: {{"score": <1-5>, "reasoning": "<one sentence>"}}
"""


def score_faithfulness(
    llm_client,
    question: str,
    context_chunks: List[str],
    answer: str,
) -> Dict:
    context = "\n\n---\n\n".join(context_chunks)
    prompt = FAITHFULNESS_PROMPT.format(
        context=context,
        question=question,
        answer=answer,
    )
    response = llm_client.complete(
        prompt,
        response_format={"type": "json_object"},
    )
    return response.parsed  # {"score": int, "reasoning": str}


def evaluate_generation(
    rag_pipeline,
    llm_client,
    golden_dataset: List[Dict],
) -> Dict[str, float]:
    scores = []
    for item in golden_dataset:
        result = rag_pipeline.run(item["query"])
        faithfulness = score_faithfulness(
            llm_client,
            question=item["query"],
            context_chunks=[c.text for c in result.retrieved_chunks],
            answer=result.answer,
        )
        scores.append(faithfulness["score"])

    return {
        "faithfulness_mean": round(float(np.mean(scores)), 4),
        "faithfulness_p10":  round(float(np.percentile(scores, 10)), 4),
    }
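The code above trusts the judge to return well-formed JSON with a score between 1 and 5. Judges do not always comply, so a defensive parser can be useful. This is a minimal sketch that assumes you have access to the judge's raw text output; `parse_judge_response` is a hypothetical helper, not part of any client library.

```python
import json
from typing import Dict, Optional


def parse_judge_response(raw_text: str) -> Optional[Dict]:
    """Return {"score": int, "reasoning": str}, or None if the output is unusable."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    if not 1 <= score <= 5:
        return None
    return {"score": score, "reasoning": str(data.get("reasoning", ""))}
```

Count the responses that come back as `None` and report that number next to the faithfulness scores. Silently dropping them would make the metrics look better than they are.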

Track the p10 faithfulness score alongside the mean. The mean can look acceptable while the tail, your worst 10% of responses, is actively harmful. The p10 is the score that those worst responses fall at or below, and that is where the users who get bad answers live.
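To see how the mean can hide a bad tail, consider two made-up sets of 20 judge scores with the same mean. The `percentile` helper below is a pure-Python stand-in that uses linear interpolation, the same default method as `np.percentile`; the scores are invented for illustration.

```python
from statistics import mean
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Linearly interpolated percentile of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


steady = [4] * 12 + [5] * 8      # no bad answers
long_tail = [5] * 17 + [1] * 3   # three badly unfaithful answers

for name, scores in (("steady", steady), ("long_tail", long_tail)):
    print(name, mean(scores), percentile(scores, 10))
```

Both sets have a mean of 4.4, but the p10 is 4.0 for the first and 1.0 for the second. A report showing only the mean would treat them as equivalent.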

Step 4: Track latency and cost

These two metrics belong in the eval harness, not only in production monitoring. The reason: every time you change your retrieval strategy or swap a model, you change your latency profile and your cost per request. You want to see those tradeoffs in the same report as quality metrics — not discover them after shipping.

Track per-request:

Retrieval latency (vector database query time, including re-ranking if applicable)
LLM time-to-first-token and total completion time
Input token count, output token count, and estimated cost

import time
from dataclasses import dataclass
from statistics import mean, quantiles
from typing import Dict, List


@dataclass
class RequestTrace:
    query: str
    retrieval_ms: float
    llm_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float


def timed_retrieve(retriever, query: str, k: int):
    t0 = time.monotonic()
    chunks = retriever.retrieve(query, top_k=k)
    elapsed_ms = (time.monotonic() - t0) * 1000
    return chunks, elapsed_ms


def aggregate_latency(traces: List[RequestTrace]) -> Dict:
    retrieval = [t.retrieval_ms for t in traces]
    llm = [t.llm_ms for t in traces]
    return {
        "retrieval_p50_ms": round(quantiles(retrieval, n=2)[0], 1),
        "retrieval_p95_ms": round(quantiles(retrieval, n=20)[18], 1),
        "llm_p50_ms":       round(quantiles(llm, n=2)[0], 1),
        "llm_p95_ms":       round(quantiles(llm, n=20)[18], 1),
        "cost_per_req_usd": round(mean(t.cost_usd for t in traces), 6),
    }
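The `cost_usd` field has to come from somewhere. A common approach is to estimate it from the token counts and your provider's per-token prices. The prices in this sketch are placeholders, not real rates, and `estimate_cost_usd` is a hypothetical helper.

```python
# Hypothetical prices in USD per 1,000 tokens. Placeholders, not real rates.
PRICE_PER_1K_INPUT_USD = 0.0005
PRICE_PER_1K_OUTPUT_USD = 0.0015


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    price_in: float = PRICE_PER_1K_INPUT_USD,
    price_out: float = PRICE_PER_1K_OUTPUT_USD,
) -> float:
    """Estimate the cost of one request from its token counts."""
    return input_tokens / 1000 * price_in + output_tokens / 1000 * price_out


# 2,000 prompt tokens (question plus retrieved chunks) and 500 completion tokens.
print(f"{estimate_cost_usd(2000, 500):.6f}")  # 0.001750
```

Since retrieved chunks usually dominate the prompt, raising top-k tends to raise input tokens and therefore cost, which is one more reason to report cost next to recall.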

A retrieval strategy change that improves Recall@5 by 0.08 but doubles p95 retrieval latency is a tradeoff, not a clear win. Seeing both numbers in the same eval report forces the right conversation.

Step 5: Wire it into CI

An eval harness that runs manually is better than nothing. An eval harness that runs automatically on every relevant pull request is what actually prevents regressions from reaching production.

The CI setup has four requirements:

The golden dataset lives in the repository and is versioned alongside the code.
The eval suite runs on every pull request that touches the retrieval pipeline, the prompt templates, or the embedding model configuration.
The build fails — or at minimum, posts a visible metric delta comment — if key metrics regress beyond your configured threshold.
Each run appends a row to a metrics history file so you can trend quality over time.
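The last requirement, the metrics history, can be sketched as a JSON Lines file with one record per run. The function name, the file layout, and the `dataset_version` field are assumptions for this example. Recording the dataset version follows from the earlier point that metrics are only comparable when they come from the same golden set.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict


def append_history_row(
    history_path: Path,
    commit: str,
    dataset_version: str,
    metrics: Dict[str, float],
) -> None:
    """Append one eval run to a JSON Lines history file, creating it if needed."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "commit": commit,
        "dataset_version": dataset_version,
        "metrics": metrics,
    }
    history_path.parent.mkdir(parents=True, exist_ok=True)
    with history_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
```

JSON Lines is used here instead of CSV so that adding a new metric later does not break older rows. In CI the file has to be stored somewhere that outlives the job, for example as an uploaded artifact or a committed file; which one fits depends on your setup.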

# .github/workflows/rag-eval.yml
name: RAG Evaluation

on:
  pull_request:
    paths:
      - "src/retrieval/**"
      - "src/prompts/**"
      - "config/embedding_model.yaml"
      - "data/golden_dataset.yaml"

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run RAG eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/eval_rag.py \
            --golden-set data/golden_dataset.yaml \
            --output results/eval_$(git rev-parse --short HEAD).json

      - name: Check regression thresholds
        run: python scripts/check_thresholds.py results/eval_*.json

      - name: Upload eval results
        if: always()  # keep the results even when the threshold check fails
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/

The check_thresholds.py script reads the eval output and compares each metric against a configured floor: a target minimum, loosened by a small tolerance so that run-to-run noise does not fail the build. If any metric falls below its floor, the script exits with code 1, which fails the CI check and, if that check is marked as required, blocks the merge.

# scripts/check_thresholds.py
import json
import sys
from pathlib import Path

# "min" is the target value; "tolerance" (negative) loosens it to absorb
# run-to-run noise. The build fails only below min + tolerance.
THRESHOLDS = {
    "recall@5":          {"min": 0.70, "tolerance": -0.03},
    "mrr":               {"min": 0.65, "tolerance": -0.03},
    "faithfulness_mean": {"min": 3.80, "tolerance": -0.20},
    "faithfulness_p10":  {"min": 3.00, "tolerance": -0.30},
}

results_path = Path(sys.argv[1])
results = json.loads(results_path.read_text())

failures = []
for metric, config in THRESHOLDS.items():
    floor = config["min"] + config["tolerance"]
    value = results.get(metric)
    if value is None:
        # A missing metric means the eval did not run properly; do not pass silently.
        failures.append(f"  {metric}: missing from {results_path.name}")
    elif value < floor:
        failures.append(f"  {metric}: {value:.4f} (floor {floor:.4f})")

if failures:
    print("Eval regression detected:\n" + "\n".join(failures))
    sys.exit(1)

print("All metrics within acceptable range.")
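Absolute floors catch a metric that falls too low, but not one that drifts down while staying above the floor. A complementary check, sketched here, compares the current run against a stored baseline run, such as the last result on the main branch. The function name and the way the baseline is obtained are assumptions; tolerances are written as positive numbers in this sketch.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerances: Dict[str, float],
) -> List[str]:
    """Report every metric that dropped by more than its tolerance versus the baseline."""
    failures = []
    for metric, tolerance in tolerances.items():
        if metric not in baseline or metric not in current:
            failures.append(f"{metric}: missing from baseline or current results")
            continue
        delta = current[metric] - baseline[metric]
        if delta < -tolerance:
            failures.append(
                f"{metric}: {baseline[metric]:.4f} -> {current[metric]:.4f} ({delta:+.4f})"
            )
    return failures


# Illustrative numbers: recall drops by 0.05, MRR by 0.01, both allowed 0.03.
print(find_regressions(
    {"recall@5": 0.81, "mrr": 0.70},
    {"recall@5": 0.76, "mrr": 0.69},
    {"recall@5": 0.03, "mrr": 0.03},
))  # ['recall@5: 0.8100 -> 0.7600 (-0.0500)']
```

The same delta strings can be posted as a pull request comment when you prefer a visible warning over a hard failure.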

What this actually buys you

When the harness is in place, you can do things that teams without one cannot:

Change retrieval strategy without fear. Try a different chunking approach, add a re-ranking step, switch from top-5 to top-8 — and know within a CI run whether Recall@k improved or regressed, and by how much.

Benchmark model upgrades confidently. When a new embedding model is released or the LLM provider changes pricing, you have a reproducible, automated way to measure whether quality changed and in which direction.

Diagnose production incidents precisely. When users complain about response quality, you have a methodology for root-causing the failure — retrieval, generation, or both — instead of guessing.

Make credible, specific claims. “Our retrieval achieves Recall@5 of 0.81 on our domain-specific golden set of 87 questions” is a statement that can be reproduced, challenged, and improved. “Our system works really well” cannot.

The eval harness is not glamorous work. It does not make it into demos. But it is the thing that separates production AI systems from AI prototypes — and the single clearest signal in a portfolio that says this engineer has shipped AI in the real world, not just in a Colab notebook.

If you are building toward an Applied AI Engineering role, publish your eval harness alongside your RAG application. The combination is a receipt. Receipts beat credentials.

The inferLearn Roadmap covers every phase of this build — from first LLM API call to production eval-driven development — with specific resources and clear criteria for each step.