Cutting Down 'Plausible but Wrong' RAG Answers — A Retrieval Evaluation Harness for Gemma 4 and Antigravity
Replace gut feeling with recall@5, MRR and faithfulness scores — a 30-question golden dataset and a small Python harness for evaluating a local Gemma 4 RAG stack.
In the spring of 2026, I was wiring up a RAG system over the support knowledge base for the wallpaper apps I run as an indie developer. A user asked why the Android version showed a blank white screen after switching themes. The retrieval step confidently handed the model a chunk about restoring purchases on iOS. The generated answer read beautifully — polite, structured, plausible. It also had nothing to do with the question.
That is what makes "plausible but wrong" RAG answers so dangerous: they fail silently. When retrieval misses, the language model doesn't throw an error. It takes whatever chunks it received and weaves them into fluent prose, and the fluency hides the failure. When I later audited my error logs, roughly 70% of the bad answers traced back to retrieval missing the right document entirely — not to the generation step I had been busy tuning.
Instead of adopting a heavyweight evaluation platform, I built a small harness: a 30-question golden dataset and a few dozen lines of Python. My recall@5 went from 0.63 to 0.86 through preprocessing and chunking changes I could finally verify, and more importantly, I gained the habit of checking numbers after every change. This article walks through that harness, end to end, on a local Gemma 4 stack. If you haven't built the pipeline itself yet, Building a RAG Pipeline with Gemma 4 and ChromaDB: Implementation covers that ground — here we focus on measuring and improving a RAG that already runs.
Why Fluent Generation Hides Retrieval Failures
RAG quality work goes in circles unless you separate two questions: did retrieval fetch the right material, and did generation stay faithful to it? My own first mistake was failing to separate them. I kept polishing prompts because the answers looked like generation problems. Only when I lined up the retrieval logs did I see that, for most failures, the correct chunk wasn't in the top five at all. Prompt work is the final polish you apply after retrieval is verified. Do it earlier and you are tuning noise.
The two layers fail in characteristically different ways:
Retrieval failures: weak handling of paraphrases, orthographic variation (in Japanese support text, full-width vs. half-width characters and katakana variants; in English, product-name synonyms and abbreviations), and out-of-scope questions that get matched to the "closest" chunk anyway
Generation failures: dropping conditions or caveats during summarization even though the right chunk was provided, or blending two chunks into a procedure that never existed
Because the fixes are completely different, the evaluation needs two layers as well: mechanical ranking metrics for retrieval, and a judged faithfulness check for generation. That separation is the backbone of everything below.
One practical note before any framework: just logging retrieval in a human-readable form is a real step forward. I keep one JSON line per query — the question, the top-five chunk IDs with scores, and the final answer. When a wrong answer gets reported, identifying which layer failed takes minutes instead of an afternoon. The harness in this article is an extension of that log, not a replacement for it. For the broader pipeline patterns around vector search, Building a RAG Pipeline with Antigravity — Unlock Your Company's Knowledge with Vector Search and LLMs is a useful companion read.
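A minimal sketch of such a log, assuming hits arrive as (chunk_id, score) pairs already sorted by rank. The function names and field names here are illustrative, not taken from any library:

```python
import json
import time
from pathlib import Path


def build_log_record(question, hits, answer, top_n=5):
    """Build one log record: question, top-N chunk IDs with scores, final answer."""
    return {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "question": question,
        "top_chunks": [
            {"rank": rank, "id": chunk_id, "score": round(float(score), 4)}
            for rank, (chunk_id, score) in enumerate(hits[:top_n], start=1)
        ],
        "answer": answer,
    }


def append_log(log_path, record):
    """Append the record as a single JSON line (JSONL)."""
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Keeping the record builder separate from the file write makes it easy to reuse the same record in tests or in the evaluation report.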
Start With a 30-Question Golden Dataset
The word "evaluation" conjures images of benchmark suites with hundreds of items. For an indie-scale RAG, 30 questions is genuinely enough to start: what matters is the mix of question types, not the count. I use four categories:
Factual (12 questions): the answer sits verbatim in one chunk — "How much is the ad-free version?"
Procedural (8 questions): multi-step explanations — "How do I restore my purchase after switching phones?"
Negative/constraint (5 questions): the correct answer states that something is not supported — "Does the app support lock-screen widgets?"
Out-of-scope (5 questions): nothing in the knowledge base answers this, and the correct behavior is to say so
The dataset is JSONL, one question per line, which keeps per-type aggregation trivial later:
    {"id": "q001", "type": "factual", "question": "How much is the ad-free version?", "relevant_chunks": ["pricing-adfree-001"], "answerable": true}
    {"id": "q013", "type": "procedural", "question": "How do I restore purchases after moving to a new phone?", "relevant_chunks": ["restore-purchase-001", "restore-purchase-002"], "answerable": true}
    {"id": "q021", "type": "negative", "question": "Does the app support lock screen widgets?", "relevant_chunks": ["feature-scope-003"], "answerable": true}
    {"id": "q026", "type": "out_of_scope", "question": "Can I import wallpapers from other apps?", "relevant_chunks": [], "answerable": false}
Harvest Questions From Real User Phrasing
The single biggest factor in this dataset's value is not inventing the questions yourself. My primary sources were App Store and Google Play review replies and support emails. Real users phrase things in ways the author of the documentation never would. "The screen goes white" also arrived as "it turns blank," "the display disappears," and "it freezes and shows nothing." Those variations are exactly the paraphrase attacks your retrieval needs to survive, and they cost nothing to collect — they are already in your inbox.
Give Correct Chunks Stable IDs
The IDs in relevant_chunks must survive index rebuilds. If you label answers with sequential chunk numbers, every change to your splitting configuration silently invalidates the whole dataset. I use a string built from the document slug plus a heading slug (for example restore-purchase-001), so even when chunk boundaries move, labels can be re-derived from the document-and-heading mapping.
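One way to derive such IDs is sketched below. The `{document}-{heading}-{NNN}` layout is an assumption inferred from the example IDs in this article, and `slugify` and `chunk_id` are hypothetical helper names:

```python
import re
import unicodedata


def slugify(text):
    """NFKC-normalize, lowercase, and collapse non-alphanumeric runs to hyphens."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


def chunk_id(doc_title, heading, index):
    """Build an ID such as 'restore-purchase-001' from document, heading, and position."""
    doc, head = slugify(doc_title), slugify(heading)
    if not doc or not head:
        # Non-ASCII titles slugify to an empty string in this sketch.
        raise ValueError("empty slug; supply an explicit ASCII slug for this title")
    return f"{doc}-{head}-{index:03d}"
```

Because the ID depends only on the document, the heading, and the position under that heading, changing the chunk size leaves most labels intact. Titles written entirely in Japanese would need a hand-maintained slug map, which this sketch does not include.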
Always Include Out-of-Scope Questions
RAG systems lean structurally toward answering something. Failures on questions that should not be answered barely move an average score, which is why they go unnoticed. Five out-of-scope questions act as a small but reliable hallucination detector, especially combined with the judge rubric described below.
Grow the Dataset a Few Questions per Month
Thirty questions is a starting point, not a destination. I add two or three per month, picking newly arrived support questions whose phrasing looks likely to defeat retrieval. The dataset lives in git, so score changes can be matched against dataset commits — which makes it obvious when a dip means "the dataset got harder," not "the RAG got worse." Three months in, mine sits at 42 questions and is noticeably more trustworthy as a detector.
Measure Retrieval on Its Own — recall@k, MRR, nDCG
Three metrics cover retrieval:
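recall@k: 1 if any correct chunk appears in the top k, else 0. Strictly speaking this is a hit rate, since a single hit earns full credit even when a question has several correct chunks. The first number to watch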
MRR: the reciprocal rank of the first correct hit — 1.0 for first place, 0.33 for third. With local LLMs and tight context windows, whether the right chunk is near the top directly affects generation quality
nDCG@k: a smoother measure of ranking quality, useful when procedural questions have several correct chunks
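As a worked example under the binary-relevance definitions above (the ranking is invented for illustration), suppose a procedural question has two correct chunks and they land at ranks 2 and 4 of the top five:

```latex
\begin{align*}
\text{recall@5} &= 1 \quad \text{(at least one correct chunk in the top 5)} \\
\text{RR} &= \frac{1}{2} = 0.5 \quad \text{(first hit at rank 2)} \\
\text{DCG@5} &= \frac{1}{\log_2 3} + \frac{1}{\log_2 5} \approx 0.631 + 0.431 = 1.062 \\
\text{IDCG@5} &= \frac{1}{\log_2 2} + \frac{1}{\log_2 3} \approx 1 + 0.631 = 1.631 \\
\text{nDCG@5} &= \frac{\text{DCG@5}}{\text{IDCG@5}} \approx 0.651
\end{align*}
```

The same query scores a perfect recall@5 but only 0.5 on reciprocal rank and about 0.65 on nDCG, which is why the three numbers are worth reading together.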
The implementation needs nothing beyond the standard library. Injecting the search function keeps the evaluator reusable when you swap vector stores:
    import json
    import math
    from pathlib import Path


    def load_golden(path):
        """Load the golden dataset from a JSONL file."""
        items = []
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            if line.strip():
                items.append(json.loads(line))
        return items


    def recall_at_k(retrieved, relevant, k):
        """Return 1.0 if any relevant chunk appears in the top k results."""
        return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0


    def mrr(retrieved, relevant):
        """Return the reciprocal rank of the first relevant hit, or 0.0."""
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                return 1.0 / rank
        return 0.0


    def ndcg_at_k(retrieved, relevant, k):
        """Binary nDCG: relevant chunks count as gain 1, others as 0."""
        dcg = 0.0
        for rank, chunk_id in enumerate(retrieved[:k], start=1):
            if chunk_id in relevant:
                dcg += 1.0 / math.log2(rank + 1)
        ideal_hits = min(len(relevant), k)
        idcg = sum(1.0 / math.log2(r + 1) for r in range(1, ideal_hits + 1))
        return dcg / idcg if idcg > 0 else 0.0


    def evaluate(golden_path, search_fn, k=5):
        """Average the three metrics; search_fn(question) returns chunk IDs."""
        items = [g for g in load_golden(golden_path) if g["answerable"]]
        scores = {"recall": [], "mrr": [], "ndcg": []}
        for g in items:
            retrieved = search_fn(g["question"])
            scores["recall"].append(recall_at_k(retrieved, g["relevant_chunks"], k))
            scores["mrr"].append(mrr(retrieved, g["relevant_chunks"]))
            scores["ndcg"].append(ndcg_at_k(retrieved, g["relevant_chunks"], k))
        return {name: round(sum(v) / len(v), 3) for name, v in scores.items()}
Pass your own search implementation as search_fn and the output looks like this:
{'recall': 0.86, 'mrr': 0.79, 'ndcg': 0.81}
Decisions That Took Weeks Now Take an Evening
On my setup — 800-character chunks, 100-character overlap, a multilingual embedding model — the starting point was recall@5 = 0.63. Adding NFKC text normalization to queries plus a small synonym expansion for katakana variants lifted it to 0.74. Aligning chunk boundaries to headings instead of fixed lengths brought it to 0.86. Meanwhile, the change I expected the most from — swapping in a larger embedding model — bought +0.02 and wasn't worth the load time. Having the number turn "I think this helps" into "this helped by exactly this much" within an evening changed how I spend optimization time.
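The query-side preprocessing can be sketched roughly as follows. `unicodedata.normalize` is standard library; the `VARIANTS` table and both function names are hypothetical, and appending the canonical spelling is just one simple way to implement synonym expansion:

```python
import unicodedata

# Hypothetical variant table: spelling variant -> canonical form used in the docs.
VARIANTS = {
    "ウイジェット": "ウィジェット",
    "lockscreen": "lock screen",
}


def normalize_query(query):
    """NFKC-normalize, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", query).lower()
    return " ".join(text.split())


def expand_query(query, variants=VARIANTS):
    """Normalize, then append the canonical form of any variant found."""
    text = normalize_query(query)
    extras = [
        canon
        for variant, canon in variants.items()
        if variant in text and canon not in text
    ]
    return " ".join([text, *extras])
```

NFKC folds full-width letters and digits to ASCII and half-width katakana to full-width, so the two spellings of the same query produce the same search string.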
Match k to Your Real Context Budget
Before defaulting to recall@5 or recall@10 by convention, check how many chunks your production pipeline actually passes to generation. If the evaluation k and the production context count disagree, you can score "hit" in evaluation while the correct chunk never reaches the prompt in production. I pass five chunks in production, so recall@5 is my primary metric with recall@10 tracked as a reference. A wide gap between the two means correct answers are sinking to ranks six through ten — a signal that reranking could pay off. And if dense retrieval alone plateaus, hybrid setups with a lexical method like BM25 are worth testing — but only after the plateau is confirmed by numbers, not vibes.
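The gap between the production k and a wider reference k can be tracked with a few lines. This is a sketch: `recall_gap` is a hypothetical helper, and it assumes the search function returns at least `k_ref` results per question:

```python
def hit_at_k(retrieved, relevant, k):
    """1.0 if any relevant chunk is in the top k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0


def recall_gap(runs, k_prod=5, k_ref=10):
    """Compare hit rates at the production k and a wider reference k.

    runs is a list of (retrieved_ids, relevant_ids) pairs for answerable questions.
    """
    if not runs:
        raise ValueError("no runs to score")
    at_prod = sum(hit_at_k(r, rel, k_prod) for r, rel in runs) / len(runs)
    at_ref = sum(hit_at_k(r, rel, k_ref) for r, rel in runs) / len(runs)
    return {
        f"recall@{k_prod}": round(at_prod, 3),
        f"recall@{k_ref}": round(at_ref, 3),
        "gap": round(at_ref - at_prod, 3),
    }
```

A large `gap` value is the numeric form of "correct chunks are sinking to ranks six through ten".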
Judge the Generation Step With Gemma 4
Even with retrieval landing, generation can ignore its evidence. For the generation layer I use an LLM-as-a-judge setup scoring two axes on a 0–2 scale:
Faithfulness: is every claim in the answer grounded in the provided chunks? Ungrounded additions cost points
Relevance: does the answer address the question head-on? Correct-but-off-target content costs points
    import json
    import requests

    JUDGE_PROMPT = """You are a judge for RAG answers.
    Given question / context / answer, output only this JSON:
    {"faithfulness": 0-2, "relevance": 0-2, "evidence": "one sentence quoted from context", "note": "issue in 10 words or fewer"}

    Scoring rules:
    - faithfulness 2: every claim is grounded in context / 1: some claims ungrounded / 0: main claim ungrounded
    - relevance 2: answers the question directly / 1: partially / 0: off-target
    - If context contains no answer and the reply says it cannot answer, score 2 on both axes
    - For procedural answers, subtract 1 from faithfulness if step order contradicts context
    """


    def judge_answer(question, context, answer, model="gemma-4-27b-it"):
        """Score faithfulness and relevance with a local Gemma 4 judge."""
        payload = {
            "model": model,
            "messages": [
                {"role": "system", "content": JUDGE_PROMPT},
                {"role": "user", "content": json.dumps(
                    {"question": question, "context": context, "answer": answer},
                    ensure_ascii=False,
                )},
            ],
            "format": "json",
            "stream": False,
            "options": {"temperature": 0},
        }
        res = requests.post("http://localhost:11434/api/chat", json=payload, timeout=120)
        res.raise_for_status()
        return json.loads(res.json()["message"]["content"])
Defusing Self-Preference Bias
If the same model generates and judges, it tends to bless its own habits. I generate with the 9B Gemma 4 and judge with the 27B, with the judge pinned to temperature 0. I also require the judge to quote the exact context sentence it used as evidence. The quote lets a human verify the judgment itself in seconds, which keeps you from trusting the judge blindly.
A quiet advantage of the 30-question scale: you can cross-check every judgment by hand. On my first pass, the judge agreed with my own assessment 87% of the time. The disagreements clustered around step-order mistakes the judge was letting slide, so I added one line about ordering to the rubric and agreement rose to 93%. The judge is itself under evaluation — that mindset keeps the loop honest.
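The hand cross-check reduces to a small comparison. In this sketch, `agreement_report` is a hypothetical helper and the score dictionaries are assumed to be keyed by question ID:

```python
def agreement_report(judge, human, axis="faithfulness"):
    """Compare judge and human scores on one axis.

    judge and human map question IDs to dicts like {"faithfulness": 2, "relevance": 1}.
    Returns (agreement_rate, list of (question_id, judge_score, human_score)).
    """
    shared = sorted(judge.keys() & human.keys())
    if not shared:
        raise ValueError("no overlapping question IDs")
    disagreements = [
        (qid, judge[qid][axis], human[qid][axis])
        for qid in shared
        if judge[qid][axis] != human[qid][axis]
    ]
    rate = 1 - len(disagreements) / len(shared)
    return round(rate, 3), disagreements
```

The disagreement list matters more than the rate: reading it is how a cluster such as step-order mistakes becomes visible.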
Judging cost is realistic too. On an M2 Max with 64GB of memory running a quantized 27B model, a full 30-question judging pass takes about four minutes. No cloud API means re-runs are free, which makes the nightly batch below an easy decision.
Decide Chunking and Preprocessing With A/B Numbers
Once the harness exists, chunking debates turn from taste into experiments. The axes worth sweeping condense to four:
Chunk size: 400 / 800 / 1200 characters
Split unit: fixed length versus heading- or paragraph-based
Overlap: 0 / 100 / 200 characters
Preprocessing: NFKC normalization, whitespace and decoration cleanup, a small synonym dictionary for orthographic variants
A full grid explodes quickly, so I sweep one axis at a time in the order split unit → preprocessing → size. The runner is short:
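A minimal sketch of such a runner, assuming an `evaluate_config(config)` callable that rebuilds the index for a configuration and returns the harness metrics. That callable, the config keys, and the `fake_evaluate` stand-in with its invented numbers are all hypothetical:

```python
def sweep_one_axis_at_a_time(baseline, axes, evaluate_config, metric="recall"):
    """Greedy sweep: vary one axis, keep the best value, move to the next axis.

    axes is an ordered list of (name, candidate_values) pairs.
    evaluate_config(config) must return a dict of metrics for that config.
    """
    best = dict(baseline)
    best_score = evaluate_config(best)[metric]
    history = [(dict(best), best_score)]
    for axis, values in axes:
        for value in values:
            if value == best[axis]:
                continue  # this value is already part of the measured best config
            candidate = {**best, axis: value}
            score = evaluate_config(candidate)[metric]
            history.append((candidate, score))
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score, history


def fake_evaluate(config):
    """Stand-in for an index rebuild plus harness run (numbers invented)."""
    score = 0.60
    score += 0.12 if config["split"] == "heading" else 0.0
    score += 0.10 if config["normalize"] else 0.0
    score += {400: 0.00, 800: 0.02, 1200: 0.01}[config["size"]]
    return {"recall": round(score, 2)}


BASELINE = {"split": "fixed", "normalize": False, "size": 800}
AXES = [
    ("split", ["fixed", "heading"]),
    ("normalize", [False, True]),
    ("size", [400, 800, 1200]),
]
```

Calling `sweep_one_axis_at_a_time(BASELINE, AXES, fake_evaluate)` runs five evaluations instead of the twelve a full grid over these values would need. The trade-off is that a greedy sweep can miss interactions between axes.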
In my measurements, the wins ranked: heading-based splitting (recall +0.12), query normalization (+0.11), synonym expansion (+0.07). Chunk size alone moved things by only ±0.03 across 400/800/1200. General advice tends to cast chunk size as the protagonist, but for Japanese support documents, handling orthographic variation paid far better. I would settle normalization and split unit first and leave chunk-size debates for last.
One caution: changes that raise recall can lower faithfulness. Finer chunks are easier to hit but strip away surrounding context, and generation misreads fragments more often. Keeping retrieval metrics and judge scores in the same report makes this tug-of-war visible early.
Aggregate Per-Type Scores Into One Report
With three retrieval metrics and two judged axes, attention starts to scatter. I eventually settled on a single text report — a question-type × metric matrix — that I glance at each morning. The aggregation is tiny:
    from collections import defaultdict


    def report_by_type(golden_items, results):
        """Print averaged metrics grouped by question type."""
        by_type = defaultdict(list)
        for item, res in zip(golden_items, results):
            by_type[item["type"]].append(res)
        for qtype, rows in sorted(by_type.items()):
            n = len(rows)
            avg = {k: round(sum(r[k] for r in rows) / n, 2) for k in rows[0]}
            print(f"{qtype:>12} (n={n}): {avg}")
The first thing I read is not the overall average but the per-type minimum. Negative and out-of-scope types have few questions, so one failure shows up as a large fraction — when those dip, a hallucination-style incident is usually near, and I treat it as an early alarm. When only the factual type drops, I suspect mechanical causes first: a botched index rebuild or drifted chunk IDs. The failure pattern maps to the cause family often enough that triage has become much faster.
Freeze It as a Regression Test and Run It Nightly
A number measured once is trivia; measured after every change, it becomes protection. I wrapped the harness in a pytest threshold gate and run it on every index rebuild, embedding update, and preprocessing change:
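A gate of this kind can be as small as the sketch below. It assumes floors set about 0.05 under the scores reported earlier in this article, and `current_metrics()` is a placeholder to be replaced with a real call to the harness. pytest collects the `test_` function without any pytest import, provided the file name starts with `test_`:

```python
# Floors sit roughly 0.05 below the last measured values (0.86 / 0.79 / 0.81).
FLOORS = {"recall": 0.81, "mrr": 0.74, "ndcg": 0.76}


def floors_from(metrics, margin=0.05):
    """Derive floors from measured metrics; use this when ratcheting upward."""
    return {name: round(value - margin, 2) for name, value in metrics.items()}


def floor_violations(metrics, floors=FLOORS):
    """Return {metric: (measured, floor)} for every metric below its floor."""
    return {
        name: (metrics.get(name, 0.0), floor)
        for name, floor in floors.items()
        if metrics.get(name, 0.0) < floor
    }


def current_metrics():
    """Placeholder: replace with a call to the harness's evaluate() function."""
    return {"recall": 0.86, "mrr": 0.79, "ndcg": 0.81}


def test_retrieval_quality():
    violations = floor_violations(current_metrics())
    assert not violations, f"metrics below floor: {violations}"
```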
Set thresholds at aspirational values and you get a permanently red dashboard nobody reads. I set each floor about 0.05 below the current measured value, then ratchet it upward once an improvement proves stable. The test exists to detect decay, not to declare ambition.
Wiring It Into Antigravity
I registered the test in Antigravity's tasks.json so it runs from the editor with one key, and at night a Background Agent handles execution and record-keeping. The instruction I give the agent deliberately excludes judgment:
Run the evaluation against golden/holdout.jsonl daily at 2:00.
1. Run: pytest tests/test_retrieval_quality.py -q
2. Append the three metrics to eval_logs/YYYY-MM-DD.json
3. If any metric fell 0.05 or more from the previous run, mark the summary "needs review"
Do not judge or fix anything — execute and record only.
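If you would rather have step 3 computed deterministically than left to the agent's reading of the logs, the comparison is a few lines. This is a sketch with hypothetical function names, and the record layout is an assumption:

```python
def regressions(previous, current, threshold=0.05):
    """Return metrics that fell by `threshold` or more since the previous run."""
    drops = {}
    for name, old in previous.items():
        # Round first: 0.86 - 0.81 is 0.0499999... in floating point.
        drop = round(old - current.get(name, 0.0), 6)
        if drop >= threshold:
            drops[name] = drop
    return drops


def daily_record(date_str, metrics, previous=None):
    """Build the record for one nightly run, flagging large drops for review."""
    drops = regressions(previous, metrics) if previous else {}
    return {
        "date": date_str,
        "metrics": metrics,
        "summary": "needs review" if drops else "ok",
        "drops": drops,
    }
```

The rounding step matters: without it, a drop of exactly 0.05 can slip under the threshold because of floating-point representation.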
Deciding what to change when a floor breaks is human work. When I experimented with letting the agent "improve" things, the improvements drifted toward overfitting the golden set — limiting its role to execution and recording has been far more stable.
Pitfalls That Showed Up in Real Operation
Three months of running this harness surfaced traps I stepped into and a few I narrowly avoided:
Overfitting by reusing the golden set: iterate against the same 30 questions and you build a RAG specialized for those 30. I split mine into 20 tuning questions and a 10-question holdout I only run monthly. An improvement that looked like +0.10 on the tuning set once measured +0.04 on holdout
Watching only the average: the negative/constraint type can fail completely while strong factual scores pull the mean upward. In my first month I had exactly that — average improving, four of five negative questions failing. Always print per-type aggregates
Overfeeding the judge: hand the judge the entire knowledge base and it scores answers as "correct" even when they drew on material retrieval never returned. Pass the judge only the chunks retrieved for that question
Embedding version drift: when the index side and the query side run different embedding versions, nothing errors — recall just quietly collapses. I write the model name and version into the index metadata and fail the evaluation hard on mismatch
Penalizing correct refusals: the judge initially docked points when out-of-scope questions were answered with "I can't answer that from the available information" — which is the right behavior. One explicit rubric line fixed it: refusals on unanswerable questions score full marks
Forgetting to pin evaluation reproducibility: with the judge at temperature 0 but the generator still sampling freely, scores wobbled run to run. Evaluation batches now pin the generator to temperature 0 as well. I also once spent a day evaluating stale embedding-cache results that ignored my changes — caches are now keyed by a hash of the configuration
Each fix is one or two lines, but left unnoticed, any of them quietly destroys your trust in the numbers. The evaluation rig itself belongs on the inspection list.
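Two of those guards, the embedding version check and the configuration-keyed cache, might look like the sketch below. The metadata field names (`embedding_model`, `embedding_version`) and both function names are assumptions for illustration:

```python
import hashlib
import json


def config_hash(config):
    """Stable short hash of a JSON-serializable config; key order does not matter."""
    payload = json.dumps(config, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def check_embedding_match(index_meta, query_model, query_version):
    """Raise if the index was built with a different embedding model or version."""
    built_with = (index_meta.get("embedding_model"), index_meta.get("embedding_version"))
    if built_with != (query_model, query_version):
        raise RuntimeError(
            f"embedding mismatch: index={built_with}, "
            f"query={(query_model, query_version)}"
        )
```

Using `config_hash(config)` as part of the cache key means that any change to chunk size, overlap, or preprocessing invalidates cached embeddings automatically.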
The Next Step — Write Down 30 Questions
Every implementation above fits in a few dozen lines, but the order matters more than the code. Before writing any of it, spend today pulling 30 questions out of your support log or operations notes. The act of writing them down doubles as an audit of where your knowledge base is thin. The harness can wait until tomorrow.
An evaluation rig sounds like overkill for a solo project. I thought so too. But even a 30-question set changed the quality of my decisions out of proportion to its size. I have been building apps independently since 2014, across a portfolio that has passed 50 million downloads, and the habit that has saved me most often is writing tests. RAG evaluation turned out to be the same investment in a new shape: small, quiet, and reliable. If you are growing a local RAG of your own, I hope this saves you a few of the detours it saved me.
Thank You for Reading
Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.