ANTIGRAVITY LAB
AI Tools / 2026-06-12 / Advanced

Cutting Down 'Plausible but Wrong' RAG Answers — A Retrieval Evaluation Harness for Gemma 4 and Antigravity

Replace gut feeling with recall@5, MRR and faithfulness scores — a 30-question golden dataset and a small Python harness for evaluating a local Gemma 4 RAG stack.

In the spring of 2026, I was wiring up a RAG system over the support knowledge base for the wallpaper apps I run as an indie developer. A user asked why the Android version showed a blank white screen after switching themes. The retrieval step confidently handed the model a chunk about restoring purchases on iOS. The generated answer read beautifully — polite, structured, plausible. It also had nothing to do with the question.

That is what makes "plausible but wrong" RAG answers so dangerous: they fail silently. When retrieval misses, the language model doesn't throw an error. It takes whatever chunks it received and weaves them into fluent prose, and the fluency hides the failure. When I later audited my error logs, roughly 70% of the bad answers traced back to retrieval missing the right document entirely — not to the generation step I had been busy tuning.

Instead of adopting a heavyweight evaluation platform, I built a small harness: a 30-question golden dataset and a few dozen lines of Python. My recall@5 went from 0.63 to 0.86 through preprocessing and chunking changes I could finally verify, and, more importantly, I gained the habit of checking numbers after every change. This article walks through that harness, end to end, on a local Gemma 4 stack. If you haven't built the pipeline itself yet, the article "Building a RAG Pipeline with Gemma 4 and ChromaDB: Implementation" covers that ground; here we focus on measuring and improving a RAG that already runs.

Why Fluent Generation Hides Retrieval Failures

RAG quality work goes in circles unless you separate two questions: did retrieval fetch the right material, and did generation stay faithful to it? My own first mistake was exactly this. I kept polishing prompts because the answers looked like generation problems. Only when I lined up the retrieval logs did I see that the correct chunk wasn't in the top five at all for most failures. Prompt work is the final polish you apply after retrieval is verified — do it earlier and you are tuning noise.

The two layers fail in characteristically different ways:

  • Retrieval failures: weak handling of paraphrases, orthographic variation (in Japanese support text, full-width vs. half-width characters and katakana variants; in English, product-name synonyms and abbreviations), and out-of-scope questions that get matched to the "closest" chunk anyway
  • Generation failures: dropping conditions or caveats during summarization even though the right chunk was provided, or blending two chunks into a procedure that never existed
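
The split between the two layers can be sketched as a small triage function. This is a minimal illustration, not the article's own harness: the name `failed_layer` and the `answer_correct` flag (a human or judge verdict on the final answer) are assumptions, and it only makes sense for questions that have at least one labeled relevant chunk.

```python
def failed_layer(retrieved_ids, relevant_ids, answer_correct, k=5):
    """Classify one answerable query as 'ok', 'retrieval', or 'generation'.

    retrieved_ids:  chunk IDs in ranked order, best first
    relevant_ids:   chunk IDs labeled as correct for this question
    answer_correct: verdict on the final answer (hypothetical input,
                    supplied by a human reviewer or a judge step)
    """
    if answer_correct:
        return "ok"
    relevant = set(relevant_ids)
    # If no labeled chunk reached the top k, generation never had a chance.
    hit = any(cid in relevant for cid in retrieved_ids[:k])
    return "generation" if hit else "retrieval"


# The opening anecdote: the model was handed the wrong chunk.
# "android-theme-blank-001" is a made-up ID used only for illustration.
layer = failed_layer(
    retrieved_ids=["restore-purchase-001"],
    relevant_ids=["android-theme-blank-001"],
    answer_correct=False,
)
```

A wrong answer with the right chunk in the top five points at the prompt or the model; a wrong answer without it points at preprocessing, chunking, or embeddings.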

Because the fixes are completely different, the evaluation needs two layers as well: mechanical ranking metrics for retrieval, and a judged faithfulness check for generation. That separation is the backbone of everything below.
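
For the retrieval layer, the ranking metrics are short enough to write by hand. The sketch below assumes the common definitions: recall@k is the fraction of labeled relevant chunks that appear in the top k results, and MRR is the mean over questions of 1/rank of the first relevant chunk. The function names and the `faq-...` IDs are illustrative, and the harness behind the numbers in this article may define the metrics slightly differently.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of labeled relevant chunks found in the top k results."""
    relevant = set(relevant)
    if not relevant:
        # Out-of-scope questions have no relevant chunks; score them separately.
        raise ValueError("recall@k is undefined when no chunk is relevant")
    return len(relevant & set(retrieved[:k])) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    """Plain average; MRR is mean() over per-question reciprocal ranks."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical ranking for the restore-purchase question.
retrieved = ["faq-012", "restore-purchase-002", "pricing-adfree-001",
             "restore-purchase-001", "faq-003"]
relevant = ["restore-purchase-001", "restore-purchase-002"]

r5 = recall_at_k(retrieved, relevant, k=5)   # both labeled chunks are in the top 5
r3 = recall_at_k(retrieved, relevant, k=3)   # only one of the two is in the top 3
rr = reciprocal_rank(retrieved, relevant)    # first relevant chunk sits at rank 2
```

Averaging `recall_at_k` and `reciprocal_rank` over the answerable questions gives the two headline retrieval numbers.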

One practical note before any framework: just logging retrieval in a human-readable form is a real step forward. I keep one JSON line per query: the question, the top-five chunk IDs with scores, and the final answer. When a wrong answer gets reported, identifying which layer failed takes minutes instead of an afternoon. The harness in this article is an extension of that log, not a replacement for it. For the broader pipeline patterns around vector search, "Building a RAG Pipeline with Antigravity — Unlock Your Company's Knowledge with Vector Search and LLMs" is a useful companion read.
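
A log in that shape needs very little code. The following is a sketch under assumptions: the field names (`question`, `top_k`, `answer`), the helper names, and the log path are all made up for illustration, and `hits` is assumed to be a ranked list of `(chunk_id, score)` pairs from whatever vector store is in use.

```python
import json


def format_log_line(question, hits, answer, k=5):
    """Build one JSON line: question, top-k chunk IDs with scores, final answer."""
    record = {
        "question": question,
        "top_k": [
            {"chunk_id": chunk_id, "score": round(float(score), 4)}
            for chunk_id, score in hits[:k]
        ],
        "answer": answer,
    }
    # ensure_ascii=False keeps Japanese text readable in the log file.
    return json.dumps(record, ensure_ascii=False)


def append_log(path, question, hits, answer):
    """Append the record to a JSONL file, one query per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(format_log_line(question, hits, answer) + "\n")


line = format_log_line(
    "Why is the screen white after switching themes?",
    [("restore-purchase-001", 0.81234), ("pricing-adfree-001", 0.64)],
    "To restore your purchase on iOS, open Settings.",
)
```

Calling `append_log("rag_queries.jsonl", ...)` after every query is enough to make later triage a matter of reading lines rather than reproducing the failure.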

Start With a 30-Question Golden Dataset

The word "evaluation" conjures images of benchmark suites with hundreds of items. For an indie-scale RAG, 30 questions is genuinely enough to start; what matters is the mix of question types, not the count. I use four categories:

  1. Factual (12 questions): the answer sits verbatim in one chunk — "How much is the ad-free version?"
  2. Procedural (8 questions): multi-step explanations — "How do I restore my purchase after switching phones?"
  3. Negative/constraint (5 questions): the correct answer states that something is not supported — "Does the app support lock-screen widgets?"
  4. Out-of-scope (5 questions): nothing in the knowledge base answers this, and the correct behavior is to say so

The dataset is JSONL, one question per line, which keeps per-type aggregation trivial later:

{"id": "q001", "type": "factual", "question": "How much is the ad-free version?", "relevant_chunks": ["pricing-adfree-001"], "answerable": true}
{"id": "q013", "type": "procedural", "question": "How do I restore purchases after moving to a new phone?", "relevant_chunks": ["restore-purchase-001", "restore-purchase-002"], "answerable": true}
{"id": "q021", "type": "negative", "question": "Does the app support lock screen widgets?", "relevant_chunks": ["feature-scope-003"], "answerable": true}
{"id": "q026", "type": "out_of_scope", "question": "Can I import wallpapers from other apps?", "relevant_chunks": [], "answerable": false}
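
A loader for this format can double as a sanity check. The sketch below assumes the five fields shown above are all required and that `answerable` should agree with whether `relevant_chunks` is empty; that consistency rule is inferred from the sample rows, not stated as a requirement. The names `parse_golden` and `type_counts` are illustrative.

```python
import json
from collections import Counter

REQUIRED_KEYS = {"id", "type", "question", "relevant_chunks", "answerable"}


def parse_golden(lines):
    """Parse JSONL lines into dicts, rejecting malformed or inconsistent rows."""
    items = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between entries
        item = json.loads(line)
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"line {number}: missing {sorted(missing)}")
        # Assumed rule: answerable questions name a chunk, unanswerable ones do not.
        if item["answerable"] != bool(item["relevant_chunks"]):
            raise ValueError(f"line {number}: answerable disagrees with relevant_chunks")
        items.append(item)
    return items


def type_counts(items):
    """Number of questions per type, for checking the category mix."""
    return Counter(item["type"] for item in items)


sample = [
    '{"id": "q001", "type": "factual", "question": "How much is the ad-free version?", '
    '"relevant_chunks": ["pricing-adfree-001"], "answerable": true}',
    '{"id": "q026", "type": "out_of_scope", "question": "Can I import wallpapers from other apps?", '
    '"relevant_chunks": [], "answerable": false}',
]
counts = type_counts(parse_golden(sample))
```

Because an open file iterates line by line, `parse_golden(open("golden.jsonl", encoding="utf-8"))` works unchanged; the file name here is hypothetical.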

Harvest Questions From Real User Phrasing

The single biggest factor in this dataset's value is that you do not invent the questions yourself. My primary sources were App Store and Google Play review replies and support emails. Real users phrase things in ways the author of the documentation never would. "The screen goes white" also arrived as "it turns blank," "the display disappears," and "it freezes and shows nothing." Those variations are exactly the paraphrase attacks your retrieval needs to survive, and they cost nothing to collect: they are already in your inbox.
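
One way to turn harvested phrasings into dataset rows is to give every variant its own entry with the same labels. This is only a sketch: the helper name, the ID numbering, the question type, and the chunk ID `android-theme-blank-001` are all hypothetical.

```python
import json


def paraphrase_entries(phrasings, relevant_chunks, qtype, start_number):
    """Create one golden-dataset entry per real-user phrasing of the same problem."""
    entries = []
    for offset, text in enumerate(phrasings):
        entries.append({
            "id": f"q{start_number + offset:03d}",
            "type": qtype,
            "question": text,
            # Every variant should retrieve the same chunk(s).
            "relevant_chunks": list(relevant_chunks),
            "answerable": bool(relevant_chunks),
        })
    return entries


entries = paraphrase_entries(
    ["The screen goes white", "It turns blank", "The display disappears"],
    relevant_chunks=["android-theme-blank-001"],
    qtype="factual",
    start_number=31,
)
jsonl = "\n".join(json.dumps(e, ensure_ascii=False) for e in entries)
```

Keeping the variants as separate rows means a retrieval change that fixes one phrasing and breaks another shows up in the per-question scores.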

Give Correct Chunks Stable IDs

The IDs in relevant_chunks must survive index rebuilds. If you label answers with sequential chunk numbers, every change to your splitting configuration silently invalidates the whole dataset. I use a string built from the document slug plus a heading slug (for example restore-purchase-001), so even when chunk boundaries move, labels can be re-derived from the document-and-heading mapping.
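
A stable ID of this kind can be derived rather than typed by hand. The exact format is not spelled out above, so the sketch assumes one: document slug, heading slug, and a zero-padded position within that heading. The function names are illustrative.

```python
import re


def slugify(text):
    """Lowercase and collapse every run of non-alphanumeric characters to '-'."""
    # ASCII-only on purpose: a Japanese heading would collapse to an empty
    # string here and needs a hand-assigned slug instead.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def chunk_id(doc_slug, heading, index):
    """Build an ID that depends on document and heading, not on global chunk order."""
    return f"{slugify(doc_slug)}-{slugify(heading)}-{index:03d}"


example = chunk_id("support", "Restore purchase", 1)
# example == "support-restore-purchase-001"
```

The index only counts chunks under one heading, so re-splitting an unrelated document leaves every other ID untouched.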

Always Include Out-of-Scope Questions

RAG systems lean structurally toward answering something. Failures on questions that should not be answered barely move an average score, which is why they go unnoticed. Five out-of-scope questions act as a small but reliable hallucination detector, especially combined with the judge rubric described below.
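
The arithmetic behind "barely move an average score" is easy to see with per-type aggregation. The sketch below assumes each question has already been reduced to a pass/fail verdict; how that verdict is produced is outside this example, and the helper name is illustrative.

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (question_type, passed) pairs.

    Returns the pass rate per type plus an 'overall' entry
    (so no question type should itself be named 'overall').
    """
    totals = defaultdict(lambda: [0, 0])  # type -> [passed, total]
    for qtype, passed in results:
        totals[qtype][0] += int(passed)
        totals[qtype][1] += 1
    if not totals:
        raise ValueError("no results to aggregate")
    rates = {qtype: p / n for qtype, (p, n) in totals.items()}
    rates["overall"] = (sum(p for p, _ in totals.values())
                        / sum(n for _, n in totals.values()))
    return rates


# Illustrative worst case with the 12/8/5/5 mix: every answerable question
# passes, and every out-of-scope question gets a confident made-up answer.
results = ([("factual", True)] * 12 + [("procedural", True)] * 8
           + [("negative", True)] * 5 + [("out_of_scope", False)] * 5)
rates = pass_rates(results)
```

Here the overall rate is 25/30, about 0.83, while the out-of-scope rate is 0.0. Reported alone, the average looks healthy; broken out by type, the hallucination problem is impossible to miss.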

Grow the Dataset a Few Questions per Month

Thirty questions is a starting point, not a destination. I add two or three per month, picking newly arrived support questions whose phrasing looks likely to defeat retrieval. The dataset lives in git, so score changes can be matched against dataset commits — which makes it obvious when a dip means "the dataset got harder," not "the RAG got worse." Three months in, mine sits at 42 questions and is noticeably more trustworthy as a detector.
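
Separating "the dataset got harder" from "the RAG got worse" comes down to comparing like with like. A minimal sketch, assuming each run is stored as a mapping from question ID to a per-question score such as recall@5; the function name and the sample IDs are illustrative.

```python
def compare_runs(old_scores, new_scores):
    """Average scores on shared questions, added questions, and the full new set."""
    shared = sorted(old_scores.keys() & new_scores.keys())
    added = sorted(new_scores.keys() - old_scores.keys())

    def avg(scores, ids):
        return sum(scores[i] for i in ids) / len(ids) if ids else None

    return {
        "shared_old": avg(old_scores, shared),   # old run, questions both runs have
        "shared_new": avg(new_scores, shared),   # new run, same questions
        "added_new": avg(new_scores, added),     # new run, newly added questions
        "full_new": avg(new_scores, list(new_scores)),
    }


old = {"q001": 1.0, "q002": 1.0}
new = {"q001": 1.0, "q002": 1.0, "q031": 0.0}  # q031 was added this month
report = compare_runs(old, new)
```

In this example the full-set average drops to about 0.67, but the score on the shared questions is unchanged at 1.0, so the dip comes from the new question and not from a regression.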
