AI Tools · 2026-04-21 · Advanced

Prompts Are Assets: Building a Production-Grade Prompt Management Platform with Antigravity — Versioning, A/B Testing, and Quality Evaluation

A hands-on implementation guide for treating prompts as first-class code — with versioning, A/B testing, and automated quality evaluation. Design patterns and working code for running AI agents on Antigravity with safe, continuous prompt improvement.

After running AI agents on Antigravity in production for roughly half a year, the most painful incident I've faced was this: a small prompt tweak silently broke response quality for a specific customer's use case. The edit was a single character. The diff was two lines. But because I had no way to trace which prompt produced which response in the wild, it took three days of manual investigation to find the cause.

That experience taught me something I wish I'd learned earlier: prompts are not "config files" or "magic strings" — they are first-class assets that deserve the same rigor we give code: version control, tests, and observability. In this article I'll open up the prompt management platform I've been building on top of Antigravity, from design philosophy down to working code. Once this infrastructure is in place, your prompt iteration cycle becomes dramatically faster and safer at the same time.

Why Prompts Need to Be Managed Like Code

Treating prompts as plain strings is blazingly fast in the prototype phase. But once you're in production, four specific pains start showing up, every single time.

First, the lack of reproducibility. When a user says "this was working correctly last week but not today," you cannot debug without knowing which prompt produced which response. Since prompts get nudged daily, Git commits alone are not enough — you need per-response traceability that says "this output came from this prompt version."

Second, the absence of comparative validation. Whether a new prompt is actually better than the old one is not something your gut can judge. I once replaced a prompt with what I was certain was a more natural phrasing, only to discover two weeks after deploy that it had dropped accuracy by 10 percent. Without an A/B testing mechanism, it is genuinely common to degrade quality while believing you're improving it.

Third, the difficulty of cost observation. Prompt length, number of few-shot examples, and output format all directly drive API cost. Prompts grow over time, and it is not unusual to realize six months later that monthly spend has tripled. Catching this early requires recording "tokens consumed per prompt version," which is impossible if prompts live embedded inside application code.

Fourth, the challenge of safe rollback. When something breaks, the most natural reaction is "let's just go back to yesterday's version." But if prompts live inside your codebase, rollback means running a new deploy. If prompts live in a separate store, rollback takes seconds.

What we'll build here is the minimum viable platform that addresses all four pains at once. I've deliberately kept features modest — the goal is something you can introduce into your own project within a month, not an elaborate framework.

Architecture: Four Layers of Separated Responsibility

The core of the design is separation of concerns. Splitting into these four layers makes the system extensible later and dramatically easier to test.

Store layer — holds prompt versions and their metadata. YAML files or a database.
Router layer — decides which version to use for each request. Implements weighted routing to enable A/B testing.
Executor layer — sends the selected prompt to the LLM and returns the response with metrics (token count, latency).
Evaluator layer — computes quality scores for responses and persists them. Runs either in batch or real-time.

This separation matters because it lets you swap each layer independently. Migrating the store from YAML to Postgres requires no changes from the router onward. Changing the evaluator's algorithm from rule-based to LLM-as-a-Judge leaves the store and executor untouched. When your Antigravity agents call these layers, clean interfaces mean nobody gets confused about what goes where.
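One way to make those swap points concrete is to give each layer a small interface. The following is a minimal sketch, not the platform's actual code: the Protocol names (`Store`, `Router`, `Executor`, `Evaluator`), the `handle_request` function, and the in-memory stubs are all hypothetical and exist only to show that the caller depends on interfaces rather than on a particular backend.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Version:
    name: str
    template: str
    weight: float


class Store(Protocol):
    def versions(self, prompt_id: str) -> list[Version]: ...


class Router(Protocol):
    def select(self, versions: list[Version], user_id: str) -> Version: ...


class Executor(Protocol):
    def run(self, version: Version, variables: dict[str, str]) -> str: ...


class Evaluator(Protocol):
    def score(self, query: str, response: str) -> float: ...


def handle_request(store: Store, router: Router, executor: Executor,
                   evaluator: Evaluator, prompt_id: str, user_id: str,
                   variables: dict[str, str]) -> tuple[str, str, float]:
    """Wire the four layers together; returns (version, response, score)."""
    version = router.select(store.versions(prompt_id), user_id)
    response = executor.run(version, variables)
    score = evaluator.score(" ".join(variables.values()), response)
    return version.name, response, score


# Hypothetical in-memory stubs, handy for unit tests that must not call an LLM.
class DictStore:
    def __init__(self, data: dict[str, list[Version]]) -> None:
        self._data = data

    def versions(self, prompt_id: str) -> list[Version]:
        return self._data[prompt_id]


class FirstVersionRouter:
    def select(self, versions: list[Version], user_id: str) -> Version:
        return versions[0]


class EchoExecutor:
    def run(self, version: Version, variables: dict[str, str]) -> str:
        return version.template.format(**variables)


class LengthEvaluator:
    def score(self, query: str, response: str) -> float:
        return float(len(response))
```

Swapping `DictStore` for a Postgres-backed class, or `LengthEvaluator` for an LLM judge, would leave `handle_request` untouched, which is the property the four-layer split is meant to buy.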

I have a personal reason for believing in this design: I've screwed it up once before. My first attempt was "one file is enough," written as a single monolithic module. Three months later it was a giant ball of mud containing five evaluation metrics and two storage backends. As I covered in the Evaluation Framework Guide for Production AI Agents on Antigravity, evaluation logic in particular deserves its own module from day one.

Versioning: A YAML-Based Store Implementation

The first thing to implement is the store layer. For small-to-medium projects, a YAML file on disk is plenty, and putting it under Git gives you history, PR review, and rollback for free.

Here's an example schema, using prompts/summarize_article.yaml:

# prompts/summarize_article.yaml
# This file is the source of truth for the prompt
id: summarize_article
versions:
  - version: v1
    status: stable            # stable / canary / deprecated
    weight: 0.8               # probability of being chosen by the router
    created_at: 2026-03-15
    author: masaki
    template: |
      You are a professional editor. Summarize the article below in 3 sentences.
      Article: {article}
      Summary:
    model: gemini-2.5-pro
    max_tokens: 256
    temperature: 0.3
 
  - version: v2
    status: canary
    weight: 0.2
    created_at: 2026-04-20
    author: masaki
    description: "Changed 'professional editor' to 'seasoned writer' and added a guard against bulleted output"
    template: |
      You are a seasoned writer. Summarize the article below in 3 sentences.
      Always respond in plain prose; do not use bullet points or headings.
      Article: {article}
      Summary:
    model: gemini-2.5-pro
    max_tokens: 256
    temperature: 0.3

Three things matter in this schema. The status field explicitly declares "is this safe to use in production." The weight controls the probability with which the router picks each version. And the description forces you to record why the change was made. That last one sounds trivial, but it is the safety net that saves you from staring at v3 six months later and asking, "wait, why did I create this again?"

Next, the loader — Python with Pydantic for type safety. Antigravity's editor completion works beautifully against typed models.

# src/prompt_store.py
# Loads and validates the prompt store from a directory of YAML files
from __future__ import annotations
from pathlib import Path
from typing import Literal
import yaml
from pydantic import BaseModel, Field, field_validator
 
class PromptVersion(BaseModel):
    """Definition of a single prompt version"""
    version: str
    status: Literal["stable", "canary", "deprecated"]
    weight: float = Field(ge=0.0, le=1.0)  # probability in [0, 1]
    created_at: str
    author: str
    template: str
    model: str
    max_tokens: int = Field(gt=0)
    temperature: float = Field(ge=0.0, le=2.0)
    description: str | None = None
 
class PromptDefinition(BaseModel):
    """A logical prompt that holds multiple versions"""
    id: str
    versions: list[PromptVersion]
 
    @field_validator("versions")
    @classmethod
    def weights_must_sum_to_one(cls, versions: list[PromptVersion]) -> list[PromptVersion]:
        # If non-deprecated weights don't sum to 1.0, the router's math breaks
        active = [v for v in versions if v.status != "deprecated"]
        total = sum(v.weight for v in active)
        if abs(total - 1.0) > 1e-6:
            raise ValueError(
                f"active weight sum must be 1.0, got {total:.4f}. "
                f"Active versions: {[v.version for v in active]}"
            )
        return versions
 
class PromptStore:
    """Loads an entire YAML directory into memory"""
 
    def __init__(self, root: Path) -> None:
        self.root = root
        self._prompts: dict[str, PromptDefinition] = {}
        self._load()
 
    def _load(self) -> None:
        for yaml_file in self.root.glob("*.yaml"):
            try:
                data = yaml.safe_load(yaml_file.read_text(encoding="utf-8"))
                prompt = PromptDefinition.model_validate(data)
                self._prompts[prompt.id] = prompt
            except Exception as e:
                # One broken YAML should not take down the whole store.
                # In production, hook this into your monitoring.
                print(f"[prompt_store] failed to load {yaml_file.name}: {e}")
 
    def get(self, prompt_id: str) -> PromptDefinition:
        if prompt_id not in self._prompts:
            raise KeyError(f"prompt not found: {prompt_id}")
        return self._prompts[prompt_id]
 
    def list_ids(self) -> list[str]:
        return sorted(self._prompts.keys())
 
# Smoke test
if __name__ == "__main__":
    store = PromptStore(Path("prompts"))
    print("loaded:", store.list_ids())
    p = store.get("summarize_article")
    print(f"versions: {[v.version for v in p.versions]}")
    # Expected output:
    # loaded: ['summarize_article']
    # versions: ['v1', 'v2']

The crucial part here is the weights_must_sum_to_one validator. It is very easy to casually write weight: 0.5 and forget the rest; if the weights of active versions don't sum to 1.0, the router's probability math silently goes wrong. Failing at startup is the most merciful option, so we enforce it strictly with a Pydantic field validator.
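To see what "silently goes wrong" looks like, here is a small stdlib-only sketch. It reuses the idea of a cumulative-weight loop with a last-version fallback in simplified form; `pick` and `effective_shares` are hypothetical helper names, and the weights are chosen to be exact in binary so the arithmetic is easy to follow.

```python
from __future__ import annotations


def pick(weights: dict[str, float], r: float) -> str:
    """Cumulative-weight selection with a 'last version' fallback."""
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if r < cumulative:
            return name
    return list(weights)[-1]  # the fallback absorbs any leftover probability


def effective_shares(weights: dict[str, float], steps: int = 1000) -> dict[str, float]:
    """Sweep r over an even grid in [0, 1) and measure each version's real share."""
    counts = {name: 0 for name in weights}
    for i in range(steps):
        counts[pick(weights, i / steps)] += 1
    return {name: count / steps for name, count in counts.items()}


# Intended: v1 at 50%, v2 at 25%. The weights only sum to 0.75.
print(effective_shares({"v1": 0.5, "v2": 0.25}))
# {'v1': 0.5, 'v2': 0.5}
```

The missing 25% lands on whichever version happens to be last in the list, so a canary meant to receive a quarter of traffic quietly receives half. No exception is raised at request time, which is why the check belongs at load time.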

A/B Test Runner: Weighted Routing with Statistical Collection

Next, implement a router that uses weight to distribute real requests. The key property you want is that the same user is always routed to the same version, which keeps your experiment data consistent.

# src/prompt_router.py
# Router that selects a prompt version for each request
from __future__ import annotations
import hashlib
import random
from dataclasses import dataclass
from src.prompt_store import PromptStore, PromptVersion
 
@dataclass
class RoutingResult:
    version: PromptVersion
    routing_key: str  # log this for traceability
 
class PromptRouter:
    def __init__(self, store: PromptStore) -> None:
        self.store = store
 
    def select(self, prompt_id: str, user_id: str | None = None) -> RoutingResult:
        """
        When user_id is given, routing is 'sticky':
        the same user always receives the same version.
        This keeps experiment data and user experience consistent.
        """
        prompt = self.store.get(prompt_id)
        active = [v for v in prompt.versions if v.status != "deprecated"]
        if not active:
            raise RuntimeError(f"no active versions for {prompt_id}")
 
        # Sticky routing: hash user_id to a value in [0.0, 1.0)
        if user_id:
            h = hashlib.sha256(f"{prompt_id}:{user_id}".encode()).hexdigest()
            r = int(h[:8], 16) / 0x100000000  # divide by 2**32 so r stays strictly below 1.0
        else:
            # Without user_id (batch jobs, etc.), fall back to random
            r = random.random()
 
        cumulative = 0.0
        for v in active:
            cumulative += v.weight
            if r < cumulative:
                return RoutingResult(version=v, routing_key=f"{prompt_id}:{v.version}")
 
        # Fallback for floating point rounding at the edges
        return RoutingResult(version=active[-1], routing_key=f"{prompt_id}:{active[-1].version}")
 
# Smoke test: check the distribution over 10,000 users
if __name__ == "__main__":
    from pathlib import Path
    from collections import Counter
    store = PromptStore(Path("prompts"))
    router = PromptRouter(store)
    counts = Counter()
    for i in range(10000):
        result = router.select("summarize_article", user_id=f"user_{i}")
        counts[result.version.version] += 1
    print(counts)
    # Expected output (approximately):
    # Counter({'v1': 8000, 'v2': 2000})  (within ±1%)

Sticky routing matters for two reasons. One is data consistency, as mentioned above. The other is user experience consistency — if the same user asks the same question and gets wildly different response styles every time, the product just feels bad. Hash-based routing solves both problems at once.
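A useful side effect of hash-based bucketing is worth spelling out: when the canary occupies the top slice of the [0, 1) range, widening that slice only adds users and never evicts the ones already in it. The sketch below is a stdlib-only illustration under that assumption; `bucket` and `assign` are hypothetical helpers, not part of the router above.

```python
from __future__ import annotations
import hashlib


def bucket(prompt_id: str, user_id: str) -> float:
    """Map (prompt_id, user_id) to a stable value in [0.0, 1.0)."""
    digest = hashlib.sha256(f"{prompt_id}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 2**32  # 32 bits of hash, strictly below 1.0


def assign(prompt_id: str, user_id: str, canary_weight: float) -> str:
    """Users whose bucket falls in the top `canary_weight` slice get the canary."""
    return "canary" if bucket(prompt_id, user_id) >= 1.0 - canary_weight else "stable"


users = [f"user_{i}" for i in range(1000)]
at_10 = {u for u in users if assign("summarize_article", u, 0.10) == "canary"}
at_20 = {u for u in users if assign("summarize_article", u, 0.20) == "canary"}

# Everyone who saw the canary at 10% still sees it at 20%.
print(len(at_10), len(at_20), at_10 <= at_20)
```

Because the prompt ID is part of the hash input, a user who lands in the canary for one prompt is not automatically in the canary for every other prompt.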

Evaluation Pipeline: Automated Quality Scoring

Once you can split traffic, the next question is "which version is actually better?" That's the job of the evaluator layer. There are three approaches worth knowing.

Rule-based evaluation checks objective criteria — output format, forbidden words, character count. It is fast and cheap, so we run it on every CI build. LLM-as-a-Judge asks a strong model (e.g., Gemini 2.5 Pro) to "rate this response from 0 to 10" — great for turning subjective quality into numbers, but it costs real money. Human evaluation is the most trustworthy, but throughput is low, so you reserve it for sampling.
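For the rule-based tier, something as small as the following is often enough to start with. This is a sketch under assumptions: the thresholds, the forbidden phrase, and the names `RuleResult` and `check_summary` are illustrative, and the sentence counter is deliberately rough.

```python
from __future__ import annotations
import re
from dataclasses import dataclass, field


@dataclass
class RuleResult:
    passed: bool
    failures: list[str] = field(default_factory=list)


def check_summary(text: str, max_sentences: int = 3, max_chars: int = 600,
                  forbidden: tuple[str, ...] = ("as an ai",)) -> RuleResult:
    """Cheap, deterministic checks for a 3-sentence plain-prose summary."""
    failures: list[str] = []
    stripped = text.strip()
    if not stripped:
        failures.append("empty response")
    if len(stripped) > max_chars:
        failures.append(f"too long: {len(stripped)} > {max_chars} chars")
    # A bullet marker, markdown heading, or "1. " at the start of any line
    # violates the "plain prose" instruction.
    if re.search(r"^\s*(?:[-*•]\s+|#{1,6}\s+|\d+\.\s+)", stripped, flags=re.MULTILINE):
        failures.append("contains bullets or headings")
    # Rough sentence count: split after ., !, or ? when whitespace follows.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", stripped) if s]
    if len(sentences) > max_sentences:
        failures.append(f"too many sentences: {len(sentences)}")
    lowered = stripped.lower()
    failures.extend(f"forbidden phrase: {p!r}" for p in forbidden if p in lowered)
    return RuleResult(passed=not failures, failures=failures)
```

Checks like these cost nothing per call, so they can run on every response and act as a first filter before any sampled judge call.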

In my experience, combining all three is the most cost-effective strategy. Run rule-based on everything, LLM-as-a-Judge on a 10% sample, and human eval weekly to calibrate. Here's a minimal LLM-as-a-Judge implementation.

# src/evaluator.py
# Rates response quality from 0 to 10 using an LLM judge
from __future__ import annotations
import json
import os
from dataclasses import dataclass
from typing import Any
import google.generativeai as genai
 
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # env var required
 
JUDGE_PROMPT = """\
You are a strict evaluator of AI response quality.
Rate the "AI response" below against the "input" on a 0-10 scale.
 
Criteria:
- Relevance (match with the query): 0-3 pts
- Factuality (absence of hallucination): 0-4 pts
- Clarity (readability): 0-3 pts
 
Return total, per-criterion scores, and a brief reason as JSON.
Output only the JSON object; nothing else.
 
Input: {query}
AI response: {response}
 
Output format:
{{"total": 8, "relevance": 3, "factuality": 3, "clarity": 2, "reason": "..."}}
"""
 
@dataclass
class EvalResult:
    total: float
    relevance: float
    factuality: float
    clarity: float
    reason: str
 
def evaluate_response(query: str, response: str, judge_model: str = "gemini-2.5-pro") -> EvalResult:
    """Evaluate a response with LLM-as-a-Judge"""
    model = genai.GenerativeModel(judge_model)
    prompt = JUDGE_PROMPT.format(query=query, response=response)
    try:
        result = model.generate_content(prompt, generation_config={"temperature": 0.0})
        # Sometimes ``` markdown wrappers sneak in; strip them
        text = result.text.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
        data: dict[str, Any] = json.loads(text)
        return EvalResult(
            total=float(data["total"]),
            relevance=float(data["relevance"]),
            factuality=float(data["factuality"]),
            clarity=float(data["clarity"]),
            reason=data.get("reason", ""),
        )
    except json.JSONDecodeError as e:
        # The LLM occasionally returns non-JSON. We fall back rather than retry,
        # and use -1 so a parse failure is never mistaken for a genuine zero score.
        return EvalResult(total=-1.0, relevance=-1.0, factuality=-1.0, clarity=-1.0,
                          reason=f"judge output parse error: {e}")
    except Exception as e:
        # API errors: return -1 so callers can distinguish "eval failed" from "zero score"
        return EvalResult(total=-1.0, relevance=-1.0, factuality=-1.0, clarity=-1.0,
                          reason=f"judge error: {type(e).__name__}: {e}")
 
# Smoke test
if __name__ == "__main__":
    r = evaluate_response(
        query="What's the weather in Tokyo today?",
        response="Tokyo today is mostly sunny with a high of 22C."
    )
    print(f"total={r.total}, reason={r.reason}")
    # Expected output example:
    # total=8.0, reason="High relevance, but the source of weather data is unclear — factuality dinged"

Using temperature=0.0 on the judge is deliberate: it makes evaluation far more repeatable, even though it does not guarantee bit-identical output across calls. If the judge itself flickers, A/B comparisons lose meaning. And returning total=-1.0 on errors (instead of 0.0) lets callers distinguish "we failed to evaluate" from "we evaluated and scored zero", an important distinction when analyzing weekly quality trends.
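On the consuming side, the sentinel only helps if aggregation code actually filters it out. A minimal sketch, assuming negative totals always mean "the judge failed" (`summarize_scores` is a hypothetical helper):

```python
from __future__ import annotations
from statistics import mean


def summarize_scores(totals: list[float]) -> dict[str, float | int | None]:
    """Average only real scores; count judge failures (negative sentinels) separately."""
    scored = [t for t in totals if t >= 0.0]
    return {
        "mean_score": mean(scored) if scored else None,
        "scored": len(scored),
        "judge_failures": len(totals) - len(scored),
    }


print(summarize_scores([8.0, 7.0, -1.0, 0.0, 9.0]))
# {'mean_score': 6.0, 'scored': 4, 'judge_failures': 1}
```

The genuine 0.0 stays in the average while the -1.0 is reported as a failure. Averaging the raw list instead would give 4.6 and understate quality.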

Together, these three pieces let you record, per user, which version produced which response at what quality score. Pipe the logs into BigQuery or Postgres, and you can answer "is the average quality difference between v1 and v2 statistically significant?" every week with a single SQL query.

For integration with a broader observability stack, I've written more in Antigravity × OpenTelemetry: Building an AI Observability Pipeline. Read it alongside this article when you take the platform to production.

Editing Production Prompts Safely from the Antigravity Editor

Once the platform is in place, the next question is operations. Unless the experience of editing prompts is safe for anyone on the team, all that infrastructure goes to waste. Here's the workflow I use with Antigravity.

First, a prompt-specific branch naming convention. I follow prompt/{prompt-id}/{short-description} so that PR titles automatically include the prompt ID. Then I ask the Antigravity agent: "summarize the YAML diff in plain language and draft the commit message for me." That way, the reviewer can immediately see the intent behind the change, not just the mechanical diff.

Second, PR templates that paste evaluation results. A CI script runs v1 and the new version against 50 identical inputs and compares quality scores. The results are auto-commented onto the PR, and the rule "block merge if the new version drops 0.3 points or more on average" is enforced by GitHub Actions. This one piece of friction prevents the "I thought I was improving it, but actually regressed" accident.

Third, the canary status pays for itself. Every new version ships as a weight: 0.1 canary first. It runs for 48 hours and is promoted to stable only if its metrics look healthy. If something breaks, the blast radius is roughly 10% of traffic, and rollback means setting the canary's weight to 0.0 and giving that share back to stable (so the active weights still sum to 1.0), which takes effect as soon as the store reloads.

What makes Antigravity shine here is that editor, prompts, and evaluation pipeline all live in the same workspace. You edit a prompt, then ask the agent: "compare this against v1 on the same 50 inputs and show me where the biggest diffs are." Ten minutes later you know whether you're on the right track. No more tab-switching between IDE, terminal, and spreadsheet — your train of thought stays intact.

For the broader operational context, see Antigravity LLMOps: A Practical Guide to Monitoring and Operating AI Models in Production. This management platform is really one piece of a larger LLMOps discipline.

Production Pitfalls and Recovery Strategies

Here are the traps I've actually fallen into over several months of running this, along with the fixes that worked. Knowing these in advance saves you a lot of pain.

Pitfall 1: shipping with weights that don't sum to 1.0. This happens more than you'd think. Someone writes "let me force v2 to 100% temporarily; I'll set v1 to 0.0," and then forgets to restore v1's weight. The fix is to run the YAML validator in CI and fail the build on any violation. Wire the weights_must_sum_to_one check into pre-commit as well for double protection.

Pitfall 2: template variables drifting out of sync with caller keys. A classic case: you rename {article} to {content} in the prompt, but the Python code still passes article=. The fix is to extract required variables at load time and check them against what the executor provides. In Python, string.Formatter().parse() gives you the variable names, and set(template_vars) <= set(provided.keys()) does the guarding.
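A minimal sketch of that check using only the standard library. The helper names `template_variables` and `missing_variables` are hypothetical; `string.Formatter().parse()` is the real API and yields `(literal_text, field_name, format_spec, conversion)` tuples.

```python
from __future__ import annotations
from string import Formatter


def template_variables(template: str) -> set[str]:
    """Return the named fields a str.format template expects."""
    names: set[str] = set()
    for _literal, field_name, _spec, _conversion in Formatter().parse(template):
        if field_name:  # None for pure literal text, "" for a positional {}
            # "user.name" and "items[0]" both depend only on the root name
            names.add(field_name.split(".")[0].split("[")[0])
    return names


def missing_variables(template: str, provided: dict[str, str]) -> set[str]:
    """Names the template needs that the caller did not supply."""
    return template_variables(template) - set(provided)


template = "Summarize in 3 sentences.\nArticle: {content}\nSummary:"
print(missing_variables(template, {"article": "..."}))
# {'content'}
```

Running this at load time, against the keys each call site is known to pass, turns a runtime KeyError in production into a failed CI job.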

Pitfall 3: LLM-as-a-Judge cost exceeds your budget. Running the judge on every request makes evaluation cost rival production cost. The fix is an explicit sampling strategy. My rule: right after launching a new version, sample at 50%; once it stabilizes, drop to 5%. That step cuts judge cost by 10x, and at steady traffic a 5% sample is usually still enough to track weekly trends.
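One way to encode that rule, sketched with the standard library only. The 7-day warm-up window and the helper names `sample_rate` and `should_judge` are assumptions for illustration; the 50% and 5% rates come from the rule above.

```python
from __future__ import annotations
import hashlib
from datetime import datetime, timedelta, timezone


def sample_rate(version_created: datetime, now: datetime,
                warmup: timedelta = timedelta(days=7),
                warmup_rate: float = 0.5, steady_rate: float = 0.05) -> float:
    """50% while a version is new, 5% once it has been live past the warm-up."""
    return warmup_rate if now - version_created < warmup else steady_rate


def should_judge(request_id: str, rate: float) -> bool:
    """Deterministic sampling: hashing the request id makes reruns pick the same rows."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    return int(digest[:8], 16) / 2**32 < rate


now = datetime(2026, 4, 21, tzinfo=timezone.utc)
rate = sample_rate(datetime(2026, 4, 20, tzinfo=timezone.utc), now)
print(rate, should_judge("req-0001", rate))
```

Hashing the request ID rather than calling `random.random()` means a backfill job re-evaluates exactly the same rows, which keeps week-over-week comparisons honest.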

Pitfall 4: "just flip the weight" isn't always enough for rollback. With sticky routing, users already bucketed into a broken version keep hitting it. You need a "forced reshuffle" mechanism. I add a force_reshuffle_after: 2026-04-20T10:00 field per version, and the router uses this timestamp as additional hash input — any request after that time gets a fresh bucket.
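A sketch of how that timestamp could feed the hash. This is one possible reading of the mechanism, not the exact router code: the `bucket` signature is hypothetical, and it assumes the cutoff and the request time are both naive datetimes in the same time zone.

```python
from __future__ import annotations
import hashlib
from datetime import datetime


def bucket(prompt_id: str, user_id: str, reshuffle_after: datetime | None,
           now: datetime) -> float:
    """Sticky bucket in [0, 1); the salt changes once `now` passes the cutoff."""
    salt = ""
    if reshuffle_after is not None and now >= reshuffle_after:
        salt = reshuffle_after.isoformat()  # new salt gives a new, still sticky, bucket
    key = f"{prompt_id}:{user_id}:{salt}"
    return int(hashlib.sha256(key.encode()).hexdigest()[:8], 16) / 2**32


cutoff = datetime(2026, 4, 20, 10, 0)
before = bucket("summarize_article", "user_42", cutoff, datetime(2026, 4, 20, 9, 0))
after = bucket("summarize_article", "user_42", cutoff, datetime(2026, 4, 20, 11, 0))
print(before, after)
```

Requests before the cutoff hash exactly as they did without the field, so adding it is backward compatible. After the cutoff every user gets a new bucket, and that bucket stays stable until the timestamp is changed again.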

Pitfall 5: prompt changes without matching evaluation criteria updates. If you change a prompt to "answer in bullet points" but your evaluator still rewards "fluent prose," scores will tank for the wrong reason. The fix is simple but cultural: any prompt change PR must include a "do we need to update evaluation criteria?" checkbox, and reviewers actually have to tick it. I've watched this single checkbox cut incident count dramatically.

Closing: Prompt Management Is About Habits, Not Features

The single first step to take today is this: take one prompt that is currently buried inside your application code, and move it out to a YAML file in prompts/. Nothing else. That alone unlocks Git diff history, reviewable changes, and the ability to leave comments. Extracting the first prompt takes maybe 30 minutes. Setting up a PR template takes an hour. Adding the A/B router takes half a day. If your project is going to live another six months, this investment pays for itself many times over.

Continuously improving AI features in production requires elevating prompts from "a craft that one person tunes by feel" to "a development process a team can safely iterate on." The infrastructure for that shift turns out to be surprisingly small. Don't aim for perfection — start with one YAML file and a simple validator. Keep stacking from there, and six months from now you'll find you've become "a team that isn't afraid of prompt changes."

Executor Layer: Actually Running the Selected Prompt

The glue between Store, Router, and Evaluator is the Executor layer. Its job is to actually call the LLM with the version the router picked, record token consumption, latency, and cost, and return a response object the evaluator can consume. It's easy to treat this layer as trivial, but this is where your observability core actually lives.

# src/executor.py
# Executes selected prompts and records metrics for every run
from __future__ import annotations
import time
import json
from dataclasses import dataclass, asdict
from pathlib import Path
import google.generativeai as genai
from src.prompt_router import RoutingResult
 
@dataclass
class ExecutionRecord:
    """One observation record per request"""
    prompt_id: str
    version: str
    user_id: str | None
    query: str
    response: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    error: str | None = None
 
def _estimate_cost_usd(input_tokens: int, output_tokens: int, model: str) -> float:
    """Multiplies by per-model per-million-token unit prices"""
    # Placeholder prices for illustration only: check the provider's current
    # pricing page, and in production move this table out to YAML
    table = {
        "gemini-2.5-pro": (0.375, 1.125),   # (input/1M, output/1M) USD
        "gemini-2.5-flash": (0.075, 0.225),
    }
    ipm, opm = table.get(model, (0.3, 1.0))
    usd = (input_tokens / 1_000_000) * ipm + (output_tokens / 1_000_000) * opm
    return round(usd, 6)
 
class PromptExecutor:
    def __init__(self, log_path: Path = Path("logs/prompt_runs.jsonl")) -> None:
        self.log_path = log_path
        self.log_path.parent.mkdir(parents=True, exist_ok=True)
 
    def run(self, routing: RoutingResult, variables: dict[str, str],
            user_id: str | None = None) -> ExecutionRecord:
        v = routing.version
        # Missing template variable check (mitigates Pitfall 2)
        try:
            prompt_text = v.template.format(**variables)
        except KeyError as e:
            raise ValueError(
                f"missing template variable {e} for prompt {routing.routing_key}"
            ) from e
 
        model = genai.GenerativeModel(v.model)
        start = time.perf_counter()
        error: str | None = None
        response_text = ""
        input_tokens = output_tokens = 0
 
        try:
            response = model.generate_content(
                prompt_text,
                generation_config={
                    "temperature": v.temperature,
                    "max_output_tokens": v.max_tokens,
                },
            )
            response_text = response.text
            # Pull tokens out of google-generativeai's usage_metadata
            um = getattr(response, "usage_metadata", None)
            if um:
                input_tokens = getattr(um, "prompt_token_count", 0)
                output_tokens = getattr(um, "candidates_token_count", 0)
        except Exception as e:
            error = f"{type(e).__name__}: {e}"
 
        latency_ms = (time.perf_counter() - start) * 1000.0
        record = ExecutionRecord(
            prompt_id=routing.routing_key.split(":")[0],
            version=v.version,
            user_id=user_id,
            query=variables.get("article", "")[:100],  # keep only first 100 chars
            response=response_text,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            latency_ms=latency_ms,
            error=error,
        )
        self._persist(record)
        return record
 
    def _persist(self, rec: ExecutionRecord) -> None:
        # Append JSONL. In production, ship this to BigQuery/Postgres instead.
        with self.log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(rec), ensure_ascii=False) + "\n")
 
# Smoke test (requires API key)
if __name__ == "__main__":
    import os
    from pathlib import Path
    from src.prompt_store import PromptStore
    from src.prompt_router import PromptRouter
 
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    store = PromptStore(Path("prompts"))
    router = PromptRouter(store)
    executor = PromptExecutor()
 
    routing = router.select("summarize_article", user_id="user_42")
    record = executor.run(routing, {"article": "Cherry blossoms hit full bloom in Tokyo yesterday."}, user_id="user_42")
    cost = _estimate_cost_usd(record.input_tokens, record.output_tokens, routing.version.model)
    print(f"version={record.version}, latency={record.latency_ms:.0f}ms, cost=${cost:.6f}")
    # Prints the selected version, latency in ms, and estimated cost in USD;
    # the exact values vary per run and per user bucket.

The critical habit here is never swallowing errors silently — always record them in ExecutionRecord.error. When analyzing trends, you need failures sitting next to successes in the same log. Otherwise you won't notice "failure rate is spiking in the last hour" until customers complain. And why record only the first 100 characters of query? It's a compromise: keeps PII leakage bounded while still letting you do aggregate analysis. Hash or mask further if your data classification requires it.
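Because failures and successes share one log, a per-version health summary can be computed straight from the JSONL lines. A minimal sketch, assuming each line carries the `prompt_id`, `version`, `error`, and `latency_ms` fields of `ExecutionRecord`; `version_health` is a hypothetical helper.

```python
from __future__ import annotations
import json
from collections import defaultdict
from typing import Iterable


def version_health(lines: Iterable[str]) -> dict[str, dict[str, float]]:
    """Per-version request count, error rate, and mean latency from JSONL lines."""
    # Each entry holds [request count, error count, latency sum in ms]
    totals: dict[str, list[float]] = defaultdict(lambda: [0.0, 0.0, 0.0])
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the log
        rec = json.loads(line)
        stats = totals[f'{rec["prompt_id"]}:{rec["version"]}']
        stats[0] += 1
        stats[1] += 1 if rec.get("error") else 0
        stats[2] += rec["latency_ms"]
    return {
        key: {"requests": n, "error_rate": errors / n, "avg_latency_ms": latency / n}
        for key, (n, errors, latency) in totals.items()
    }
```

With the default log location used above, usage would look like `version_health(Path("logs/prompt_runs.jsonl").open(encoding="utf-8"))`.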

Integrating with CI/CD: Automatic Quality Diff Reports on Every PR

Running evaluations by hand gets tiresome after the first few times. Once you wire it into GitHub Actions, every PR triggers an evaluation and the results get posted as a comment. Here's a simplified version of the workflow I use in production.

# .github/workflows/prompt-eval.yml
name: Prompt Evaluation
on:
  pull_request:
    paths:
      - "prompts/**.yaml"
 
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
 
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
 
      - name: Install deps
        run: pip install pydantic pyyaml google-generativeai
 
      - name: Run comparative evaluation
        env:
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
        run: |
          python scripts/compare_versions.py \
            --prompt-id summarize_article \
            --baseline-ref origin/main \
            --candidate-ref HEAD \
            --sample-size 50 \
            --output eval_report.md
 
      - name: Comment on PR
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          path: eval_report.md
 
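      - name: Fail if quality regressed
        # Assumes compare_versions.py also writes eval_report.json with a numeric
        # "diff" field (candidate mean minus baseline mean) next to the Markdown report
        run: |
          SCORE_DIFF=$(jq '.diff' eval_report.json)
          if (( $(echo "$SCORE_DIFF < -0.3" | bc -l) )); then
            echo "Quality regressed by $SCORE_DIFF, blocking merge"
            exit 1
          fi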

The paths filter is deliberate — it only triggers on changes under prompts/, so unrelated PRs don't burn API budget. The -0.3 threshold is what I've empirically found represents "clearly regressed." Tune it for your project: too strict and improvement PRs get blocked, too loose and regressions sneak through.
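The workflow's last step reads a `diff` value from a JSON report. The comparison script itself is not shown here, so the following is only a hypothetical sketch of what its final step might compute: `build_report` and the field names other than `diff` are assumptions, and the score lists stand in for judge totals collected on the same inputs.

```python
from __future__ import annotations
import json
from statistics import mean


def build_report(baseline: list[float], candidate: list[float],
                 threshold: float = -0.3) -> dict[str, float | bool | int]:
    """Compare mean judge scores on the same inputs; a negative diff is a regression."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("baseline and candidate must be non-empty and the same length")
    diff = mean(candidate) - mean(baseline)
    return {
        "n": len(baseline),
        "baseline_mean": round(mean(baseline), 3),
        "candidate_mean": round(mean(candidate), 3),
        "diff": round(diff, 3),
        "regressed": diff < threshold,
    }


report = build_report([8.0, 7.0, 9.0, 8.0], [7.0, 7.0, 8.0, 8.0])
print(json.dumps(report))
# {"n": 4, "baseline_mean": 8.0, "candidate_mean": 7.5, "diff": -0.5, "regressed": true}
```

Writing the same dictionary to `eval_report.json` would give the `jq '.diff'` step something to read. With only 50 samples a mean difference is noisy, so a threshold like -0.3 works best as a tripwire rather than as proof.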

Six months into running this, I've come to believe that "every PR mechanically gets a quality report attached" actually changes how a team thinks about prompts. When numbers are right there, people naturally start asking "why did this change drop the score?" Without numbers, prompt reviews collapse into subjective arguments.

Migration Playbook: How to Introduce This to an Existing Project

Here's the realistic sequence I actually used — not the idealized version — to roll this out on a live project without breaking anything.

Phase 1: externalize writes only. First, dump every prompt to YAML under prompts/, but leave the string literals in application code untouched. The app still doesn't read from the store, so behavior is unchanged. The goal of this phase is to get prompts into Git and establish a review habit. Run it for a week or two until "touching a prompt means opening a PR" feels natural to the team.

Phase 2: migrate one prompt to read from the store. Pick a low-stakes prompt — internal debug tooling, for example — and replace the string literal with prompt_store.get("xxx"). If something breaks here, the blast radius is tiny, which is exactly what you want for surfacing implementation bugs. My first migration exposed both a status typo and a case where weights didn't sum to 1.0 — which is why the CI validator now feels non-negotiable.

Phase 3: migrate the rest gradually. Two or three prompts per week is plenty. Migrating everything at once makes incident triage much harder. This is also when you introduce the A/B router and establish the "new versions always start as canary" rule.

Phase 4: integrate the evaluation pipeline with production. Only now do you wire logs into BigQuery or Postgres and automate weekly reports. Dashboard work feels unglamorous and always slips to "later," but whether you have it or not is the difference between "we operate on feel" and "we operate on numbers."

SQL for the Dashboard: Queries You'll Actually Use Daily

Finally, three queries you'll reach for constantly once evaluation logs start piling up. Paste them into Looker Studio or Metabase; glancing at them every morning tells you the health of your prompt operation at a glance.

-- Query 1: average quality score and latency per version, last 24 hours
SELECT
  prompt_id,
  version,
  COUNT(*) AS requests,
  AVG(eval_total) AS avg_score,
  AVG(latency_ms) AS avg_latency_ms,
  SUM(input_tokens + output_tokens) AS total_tokens
FROM prompt_runs
WHERE ts >= CURRENT_TIMESTAMP() - INTERVAL 24 HOUR
  AND error IS NULL
GROUP BY prompt_id, version
ORDER BY prompt_id, avg_score DESC;

-- Query 2: statistical significance check for A/B tests (Welch's t-test, simplified)
-- Tells you whether the canary is actually better than stable, numerically
WITH stats AS (
  SELECT
    version,
    AVG(eval_total) AS mean,
    STDDEV(eval_total) AS stddev,
    COUNT(*) AS n
  FROM prompt_runs
  WHERE prompt_id = 'summarize_article'
    AND ts >= CURRENT_TIMESTAMP() - INTERVAL 7 DAY
    AND error IS NULL
  GROUP BY version
)
SELECT
  a.version AS version_a, a.mean AS mean_a,
  b.version AS version_b, b.mean AS mean_b,
  (a.mean - b.mean) / SQRT((a.stddev*a.stddev)/a.n + (b.stddev*b.stddev)/b.n) AS t_stat
FROM stats a JOIN stats b ON a.version < b.version;
-- |t_stat| > ~2 is a rough two-sided cutoff for 95% confidence, valid when each version has a reasonably large sample (roughly 30+ rows)

-- Query 3: hourly error timeline — did failure rate spike after a new version deploy?
SELECT
  TIMESTAMP_TRUNC(ts, HOUR) AS hour,
  version,
  COUNT(*) AS total,
  SUM(CASE WHEN error IS NOT NULL THEN 1 ELSE 0 END) AS errors,
  ROUND(SUM(CASE WHEN error IS NOT NULL THEN 1.0 ELSE 0.0 END) * 100 / COUNT(*), 2) AS error_rate_pct
FROM prompt_runs
WHERE ts >= CURRENT_TIMESTAMP() - INTERVAL 48 HOUR
GROUP BY hour, version
ORDER BY hour DESC, version;

Wire these three into a daily-check dashboard and most quality issues will surface before users notice them. I personally have a Slack alert that fires the moment a canary's error rate exceeds 2x the stable version's. That small piece of automation on top of Query 3 has pretty much eliminated my 3am incident pages.
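The 2x rule can be sketched as a small function, assuming you have already summed Query 3's rows into per-version request and error counts. The function name, the `min_requests` guard against noisy low-traffic canaries, and the handling of an error-free stable version are my own assumptions for illustration.

```python
def canary_should_alert(
    stable_total: int,
    stable_errors: int,
    canary_total: int,
    canary_errors: int,
    ratio: float = 2.0,
    min_requests: int = 50,
) -> bool:
    """True when the canary's error rate exceeds `ratio` times the stable rate."""
    # Too little canary traffic makes the rate meaningless, so stay quiet.
    if canary_total < min_requests or stable_total == 0:
        return False
    stable_rate = stable_errors / stable_total
    canary_rate = canary_errors / canary_total
    if stable_rate == 0.0:
        # Stable is error-free, so treat any canary error as a spike (a design choice).
        return canary_errors > 0
    return canary_rate > ratio * stable_rate
```

The traffic floor matters more than it looks: without it, a single failure in the first ten canary requests would page you every time.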

Choosing Your Storage Backend: When YAML Stops Being Enough

Starting with YAML was deliberate — it gives you 90% of the value with 10% of the complexity. But at some point you'll outgrow it. Here's how I think about the progression.

YAML-in-Git is the right choice when prompt count is under roughly 50, the team has at most a few engineers who actively modify prompts, and you don't need non-engineers (product, content, localization) editing in real time. The killer feature is that every change is a PR, which means every change gets reviewed.

You'll feel the pinch when three things start happening. First, non-engineers need to edit prompts, and teaching everyone Git is not worth the cost. Second, prompts need to change at runtime — for example, per-tenant customization in a multi-tenant SaaS. Third, you start storing evaluation metadata directly on each version (who approved it, when, what its historical score was), and YAML files get unwieldy.

At that point, I recommend moving to a database — Postgres works beautifully — but keep the Pydantic schema you already have. Store the same PromptDefinition shape in JSONB columns, and your PromptStore implementation only needs a new backend; the rest of the pipeline is unchanged. This is where the four-layer separation pays real dividends.
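To make the "only the backend changes" idea concrete, here is a minimal sketch of the seam. The `PromptBackend` protocol, the `InMemoryBackend` stand-in, and this simplified `PromptStore` are hypothetical names for illustration. A real Postgres backend would implement the same `load` method by reading a JSONB column, and the definition would be validated into the Pydantic model rather than returned as a dict.

```python
from typing import Protocol


class PromptBackend(Protocol):
    """Anything that can return the raw definition for a prompt ID."""

    def load(self, prompt_id: str) -> dict: ...


class InMemoryBackend:
    """Stand-in for a YAML or Postgres JSONB backend; holds definitions in a dict."""

    def __init__(self, definitions: dict[str, dict]):
        self._definitions = definitions

    def load(self, prompt_id: str) -> dict:
        return self._definitions[prompt_id]


class PromptStore:
    """Callers depend on this class only, so swapping the backend does not touch them."""

    def __init__(self, backend: PromptBackend):
        self._backend = backend

    def get(self, prompt_id: str) -> dict:
        definition = self._backend.load(prompt_id)
        if "versions" not in definition:
            raise ValueError(f"{prompt_id}: definition has no versions")
        return definition
```

Because the store validates whatever the backend returns, a malformed row in the database fails the same way a malformed YAML file would.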

One common trap when moving to a database: losing the review discipline that the Git pull request workflow gave you. It's tempting to build a fancy web UI for prompt editing, skip PR reviews, and let anyone save changes directly. Six months later, you'll have 200 versions and no idea which ones are actually used. The fix is to treat your database like a staging area: changes still need approval before status flips from canary to stable. The approval workflow can live in the same database (an approvals table), but it has to exist.
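A minimal sketch of that approval gate, assuming approvals are loaded alongside the version row. The `VersionRecord` shape, the `promote_to_stable` name, and the rule that authors cannot approve their own change are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VersionRecord:
    prompt_id: str
    version: str
    status: str = "canary"
    approvals: list[str] = field(default_factory=list)  # reviewer names


def promote_to_stable(record: VersionRecord, author: str, required: int = 1) -> VersionRecord:
    """Flip canary -> stable only if enough reviewers other than the author approved."""
    if record.status != "canary":
        raise ValueError(f"only canary versions can be promoted, got {record.status!r}")
    # Self-approval does not count, mirroring a typical PR review rule.
    independent = {name for name in record.approvals if name != author}
    if len(independent) < required:
        raise PermissionError(
            f"{record.prompt_id}@{record.version} needs {required} approval(s), "
            f"has {len(independent)}"
        )
    record.status = "stable"
    return record
```

Whatever UI you build on top, routing every status change through one function like this keeps the rule in a single auditable place.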

Choosing Your Judge Model: A Cost-Quality Tradeoff

The LLM-as-a-Judge implementation earlier used Gemini 2.5 Pro as the judge. That's a deliberate but expensive choice. Let me share how I think about this tradeoff, because getting it wrong can cost several thousand dollars a year.

The judge needs three properties: consistent scoring (two calls with the same inputs should produce nearly identical scores), calibration (scoring should track human judgment at least loosely), and coverage (the judge should understand the domain — if you're grading medical summaries, a general-purpose model is riskier than a domain-tuned one).

In practice I've used three configurations. Gemini 2.5 Pro as the judge is my default for production traffic. It's accurate and its calibration against human scores sits above 0.8 correlation in my tests. Gemini 2.5 Flash is roughly 5x cheaper and still surprisingly good — correlation drops to about 0.7, which is fine for CI-time rough checks but weaker for release-blocking decisions. A fine-tuned smaller model is the endgame if you have hundreds of thousands of labeled examples; I haven't personally had the volume to justify this, but teams I've talked to have gotten correlation above 0.9 with models small enough to run on a single GPU.

The practical rule I follow: use Pro for the 10% sample that feeds into statistical analysis, use Flash for every CI run, and keep one human-graded calibration set of about 100 examples that you run whenever you change the judge prompt itself. That calibration set is cheap to maintain but catches judge drift immediately — if the judge starts scoring a known-good response lower than before, something has changed and you want to know.
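The calibration check itself is just a correlation between two score lists. Here is a minimal sketch using Pearson correlation, assuming you have human scores and fresh judge scores for the same calibration examples in the same order. The function names and the 0.8 default threshold (borrowed from the Pro figure above) are assumptions you should tune.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def judge_has_drifted(human: list[float], judge: list[float], min_corr: float = 0.8) -> bool:
    """True when the judge no longer tracks the human-graded calibration set."""
    return pearson(human, judge) < min_corr
```

Run it after every change to the judge prompt or judge model, and treat a drop below the threshold as a blocker for trusting new evaluation numbers.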

How This Composes with Antigravity's Agent Framework

If you're building multi-agent systems on Antigravity, this prompt management platform becomes even more valuable. Each agent typically uses 3-10 distinct prompts — one for planning, several for tool use, one for summarization — and managing them all by feel quickly becomes impossible. Treating each prompt as a separately versioned asset keeps the system understandable.

The concrete pattern I use: each agent declares a manifest of the prompt IDs it consumes. When you spin up an agent, the runtime confirms all its prompt IDs exist in the store and that each has at least one non-deprecated version. This catches deployment mistakes before they reach users. Combined with the evaluator layer, you can also track per-agent quality independently — "the planner prompt regressed but the summarizer is fine," which is the kind of granular signal you need to debug multi-agent failures.
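The startup check can be sketched in a few lines, assuming the store can be viewed as a mapping from prompt ID to a list of version entries with a `status` key. The `check_manifest` name and that dict-based view are assumptions for illustration, not an Antigravity API.

```python
def check_manifest(manifest: list[str], store: dict[str, list[dict]]) -> list[str]:
    """Return problems for prompt IDs that are missing or have only deprecated versions."""
    problems = []
    for prompt_id in manifest:
        versions = store.get(prompt_id)
        if versions is None:
            problems.append(f"{prompt_id}: not found in store")
        elif not any(v.get("status") != "deprecated" for v in versions):
            problems.append(f"{prompt_id}: no non-deprecated version")
    return problems
```

Refusing to start the agent when the returned list is non-empty turns a confusing runtime failure into an obvious deployment error.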

For a deeper walkthrough of multi-agent architecture and how prompts fit into it, see AGENTS.md: Multi-Agent Architecture with Antigravity. The prompt management platform described here slots directly under that structure as the "prompts-as-data" layer.

Another concrete pattern: separate "agent-authored" prompts from "human-authored" prompts. Some of your prompts are static business logic (the planner, the summarizer). Others are generated by the agent itself at runtime (e.g., a tool-use prompt constructed from a schema). Both deserve versioning, but the lifecycle is different — human-authored prompts change in PRs, while agent-authored prompt templates change when the underlying tool schema changes. I tag versions with a source: human | agent field so dashboards can filter them separately. Mixing both into a single bucket makes it nearly impossible to tell whether a quality regression came from a human change or from a schema-driven change.

And remember that Antigravity's agent manager surface lets you inspect agent runs interactively. If you thread your routing key (the one we record in ExecutionRecord) into the trace, you can click through from a misbehaving agent run directly to "this was the prompt text that produced this output." That closed loop — from user complaint back to exact prompt version — is the magic moment that makes the whole platform feel worth the investment.

Closing Thoughts on Team Culture

Infrastructure aside, what ultimately makes prompt management succeed or fail is culture. I've watched teams build beautiful platforms that nobody uses, and I've watched teams run everything on a single YAML file who still ship improvements every week. The difference isn't the tooling — it's the shared belief that prompts matter enough to be treated seriously.

Three cultural nudges have worked for me. First, celebrate quality score wins alongside feature shipping. When someone's PR raises the average score by 0.4 points, mention it in standup; it signals that prompt work is real work. Second, make the numbers visible to non-engineers. Share weekly score trend screenshots with product and support; they'll start paying attention and asking better questions. Third, accept that some experiments will fail. The whole point of A/B testing is that you find out which changes actually help. If every experiment wins, you're probably not trying hard enough.

If you take nothing else from this article, take this: a tiny amount of prompt infrastructure changes the conversation from "does this new wording feel better?" to "did quality go up?" That shift alone, more than any specific code pattern here, is what separates teams that ship AI features confidently from teams that ship and pray.

A final note on starting small. Teams sometimes wait to introduce any of this until they have "enough" prompts to justify it. In my experience that threshold never arrives naturally — there's always one more feature to ship first. The decision is actually simpler than it looks: if your application is going to run in production for more than three months and has at least one user-facing prompt, you already have enough justification. The first YAML file and the Pydantic validator take an afternoon. Everything else can grow on top organically, in response to real pain rather than imagined needs. That bias toward starting minimal and adding only what pain demands is, I think, the single most important meta-lesson from this entire article.
