Pin Your Agent's Output With Golden Snapshots Before Switching Models

When Antigravity's engine moves to Gemini 3.5 Flash, an agent's output can drift silently. This walks through a golden-snapshot regression gate that catches the drift, with the actual test code and a migration-day checklist.

The morning the engine moved to a new model, the agent received yesterday's prompt and returned output that differed slightly from yesterday's. No errors. Clean logs. Yet one tags entry had quietly vanished from the generated front matter.

I run several sites on my own, with an agent that drafts content overnight. When output breaks outright, I notice. What I fear is output drifting in a direction that is neither clearly better nor clearly worse. Even when the new model is as fast and capable as Gemini 3.5 Flash, some quiet drift of this kind should be expected.

Why output changes silently during a model migration

The drift travels by three paths.

First, format wobble. Given the same "return JSON" instruction, a new model may change key ordering or how it treats empty arrays. Second, the habit of omission. Smarter models decide "this is obvious, I'll skip it" and drop fields they used to state explicitly. Third, tone. The length of a summary or the firmness of an assertion shifts, falling outside the length range the downstream step assumed.
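
To make the first two paths concrete, here is a small sketch with two invented payloads. `old_output` and `new_output` are made up for illustration, not real model output, and `dropped_fields` is a hypothetical helper. A raw string comparison trips on harmless key reordering, while parsing first isolates the change that matters: the omitted `tags` field.

```python
import json

# Hypothetical outputs for the same prompt, before and after a model switch.
old_output = '{"title": "Pinning output", "tags": [], "summary": "Short."}'
new_output = '{"summary": "Short.", "title": "Pinning output"}'


def dropped_fields(before: str, after: str) -> set[str]:
    """Return top-level keys present in `before` but missing from `after`."""
    return set(json.loads(before)) - set(json.loads(after))


# The raw strings differ, but part of that difference is only key order.
print(old_output == new_output)                # False
# Comparing parsed keys ignores ordering and surfaces the omitted field.
print(dropped_fields(old_output, new_output))  # {'tags'}
```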

Each is small on its own, and the test keeps printing "pass." That is exactly why you need to pin the pre-migration output properly, once.

The idea of a golden snapshot

A golden snapshot is output that a human has certified as correct at a given point in time, saved to a file. From then on you compare the agent's output against this saved answer.

The key is not to aim for an exact string match. Generative output wobbles by nature. What you pin is not the surface string but the structure and invariants the downstream depends on. For example: "the front matter has all seven required keys," "the body has at least six H2s," "internal links point only to articles that actually exist."
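
As an illustration of the third invariant, the sketch below checks that internal links only point at known articles. It assumes internal links are Markdown links of the form `[text](/articles/<slug>)`; that URL shape, the `INTERNAL_LINK` pattern, and the `dangling_links` helper are assumptions for this example, so adapt them to how your site builds its links.

```python
import re

# Assumption: internal links look like [text](/articles/<slug>).
INTERNAL_LINK = re.compile(r"\]\(/articles/([a-z0-9-]+)/?\)")


def dangling_links(body: str, known_slugs: set[str]) -> list[str]:
    """Return slugs the body links to that are not in `known_slugs`."""
    linked = INTERNAL_LINK.findall(body)
    return sorted(set(linked) - known_slugs)


body = "See [setup](/articles/agent-setup) and [old post](/articles/missing-post)."
print(dangling_links(body, {"agent-setup", "golden-tests"}))  # ['missing-post']
```

An empty list means the invariant holds; anything else is a link to an article that does not exist.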

Deciding the granularity of the snapshot

The first thing to write is a function that normalizes the output. It absorbs the parts allowed to wobble and extracts only the properties you want to pin.

import re
import yaml  # PyYAML

# Keep required keys and their invariants in one place
REQUIRED_KEYS = ["title", "slug", "category", "level", "premium", "tags", "description"]

def normalize(article: str) -> dict:
    """Turn agent output into a wobble-resistant dict for comparison."""
    # Tolerate CRLF line endings so the front matter regex still matches
    article = article.replace("\r\n", "\n")
    fm_match = re.match(r"^---\n(.*?)\n---\n(.*)$", article, re.DOTALL)
    if not fm_match:
        raise ValueError("frontmatter missing")

    # safe_load returns None for empty front matter, so fall back to {}
    front = yaml.safe_load(fm_match.group(1)) or {}
    body = fm_match.group(2)

    return {
        "keys_present": sorted(k for k in REQUIRED_KEYS if k in front),
        # A bare `tags:` parses as None, so count it as zero tags
        "tag_count": len(front.get("tags") or []),
        "h2_count": len(re.findall(r"^##\s+", body, re.MULTILINE)),
        # Each fenced block contributes an opening and a closing fence
        "code_blocks": len(re.findall(r"^```", body, re.MULTILINE)) // 2,
        "char_len_bucket": len(body) // 500,  # round into 500-char buckets
    }

Rounding into 500-character buckets, as with char_len_bucket, is the crux. A body changing from 3,000 to 3,200 characters lands in the same bucket, so no meaningless diff fires. But shrink from 3,000 to 1,800 and the bucket changes, so it is caught. You are stating, as a number, how much wobble you allow.
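
The arithmetic is easy to check by hand. This short sketch reuses the same 500-character width; `BUCKET_SIZE` and `bucket` are names introduced here for illustration only.

```python
BUCKET_SIZE = 500  # same width as char_len_bucket in normalize()


def bucket(char_len: int) -> int:
    """Map a body length to its 500-character bucket."""
    return char_len // BUCKET_SIZE


for before, after in [(3000, 3200), (3000, 1800), (2990, 3010)]:
    print(before, after, bucket(before), bucket(after))
# 3000 3200 6 6
# 3000 1800 6 3
# 2990 3010 5 6
```

The last pair shows the one weakness of fixed buckets: a 20-character change can cross a boundary. That is why the comparison step should tolerate a one-bucket difference rather than require equal buckets.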

Don't judge the diff by exact match

When comparing the normalized dicts, I vary the tolerance per key. Every required key in the golden must still be present; the character bucket tolerates a drop of one step and fails at two.

def compare(golden: dict, current: dict) -> list[str]:
    """Compare golden and current output, return only the diffs that matter."""
    issues = []
 
    # Missing required keys is an instant fail (downstream automation breaks)
    missing = set(golden["keys_present"]) - set(current["keys_present"])
    if missing:
        issues.append(f"missing required keys: {sorted(missing)}")
 
    # Tag count tolerates ±1 (a model adding or dropping one tag is fine)
    if abs(golden["tag_count"] - current["tag_count"]) > 1:
        issues.append(f"tag count jumped: {golden['tag_count']} -> {current['tag_count']}")
 
    # Warn if the heading structure thins out (a sign the article got shallow)
    if current["h2_count"] < golden["h2_count"] - 1:
        issues.append(f"H2 dropped: {golden['h2_count']} -> {current['h2_count']}")
 
    # Catch a body shrinking by two buckets (about 1,000 chars) or more
    if golden["char_len_bucket"] - current["char_len_bucket"] >= 2:
        issues.append("body shrank substantially")
 
    return issues

Splitting "wobble to allow" from "regression to forbid" like this makes the gate realistic. A test demanding an exact match dies the third time someone adds # skip. Only tests with built-in tolerance are still alive six months later.
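
One way to keep those tolerances visible is to hold them in a table, the same way REQUIRED_KEYS holds the key list. The sketch below is a hypothetical variation on the numeric checks in compare, not the original code: `MAX_DROP` and `numeric_regressions` are names invented here, and the limits mirror the thresholds used above.

```python
# Hypothetical table: the largest drop allowed per numeric key.
MAX_DROP: dict[str, int] = {"h2_count": 1, "char_len_bucket": 1}


def numeric_regressions(golden: dict, current: dict,
                        max_drop: dict[str, int] = MAX_DROP) -> list[str]:
    """Report every numeric key that dropped by more than its allowed amount."""
    return [
        f"{key} dropped: {golden[key]} -> {current[key]}"
        for key, limit in max_drop.items()
        if golden[key] - current[key] > limit
    ]


golden = {"h2_count": 7, "char_len_bucket": 6}
slightly_shorter = {"h2_count": 6, "char_len_bucket": 5}
much_thinner = {"h2_count": 4, "char_len_bucket": 3}

print(numeric_regressions(golden, slightly_shorter))  # []
print(numeric_regressions(golden, much_thinner))
# ['h2_count dropped: 7 -> 4', 'char_len_bucket dropped: 6 -> 3']
```

Growth is never reported here, which matches the one-sided checks on headings and body length in compare.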

Wiring it into a CI gate

After that, you just save the golden before migration and compare against it after. If you run Antigravity CLI on a schedule, slot this gate right after generation.

import json
from pathlib import Path

GOLDEN_DIR = Path("tests/golden")

def gate(sample_id: str, output: str, update: bool = False) -> int:
    """Return 0 if within tolerance (or a golden was saved), 1 on regression."""
    golden_path = GOLDEN_DIR / f"{sample_id}.json"
    current = normalize(output)

    if update or not golden_path.exists():
        # Create tests/golden on first run so write_text does not fail
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(
            json.dumps(current, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        print(f"saved golden: {sample_id}")
        return 0

    golden = json.loads(golden_path.read_text(encoding="utf-8"))
    issues = compare(golden, current)
    if issues:
        print(f"regression detected in {sample_id}:")
        for issue in issues:
            print(f"  - {issue}")
        return 1
    print(f"{sample_id} within tolerance")
    return 0

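Run with update=True only after a human has reviewed the new output and decided "this is the new correct answer." Never let the agent itself control the update flag: it would overwrite the golden with its own regression and make that the answer of record. Note that gate also saves a golden whenever none exists yet, so a deleted or renamed golden file passes silently; keeping the golden files under version control makes that visible in review.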
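
One possible way to enforce that rule is to accept the flag only from an explicit command-line argument and to refuse it in unattended runs. This is a sketch under assumptions: `parse_update_flag` is a hypothetical helper, and it assumes your scheduler or CI system exports `CI=true` (or `CI=1`), which you should verify for your own setup.

```python
import argparse


def parse_update_flag(argv: list[str], env: dict[str, str]) -> bool:
    """Return True only when --update was passed outside an unattended run."""
    parser = argparse.ArgumentParser(description="golden snapshot gate")
    parser.add_argument("--update", action="store_true",
                        help="approve the current output as the new golden")
    args = parser.parse_args(argv)
    # Assumption: unattended runs export CI=true or CI=1.
    if args.update and env.get("CI", "").lower() in ("1", "true"):
        raise SystemExit("refusing --update in an unattended run")
    return args.update


print(parse_update_flag([], {}))            # False
print(parse_update_flag(["--update"], {}))  # True
```

In a real entry point you would pass `sys.argv[1:]` and `dict(os.environ)`, then hand the result to the gate as its update argument.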

The steps to run on migration day

The actual switch is safest in this order.

1. The day before, run 10 to 20 representative inputs through the old model (say Gemini 3.1 Pro) and save the goldens. Too many and you cannot review them, so two or three per category is plenty.
2. Switch the engine to Gemini 3.5 Flash and regenerate from the same inputs.
3. Run the gate with update=False and eyeball only the cases that diffed. One tag added is welcome; a required key dropped means you stop the migration.
4. If all is well, re-save the new output as the answer with update=True.
5. For the first week after migration, keep this gate on every daily run to catch drift that shows up late.
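
The review step can be sketched as a small batch runner that collects only the samples needing a human look. `run_batch` is a hypothetical helper, and `fake_gate` is a stand-in so the example runs on its own; in practice you would pass the gate function from the previous section.

```python
from typing import Callable


def run_batch(outputs: dict[str, str],
              gate_fn: Callable[[str, str], int]) -> list[str]:
    """Run the gate on every sample and return the ids that need review."""
    return [
        sample_id
        for sample_id, text in sorted(outputs.items())
        if gate_fn(sample_id, text) != 0
    ]


# Stand-in gate for demonstration only: flags any output under 10 characters.
def fake_gate(sample_id: str, output: str) -> int:
    return 1 if len(output) < 10 else 0


outputs = {"news-01": "a long enough article body", "howto-02": "too short"}
flagged = run_batch(outputs, fake_gate)
print(flagged)  # ['howto-02']
```

A non-empty list is the signal to pause the migration and read those samples before approving anything.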

In my case, moving to 3.5 Flash felt roughly 3x faster, but the model picked up a habit of tacking an extra sentence onto the end of summaries. The character bucket could not catch it, and I only noticed the failure later. So now I add the trailing 100-character pattern to the invariants too.
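
The article does not spell out what that trailing pattern looks like, so the following is only one possible reading: count the sentence endings in the last 100 characters and pin that number. `tail_sentence_count` and its `window` parameter are hypothetical, and the punctuation rule is a rough heuristic for English text.

```python
import re


def tail_sentence_count(text: str, window: int = 100) -> int:
    """Count sentence endings within the last `window` characters of text."""
    tail = text.strip()[-window:]
    # A terminator counts when followed by whitespace or the end of the text.
    return len(re.findall(r"[.!?](?=\s|$)", tail))


old_summary = "Golden snapshots pin structure. They survive model swaps."
new_summary = old_summary + " Give it a try today!"

print(tail_sentence_count(old_summary))  # 2
print(tail_sentence_count(new_summary))  # 3
```

Stored alongside the other normalized values, a jump in this count would flag an appended closing sentence even when the body stays in the same length bucket.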

The limits that remain, and how I operate

Golden snapshots are strong against structural regression and weak against meaning decay. A decay where keys, length, and headings are all intact yet the content has gone vague does not show up in the numbers.

So once a week I read, with my own eyes, just one item per category from the output that passed the gate. By my rough estimate, the automated gate stops about 90% of regressions, and a human catches the remaining 10% that are "somehow thin." I settled on this two-layer setup.
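
Picking the weekly reading list can be automated even though the reading cannot. This is a minimal sketch, assuming the passed outputs are already grouped by category; `weekly_sample` and the sample ids are invented for illustration.

```python
import random
from typing import Optional


def weekly_sample(passed: dict[str, list[str]],
                  seed: Optional[int] = None) -> dict[str, str]:
    """Pick one passed item per category for manual reading."""
    rng = random.Random(seed)
    return {
        category: rng.choice(ids)
        for category, ids in sorted(passed.items())
        if ids  # skip categories with nothing to read this week
    }


passed = {
    "howto": ["howto-01", "howto-02", "howto-03"],
    "news": ["news-01", "news-02"],
    "review": [],
}
print(weekly_sample(passed, seed=7))
```

Choosing at random keeps you from always rereading the same familiar sample, and passing a seed makes a given week's picks reproducible.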

I reuse the same mechanism for copy updates in the apps I distribute on AdMob. Models will keep being swapped out. Rather than feeling anxious each time, you compare quietly against a pinned answer. Just having that preparation lowers the psychological cost of trying a new model considerably.

I hope this helps. If you also run overnight generation, I would be glad if your next migration felt a little less daunting.
