Regression-Testing Antigravity Agent Output in CI

Agent output drifts between identical runs and turns CI red for no real reason. Here is how I stabilized snapshot regression testing for Antigravity agents using a normalization layer and pytest golden files, drawn from running it in my own indie developer CI.

One morning, an agent I had moved onto a schedule produced something subtly off. The prompt was identical to the day before, yet the output had shifted. Running it twice locally gave me two different diffs. My CI snapshot test went red, as expected, but that red told me nothing: was it broken, or had it just drifted?

When you are an indie developer handing several sites to agents, this "drifting red" is the worst kind of failure, because it quietly hides real regressions. Here is the setup I built to regression-test Antigravity agent output reliably in CI, step by step.

Two identical runs, two different diffs

My first attempt was the naive one: save the output to a file and compare with git diff. That collapsed within half a day.

Agent output always contains fragments that are semantically equivalent but textually different every time: generation timestamps, run IDs, temp file paths, list ordering, JSON key order. Compare them raw, and a meaningful regression sits in the same diff pile as meaningless drift.

So the problem was never "comparing." It was "removing drift before comparing."

Why agent output fights snapshot testing

Snapshot testing itself is well proven for UI components and API responses. You record an expected value once, then assert against it on later runs.

It looks like a bad fit for agents only because three kinds of non-determinism live together in the output. The first is environmental drift (time, IDs, paths). The second is ordering drift (a set returned as an array). The third is model paraphrasing (the same intent in different words).

The first two are mechanically removable. Only the third needs a different matching strategy. Without this separation, you jump to the wrong conclusion that "snapshots are impossible." In reality you flatten what is flattenable, then apply meaning-based checks only to what remains.

A normalization layer that flattens only the noise

So I insert a normalization layer just before saving and comparing. Its single job is to rewrite environmental and ordering drift into a deterministic form.

import re
import json
 
# Replace environmental drift with stable tokens
NORMALIZERS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<TIMESTAMP>"),
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"), "<UUID>"),
    (re.compile(r"/tmp/[^\s\"']+"), "<TMPPATH>"),
    (re.compile(r"run_id=\w+"), "run_id=<RUN_ID>"),
]
 
def normalize_text(text: str) -> str:
    for pattern, token in NORMALIZERS:
        text = pattern.sub(token, text)
    return text.strip()
 
def normalize_json(payload: dict) -> str:
    # sort_keys fixes dict key order at every nesting level; array order is left as-is
    canonical = json.dumps(payload, ensure_ascii=False, sort_keys=True, indent=2)
    return normalize_text(canonical)

The key is keeping normalization a test-only preprocessor. The agent itself stays untouched. Production output is saved as-is; only the comparison looks at a drift-flattened shadow. Break this separation and you start distorting production output for the sake of testing, which makes regressions easier to miss, not harder.
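Ordering drift in arrays needs its own step, because json.dumps with sort_keys only reorders dictionary keys. Here is a minimal sketch of one way to handle it, assuming you can name the fields whose arrays are really sets. The field names in SET_LIKE_FIELDS and the helper sort_set_like are hypothetical, not part of my original setup.

```python
import json
from typing import Any

# Hypothetical: fields whose arrays are really sets (order carries no meaning)
SET_LIKE_FIELDS = {"tags", "files"}

def _sort_key(item: Any) -> str:
    # Serialize each element canonically so dicts and scalars sort consistently
    return json.dumps(item, ensure_ascii=False, sort_keys=True)

def sort_set_like(value: Any, field: str = "") -> Any:
    """Recursively sort only the arrays stored under SET_LIKE_FIELDS."""
    if isinstance(value, dict):
        return {k: sort_set_like(v, k) for k, v in value.items()}
    if isinstance(value, list):
        items = [sort_set_like(v) for v in value]
        return sorted(items, key=_sort_key) if field in SET_LIKE_FIELDS else items
    return value

# Two runs that differ only in key order and in the order of a set-like array
run_a = {"tags": ["ci", "agent"], "steps": ["build", "test"]}
run_b = {"steps": ["build", "test"], "tags": ["agent", "ci"]}

canon_a = json.dumps(sort_set_like(run_a), sort_keys=True)
canon_b = json.dumps(sort_set_like(run_b), sort_keys=True)
```

Sorting is opt-in per field on purpose: an array such as steps, where order is part of the meaning, stays untouched, so a reordering there still shows up as a diff.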

Golden-file comparison with pytest

Comparing two normalized shadows fits cleanly into pytest. Make it record-or-assert with an update flag: on the first run there is no golden file, so record; afterward, assert.

import os
from pathlib import Path
 
GOLDEN_DIR = Path(__file__).parent / "golden"
UPDATE = os.environ.get("UPDATE_GOLDEN") == "1"
 
def assert_against_golden(name: str, actual_raw: str):
    GOLDEN_DIR.mkdir(exist_ok=True)
    golden_path = GOLDEN_DIR / f"{name}.txt"
    actual = normalize_text(actual_raw)
 
    if UPDATE or not golden_path.exists():
        golden_path.write_text(actual, encoding="utf-8")
        return  # record mode: do not assert
 
    expected = golden_path.read_text(encoding="utf-8")
    assert actual == expected, (
        f"snapshot regression: {name}\n"
        f"--- expected ---\n{expected[:400]}\n"
        f"--- actual ---\n{actual[:400]}"
    )

Commit the golden files to Git. Any later change to a golden file then shows up as a diff at review time, a record that you intentionally updated the expectation. Run UPDATE_GOLDEN=1 pytest to refresh, then review that diff by hand before merging. This round trip makes every change in expected output surface as a pull-request diff.
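The assertion message in the helper truncates both sides to 400 characters, so a change deeper in the output may not be visible in the failure. One possible alternative, sketched here with the standard library's difflib.unified_diff, is to print only the changed lines. The helper name golden_diff and the file labels are hypothetical.

```python
import difflib

def golden_diff(expected: str, actual: str, name: str, context: int = 2) -> str:
    """Return a unified diff between golden and actual text (empty string if equal)."""
    lines = difflib.unified_diff(
        expected.splitlines(),
        actual.splitlines(),
        fromfile=f"golden/{name}.txt",
        tofile=f"actual/{name}",
        lineterm="",  # inputs have no trailing newlines, so emit none either
        n=context,    # number of unchanged context lines around each change
    )
    return "\n".join(lines)

expected = "title: Weekly report\nstatus: ok\nitems: 3"
actual = "title: Weekly report\nstatus: failed\nitems: 3"
report = golden_diff(expected, actual, "weekly_report")
```

It could then serve as the assertion message, for example assert actual == expected, golden_diff(expected, actual, name).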

Match structured output by meaning

The third kind of drift, model paraphrasing, survives normalization and cannot be handled by string equality. The shortcut here is Antigravity structured outputs: have the agent return schema-shaped JSON instead of free text.

Instead of comparing free text, freeze only the required fields of the extracted structure as a contract.

def assert_structured(actual: dict, contract: dict):
    # Verify only key presence and type as the contract
    for key, expected_type in contract.items():
        assert key in actual, f"missing required field: {key}"
        assert isinstance(actual[key], expected_type), (
            f"type mismatch: {key} expected {expected_type}"
        )
    # Snapshot the shape, not the prose
    shape = {k: type(v).__name__ for k, v in sorted(actual.items())}
    return shape
 
CONTRACT = {"title": str, "tags": list, "summary": str, "score": (int, float)}

The wording may drift, but the contract should not: title is a string, tags is an array, and score is a number. If any of those breaks, that is a regression. Guard the skeleton of the output, not every character. This was the realistic landing spot for meaning-based matching.
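One caveat with isinstance-based contracts in Python: bool is a subclass of int, so a score of True would satisfy (int, float). The following is a sketch of a stricter variant that also collects every violation instead of stopping at the first. The name contract_violations is hypothetical, and the sample payloads are made up for illustration.

```python
from typing import Any

def contract_violations(actual: dict[str, Any], contract: dict[str, Any]) -> list[str]:
    """Collect every contract violation instead of stopping at the first one."""
    problems = []
    for key, expected_type in contract.items():
        if key not in actual:
            problems.append(f"missing required field: {key}")
            continue
        value = actual[key]
        allowed = expected_type if isinstance(expected_type, tuple) else (expected_type,)
        # bool is a subclass of int in Python, so reject it unless bool is expected
        if isinstance(value, bool) and bool not in allowed:
            problems.append(f"type mismatch: {key} got bool")
        elif not isinstance(value, allowed):
            problems.append(f"type mismatch: {key} got {type(value).__name__}")
    return problems

CONTRACT = {"title": str, "tags": list, "summary": str, "score": (int, float)}

good = {"title": "Release notes", "tags": ["ci"], "summary": "All green.", "score": 0.9}
bad = {"title": "Release notes", "tags": "ci", "score": True}
```

Returning a list keeps the failure message complete: one red run tells you every field that broke, which matters when each agent run is slow or costly.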

Cutting CI flake: retries and a diff threshold

Some free text still cannot be eliminated, such as a description body. Failing immediately on a single mismatch there turns CI red for nothing.

For this layer only, I add two cushions. One is up to 3 retries to see whether the drift converges. The other is a diff threshold: if the normalized difference, computed as 1 minus difflib's similarity ratio, is 5% or less, I allow it.

from difflib import SequenceMatcher
 
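def soft_match(actual_raw: str, expected: str, tolerance: float = 0.05) -> bool:
    # expected is assumed to be already normalized (e.g. read from a golden file)
    actual = normalize_text(actual_raw)
    # ratio() is a similarity in [0, 1]; 1.0 means the strings are identical
    ratio = SequenceMatcher(None, actual, expected).ratio()
    drift = 1.0 - ratio
    return drift <= tolerance  # tolerate drift within 5%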
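The retry cushion can be sketched as a small loop around the same drift measure. This is a rough outline under assumptions, not my exact harness: run_agent is a hypothetical zero-argument callable that returns one already-normalized agent output, and the sketch treats 3 as the total number of attempts.

```python
from difflib import SequenceMatcher
from typing import Callable

def drift_ratio(actual: str, expected: str) -> float:
    # 1.0 minus difflib's similarity ratio; 0.0 means identical text
    return 1.0 - SequenceMatcher(None, actual, expected).ratio()

def match_with_retries(
    run_agent: Callable[[], str],  # hypothetical: returns one normalized agent output
    expected: str,
    tolerance: float = 0.05,
    max_attempts: int = 3,
) -> tuple[bool, int, float]:
    """Return (passed, attempts_used, best_drift) so retries can be logged."""
    best = 1.0
    for attempt in range(1, max_attempts + 1):
        best = min(best, drift_ratio(run_agent(), expected))
        if best <= tolerance:
            return True, attempt, best
    return False, max_attempts, best

# Simulated agent: the first output is far off, the second matches the golden text
golden = "The build passed with 3 warnings."
outputs = iter(["completely different text", golden])
passed, attempts, drift = match_with_retries(lambda: next(outputs), golden)
```

Returning the attempt count and the best drift, rather than a bare boolean, is what makes it possible to log which tests lean on retries.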

Numbers strictly, structure by contract, free text by threshold: once I split it into these three tiers, CI false positives dropped sharply. Weekly flake reruns that used to number in the dozens settled into the single digits, in my own operational experience. I log retry counts and thresholds, and treat any test that retries often as a marker for "add more contract here."

Pitfalls I hit in operation

The worst trap was over-normalizing until it absorbed a real regression. After I flattened entire paths to <TMPPATH>, a genuine bug, a wrong output directory, stopped showing up in the diff. Break the rule that normalization targets only content-irrelevant parts, and the test goes silent.
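One way to keep path normalization narrow is to replace only the random directory name and leave the rest of the path visible. The sketch below assumes temp directories look like /tmp/tmp plus a random suffix, which resembles the default naming of Python's tempfile.mkdtemp; the pattern, the token <TMPDIR>, and the sample paths are assumptions to adapt to your environment.

```python
import re

# Assumption: temp dirs look like /tmp/tmpXXXXXXXX (similar to tempfile.mkdtemp naming)
TMP_ROOT = re.compile(r"/tmp/tmp[A-Za-z0-9_]{6,}")

def normalize_tmp_root(text: str) -> str:
    # Replace only the random directory name; keep the subpath that carries meaning
    return TMP_ROOT.sub("<TMPDIR>", text)

run_1 = "wrote /tmp/tmpk3x9q2ab/out/report.json"
run_2 = "wrote /tmp/tmpz81mmw0c/out/report.json"
buggy = "wrote /tmp/tmpz81mmw0c/cache/report.json"  # wrong output directory

same = normalize_tmp_root(run_1) == normalize_tmp_root(run_2)
caught = normalize_tmp_root(buggy) != normalize_tmp_root(run_1)
```

With this narrower rule the two healthy runs compare equal, while the run that writes into the wrong directory still differs from the golden text.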

The second was lazy golden updates. UPDATE_GOLDEN=1 is so convenient that if you get into the habit of updating without reading the diff, the snapshot becomes a rubber stamp rather than a record. I switched to keeping update commits separate from content changes, so the diff always lands in review.
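A guard can back this habit up mechanically. This sketch is a stricter variant than the helper shown earlier: it refuses to record golden files when running in CI, so a missing or refreshed golden file cannot pass silently there. It assumes your CI sets the CI environment variable (GitHub Actions and GitLab CI do; check your own runner), and the name golden_mode is hypothetical.

```python
import os
from typing import Mapping

def golden_mode(golden_exists: bool, env: Mapping[str, str] = os.environ) -> str:
    """Decide between 'record' and 'assert'; raise when recording is attempted in CI."""
    in_ci = env.get("CI", "").lower() in {"1", "true"}
    wants_update = env.get("UPDATE_GOLDEN") == "1"
    if in_ci and (wants_update or not golden_exists):
        raise RuntimeError(
            "golden files must be recorded locally and committed, not written in CI"
        )
    return "record" if wants_update or not golden_exists else "assert"
```

Passing the environment as a parameter keeps the decision easy to test without touching the real process environment.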

A next step

Start by writing down three places in your agent output that change every run despite being content-irrelevant. A timestamp, an ID, or a path will almost always qualify. Register those three in the normalization layer and get a single golden-file test green. From there, the feel for separating drifting red from real red comes quickly.

Thank you for reading. If you run several agents in parallel, I hope this helps keep your CI quietly green.

