Does the New CLI Do the Same Job? An Output-Parity Gate Before Switching to Antigravity CLI
With Gemini CLI shutting down on June 18, here is how I froze the old CLI's artifacts as a golden baseline and built a parity harness to catch regressions before cutting over to Antigravity CLI — with normalization and a go/no-go gate, in code.
On June 18, Gemini CLI and the Gemini Code Assist IDE extension stop serving requests, and everything moves to the Go-based Antigravity CLI. If your automation was built around Gemini CLI, the real fear is not whether the command runs. You can confirm that in a couple of minutes. The fear is that it runs and quietly produces a slightly different result.
I run article generation and release prep for several sites across my indie projects through CLI-driven agents. Confirming that agy --version succeeds guarantees nothing. What I actually want to know is whether the draft structure has shifted, whether the release-note format still holds, and whether asset naming still follows the same rule.
So a few days before the deadline, I froze the old CLI's output as a "golden" baseline and built a gate that mechanically compares it against the new CLI. This article records the design of that parity harness, how I normalized nondeterministic output, and how I decided whether to stop or proceed — with the actual code.
"It runs" and "it produces the same result" are different claims
Migration checklists usually end at a startup check: is the binary installed, does auth pass, do the subcommands exist. Those are preconditions, not regression detection.
An agent CLI's output can shift because the model, the prompt interpretation, and the execution plan can all change at once. Antigravity CLI shares the same agent harness as Antigravity 2.0 desktop and is powered by Gemini 3.5 Flash, reportedly about 4x faster than competing frontier models. But speed says nothing about whether it makes the same decisions as before. In fact, speed introduces a new risk: a large pile of artifacts accumulates before a human reads any of them.
That is why the unit of verification should be the final artifact, not the command's exit status — the draft itself, the build output, the files that get committed. The thing I, as both reader and operator, ultimately hold in my hands.
Compare only the final artifacts
My first mistake was comparing the full standard output. An agent's stdout mixes progress logs, reasoning summaries, and timings, so the diff is hundreds of lines every run. Regressions drown in it.
Limiting comparison to three axes made it stable.
The three axes to compare
Generated file contents: the drafts, configs, and code the agent writes. This is the heart of it.
Exit code and the set of side effects: whether expected files were created or not. This catches silent failures where empty output sails through with exit code 0.
Structural metadata: for a draft, the heading count, code-block count, and the set of frontmatter keys. The skeleton, not every word.
I keep the stdout logs out of the comparison and save them for debugging only. Even in production publishing, I have seen "an empty page published with exit code 0," so the second axis — verifying side effects independently — is non-negotiable.
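The second axis can be checked without looking at the exit code at all. The following is a minimal sketch, assuming one run's artifacts sit under a single directory and are UTF-8 text; the function name check_side_effects and the expected list are hypothetical helpers for illustration, not part of either CLI.

```python
from pathlib import Path


def check_side_effects(files_dir: Path, expected: list[str]) -> list[str]:
    """Return the problems found in one run's output directory.

    `expected` holds relative paths the task is supposed to create.
    The exit code is deliberately ignored: a run can exit 0 and still
    write nothing useful.
    """
    problems = []
    for rel in expected:
        path = files_dir / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
        elif not path.read_text(encoding="utf-8").strip():
            # Whitespace-only counts as empty: the "empty page, exit 0" case.
            problems.append(f"empty: {rel}")
    return problems
```

An empty return value means every expected file exists and has content. Anything else is worth stopping on, whatever the exit code said.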
Capturing the golden baseline is the one step you can only do before cutover. While Gemini CLI still works, run representative tasks and save the artifacts.
    #!/usr/bin/env bash
    # capture_golden.sh: freeze old-CLI artifacts as the golden baseline
    set -euo pipefail
    shopt -s nullglob   # an empty parity_tasks/ yields zero iterations, not a literal "*.txt"

    GOLDEN_DIR="golden/$(date +%Y%m%d)"
    TASKS_DIR="parity_tasks"   # one prompt file per representative task
    mkdir -p "$GOLDEN_DIR"

    for task in "$TASKS_DIR"/*.txt; do
      name="$(basename "$task" .txt)"
      out="$GOLDEN_DIR/$name"
      mkdir -p "$out/files"

      # Run the old CLI non-interactively; isolate writes under out/files.
      # Adjust the invocation to whatever non-interactive mode your installed version supports.
      # Capture the real exit status: `cmd || true; echo $?` would always record 0.
      rc=0
      gemini run --prompt-file "$task" --workdir "$out/files" \
        > "$out/stdout.log" 2> "$out/stderr.log" || rc=$?
      echo "$rc" > "$out/exit_code"

      # Record the file list and content hashes
      ( cd "$out/files" && find . -type f | sort > "../manifest.txt" )
      ( cd "$out/files" && find . -type f -exec sha256sum {} \; | sort > "../hashes.txt" )
      echo "captured: $name"
    done

    echo "golden fixed at $GOLDEN_DIR"
What matters here is choosing representative tasks (parity_tasks/) that are close to real workloads. Toy examples never surface regressions. In my case I picked three families: "draft one article," "format a release note," and "propose asset renames." Too many and you cannot run them all on cutover day, so narrowing to a few that actually bite is the practical move.
Normalize the nondeterministic diffs
This is the core of the harness. LLM output almost never matches byte-for-byte. Timestamps, run IDs, slight changes in word order, punctuation drift: unless you erase these meaningless diffs, the regression signal sinks into noise.
Splitting normalization into three stages kept it readable.
    # normalize.py: reduce an artifact to a comparable canonical form
    import hashlib
    import json
    import re
    import sys
    from pathlib import Path

    # (1) Mask tokens that always change mechanically
    VOLATILE = [
        # ISO 8601 timestamps, with optional fractional seconds and UTC offset
        (re.compile(r"\d{4}-\d{2}-\d{2}T[\d:]+(?:\.\d+)?(?:Z|[+\-]\d{2}:?\d{2})?"), "<TS>"),
        (re.compile(r"\brun[-_][0-9a-f]{6,}\b", re.I), "<RUN_ID>"),
        (re.compile(r"\b[0-9a-f]{40}\b"), "<SHA>"),
        (re.compile(r"\b\d+(?:\.\d+)?\s*(?:ms|sec|s)\b"), "<DUR>"),
    ]

    # A frontmatter block counts only when it opens the file
    FRONTMATTER = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)


    def normalize_text(text: str) -> str:
        for pat, repl in VOLATILE:
            text = pat.sub(repl, text)
        # (2) Trim trailing whitespace and collapse blank-line runs
        text = "\n".join(line.rstrip() for line in text.splitlines())
        text = re.sub(r"\n{3,}", "\n\n", text)
        return text.strip()


    def structural_signature(md: str) -> dict:
        # (3) Extract the skeleton, not the prose; wording can drift, skeleton holds
        fm = FRONTMATTER.match(md)
        body = md[fm.end():] if fm else md
        return {
            "h2": len(re.findall(r"^##\s", body, re.M)),
            "h3": len(re.findall(r"^###\s", body, re.M)),
            "code_blocks": body.count("```") // 2,
            "links": len(re.findall(r"\]\(", body)),
            # Read keys only from a real leading frontmatter block, so a "---"
            # horizontal rule in the body is not mistaken for one.
            "frontmatter_keys": sorted(re.findall(r"^(\w+):", fm.group(1), re.M)) if fm else [],
        }


    if __name__ == "__main__":
        p = Path(sys.argv[1])
        raw = p.read_text(encoding="utf-8")
        print(json.dumps({
            "normalized_sha": hashlib.sha256(normalize_text(raw).encode()).hexdigest()[:16],
            "structure": structural_signature(raw),
        }, ensure_ascii=False, indent=2))
The mask list (VOLATILE) is best grown while watching real diffs. Aiming for perfection up front erases genuine regressions along with the noise. I started minimal and added only what I was certain "always changes."
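One way to watch real diffs is to run the same task twice on the same CLI and look at what still differs after masking. The sketch below assumes that workflow; MASKS, mask, and residual_diff are hypothetical names, and the single timestamp pattern only stands in for whatever list you have so far.

```python
import difflib
import re

# Hypothetical starting mask list; extend it only for tokens seen to change every run.
MASKS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}T[\d:]+Z?"), "<TS>"),
]


def mask(text: str) -> str:
    """Replace every known-volatile token with its placeholder."""
    for pattern, replacement in MASKS:
        text = pattern.sub(replacement, text)
    return text


def residual_diff(run_a: str, run_b: str) -> list[str]:
    """Return the lines that still differ between two runs after masking.

    Each surviving line is either noise the mask list misses or real
    nondeterminism; a human decides which before adding a pattern.
    """
    a = mask(run_a).splitlines()
    b = mask(run_b).splitlines()
    # ndiff prefixes removed lines with "- " and added lines with "+ ".
    return [line for line in difflib.ndiff(a, b) if line.startswith(("- ", "+ "))]
```

A line that shows up in the residual on every pair of runs is a candidate for a new mask. A line that shows up only sometimes is better left unmasked so the gate can still see it.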
Separate mechanical diffs from semantic ones
Even after normalization, old and new will not match exactly, because Gemini 3.5 Flash phrases things differently from the old model. So I handle diffs in two layers.
If the skeleton (structural_signature) matches, I treat the artifact as practically equivalent even when the prose differs. Conversely, if heading counts, code-block counts, or the frontmatter key set change, that is a structural regression and a reason to stop.
    # parity_gate.py: decide go / no-go
    import json
    import subprocess
    import sys
    from pathlib import Path

    # Resolve normalize.py next to this script so the gate works from any directory
    NORMALIZE = Path(__file__).resolve().parent / "normalize.py"


    def sig(path: Path) -> dict:
        out = subprocess.check_output([sys.executable, str(NORMALIZE), str(path)])
        return json.loads(out)


    def compare(golden_dir: Path, candidate_dir: Path) -> list[str]:
        findings = []
        for gfile in sorted((golden_dir / "files").rglob("*")):
            if not gfile.is_file():
                continue
            rel = gfile.relative_to(golden_dir / "files")
            cfile = candidate_dir / "files" / rel
            if not cfile.exists():
                findings.append(f"BLOCK missing: {rel} was not generated by the new CLI")
                continue
            g, c = sig(gfile), sig(cfile)
            if g["structure"] != c["structure"]:
                findings.append(f"BLOCK structure: {rel} {g['structure']} -> {c['structure']}")
            elif g["normalized_sha"] != c["normalized_sha"]:
                findings.append(f"REVIEW prose: {rel} (skeleton matches; human check advised)")
        return findings


    if __name__ == "__main__":
        if len(sys.argv) != 3:
            print("usage: parity_gate.py GOLDEN_DIR CANDIDATE_DIR", file=sys.stderr)
            sys.exit(2)
        findings = compare(Path(sys.argv[1]), Path(sys.argv[2]))
        blocks = [f for f in findings if f.startswith("BLOCK")]
        for f in findings:
            print(f)
        # Any BLOCK stops the cutover
        sys.exit(1 if blocks else 0)
I limit BLOCK to structural regressions and missing artifacts, and route prose drift to REVIEW for a human. If everything blocks, the gate stays red forever on cutover day and you end up waving it through anyway. Strictly narrowing what should stop you is the condition for keeping the gate useful.
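The same BLOCK/REVIEW split can be extended to the exit code and the set of created files. The sketch below assumes the candidate run was captured in the same layout as the golden one (an exit_code file and a manifest.txt per task, as capture_golden.sh writes them). The function name compare_side_effects is hypothetical, and treating extra files as REVIEW rather than BLOCK is my assumption, not a rule from the harness.

```python
from pathlib import Path


def compare_side_effects(golden_dir: Path, candidate_dir: Path) -> list[str]:
    """Compare the recorded exit codes and file manifests of two captured runs."""
    findings = []

    g_code = (golden_dir / "exit_code").read_text().strip()
    c_code = (candidate_dir / "exit_code").read_text().strip()
    if g_code != c_code:
        findings.append(f"BLOCK exit_code: {g_code} -> {c_code}")

    g_files = set((golden_dir / "manifest.txt").read_text().splitlines())
    c_files = set((candidate_dir / "manifest.txt").read_text().splitlines())
    # Files missing from the candidate are already reported by parity_gate.py,
    # so only files that appear solely in the candidate are listed here.
    for rel in sorted(c_files - g_files):
        findings.append(f"REVIEW extra: {rel} appears only in the candidate")
    return findings
```

The findings use the same prefixes as parity_gate.py, so the two lists can be concatenated before deciding the exit status.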
On cutover day, start with a canary
Once the gate is in place, stage the day's work.
Canary: run the single most painful-to-break task on the new CLI and put it through the gate.
Full parity run: if the canary is green, run the remaining representative tasks and pass them through parity_gate.py.
Production switch: eyeball the REVIEW diffs, and if acceptable, point the automation at the new CLI.
Keep a rollback line: keep the old CLI binary and config around for a while after shutdown. Requests stop after June 18, so it can no longer re-capture a golden baseline, but it remains valuable as a reference for "how it used to behave."
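The first two stages can be sketched as a small driver. This is an illustration under assumptions, not the script I ran: staged_cutover is a hypothetical name, and gate stands for any callable that runs one task on the new CLI, passes the result through the parity gate, and returns True when nothing blocks.

```python
from typing import Callable


def staged_cutover(
    canary: str,
    remaining: list[str],
    gate: Callable[[str], bool],
) -> dict:
    """Run the canary first; run the remaining tasks only if it is green."""
    if not gate(canary):
        # Stop immediately: nothing else runs while the canary is red.
        return {"go": False, "stage": "canary", "failed": [canary]}
    # Full parity run: evaluate every remaining task and collect all failures,
    # so one red task does not hide the others.
    failed = [task for task in remaining if not gate(task)]
    return {"go": not failed, "stage": "full", "failed": failed}
```

In practice gate would shell out to parity_gate.py and treat exit status 0 as green. The production switch stays a human decision after reading the REVIEW findings, so it is left out of the driver on purpose.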
I tried this on low-traffic material first, then widened it to release prep tied to AdMob revenue and to App Store submission artifacts. The canary order applies to the gate, where nothing is published yet. For the production switch itself, widening in order of "least damage if it breaks," the reverse of "most painful first," is the safe path.
A one-off task that becomes an asset
This harness looks like it exists only for the June 18 migration, but it stays. Agent CLIs and models will keep updating. Having a mechanism that asks "is it doing the same job as before" on hand each time lets you accept updates without fear.
As a next step, drop just two or three of your real tasks into parity_tasks/ and run capture_golden.sh once while the old CLI still works. Once the golden baseline is captured, you can verify the cutover whenever you are ready. Rather than waving things through under deadline pressure, compare quietly and then proceed. That margin is the insurance for anyone juggling several automations in indie development.