Does the New CLI Do the Same Job? An Output-Parity Gate Before Switching to Antigravity CLI

With Gemini CLI shutting down on June 18, here is how I froze the old CLI's artifacts as a golden baseline and built a parity harness to catch regressions before cutting over to Antigravity CLI — with normalization and a go/no-go gate, in code.

On June 18, Gemini CLI and the Gemini Code Assist IDE extension stop serving requests, and everything moves to the Go-based Antigravity CLI. If your automation was built around Gemini CLI, the real fear is not whether the command runs. You can confirm that in a couple of minutes. The fear is that it runs and quietly produces a slightly different result.

I run article generation and release prep for several sites across my indie projects through CLI-driven agents. Confirming that agy --version succeeds guarantees nothing. What I actually want to know is whether the draft structure has shifted, whether the release-note format still holds, and whether asset naming still follows the same rule.

So a few days before the deadline, I froze the old CLI's output as a "golden" baseline and built a gate that mechanically compares it against the new CLI. This article records the design of that parity harness, how I normalized nondeterministic output, and how I decided whether to stop or proceed — with the actual code.

"It runs" and "it produces the same result" are different claims

Migration checklists usually end at a startup check: is the binary installed, does auth pass, do the subcommands exist. Those are preconditions, not regression detection.

An agent CLI's output can shift because the model, the prompt interpretation, and the execution plan all change at once. Antigravity CLI shares its agent harness with Antigravity 2.0 desktop and is powered by Gemini 3.5 Flash, reportedly about 4x faster than competing frontier models. But speed says nothing about "making the same decisions as before." In fact, speed introduces a new risk: a large pile of artifacts accumulates before a human reads any of them.

That is why the unit of verification should be the final artifact, not the command's exit status — the draft itself, the build output, the files that get committed. The thing I, as both reader and operator, ultimately hold in my hands.

Compare only the final artifacts

My first mistake was comparing the full standard output. An agent's stdout mixes progress logs, reasoning summaries, and timings, so the diff is hundreds of lines every run. Regressions drown in it.

Limiting comparison to three axes made it stable.

The three axes to compare

Generated file contents: the drafts, configs, and code the agent writes. This is the heart of it.
Exit code and the set of side effects: whether expected files were created or not. This catches silent failures where empty output sails through with exit code 0.
Structural metadata: for a draft, the heading count, code-block count, and the set of frontmatter keys. The skeleton, not every word.

I keep the stdout logs out of the comparison and save them for debugging only. Even in production publishing, I have seen "an empty page published with exit code 0," so the second axis — verifying side effects independently — is non-negotiable.
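A minimal sketch of that second axis, checking side effects without looking at content. The function name side_effect_findings and its message strings are hypothetical, not part of the harness shown later; it assumes each run's files sit under one directory and that exit codes were recorded separately.

```python
from pathlib import Path


def side_effect_findings(golden_files: Path, candidate_files: Path,
                         golden_exit: int, candidate_exit: int) -> list[str]:
    """Compare exit codes and the set of generated files, ignoring content."""
    findings = []
    if golden_exit != candidate_exit:
        findings.append(f"exit code changed: {golden_exit} -> {candidate_exit}")

    def listing(root: Path) -> dict[str, int]:
        # Map each relative path to its size in bytes
        return {p.relative_to(root).as_posix(): p.stat().st_size
                for p in root.rglob("*") if p.is_file()}

    golden, candidate = listing(golden_files), listing(candidate_files)
    for rel in sorted(golden.keys() - candidate.keys()):
        findings.append(f"missing: {rel}")
    for rel in sorted(candidate.keys() - golden.keys()):
        findings.append(f"unexpected: {rel}")
    for rel in sorted(golden.keys() & candidate.keys()):
        # Empty output where the baseline had content is the silent-failure case
        if golden[rel] > 0 and candidate[rel] == 0:
            findings.append(f"empty: {rel}")
    return findings
```

The empty-file check is what catches the "exit code 0, nothing written" case, since a file that exists but has zero bytes would pass a plain existence test.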

Freeze the golden baseline with the old CLI first

This is the one step you can only do before cutover. While Gemini CLI still works, run representative tasks and save the artifacts.

#!/usr/bin/env bash
# capture_golden.sh — freeze old-CLI artifacts as the golden baseline
set -euo pipefail
 
GOLDEN_DIR="golden/$(date +%Y%m%d)"
TASKS_DIR="parity_tasks"   # one prompt file per representative task
mkdir -p "$GOLDEN_DIR"
 
for task in "$TASKS_DIR"/*.txt; do
  name="$(basename "$task" .txt)"
  out="$GOLDEN_DIR/$name"
  mkdir -p "$out/files"
 
  # Run the old CLI non-interactively; isolate writes under out/files.
  # Capture the real exit status: after `|| true`, `$?` would always be 0.
  rc=0
  gemini run --prompt-file "$task" --workdir "$out/files" \
    > "$out/stdout.log" 2> "$out/stderr.log" || rc=$?
  echo "$rc" > "$out/exit_code"
 
  # Record the file list and content hashes
  ( cd "$out/files" && find . -type f | sort > "../manifest.txt" )
  ( cd "$out/files" && find . -type f -exec sha256sum {} \; | sort > "../hashes.txt" )
  echo "captured: $name"
done
 
echo "golden fixed at $GOLDEN_DIR"

What matters here is choosing representative tasks (parity_tasks/) that are close to real workloads. Toy examples never surface regressions. In my case I picked three families: "draft one article," "format a release note," and "propose asset renames." Too many and you cannot run them all on cutover day, so narrowing to a few that actually bite is the practical move.
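The candidate side needs the same directory layout as the golden capture so the two can be compared file by file. Below is a rough Python sketch of a capture step that produces that layout. The function name capture is hypothetical, and the command is passed in by the caller because the new CLI's non-interactive flags are not assumed here.

```python
import hashlib
import subprocess
from pathlib import Path


def capture(command: list[str], out: Path) -> int:
    """Run one task and record stdout, stderr, exit code, manifest, and hashes."""
    files = out / "files"
    files.mkdir(parents=True, exist_ok=True)
    # `command` comes from the caller; it runs with out/files as working directory
    proc = subprocess.run(command, cwd=files, capture_output=True, text=True)
    (out / "stdout.log").write_text(proc.stdout, encoding="utf-8")
    (out / "stderr.log").write_text(proc.stderr, encoding="utf-8")
    (out / "exit_code").write_text(f"{proc.returncode}\n", encoding="utf-8")

    # Same two records as the shell script: file list and content hashes
    paths = sorted(p for p in files.rglob("*") if p.is_file())
    rels = [p.relative_to(files).as_posix() for p in paths]
    (out / "manifest.txt").write_text(
        "".join(f"./{r}\n" for r in rels), encoding="utf-8")
    (out / "hashes.txt").write_text(
        "".join(f"{hashlib.sha256(p.read_bytes()).hexdigest()}  ./{r}\n"
                for p, r in zip(paths, rels)),
        encoding="utf-8")
    return proc.returncode
```

Because the exit code is returned rather than swallowed, the caller decides whether a nonzero status should abort the whole capture or just be recorded.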

Normalize the nondeterministic diffs

This is the core of the harness. LLM output never matches byte-for-byte. Timestamps, run IDs, slight changes in word order, punctuation drift: unless you erase these meaningless diffs, the regression signal sinks into noise.

Splitting normalization into three stages kept it readable.

# normalize.py — reduce an artifact to a comparable canonical form
import re
import json
from pathlib import Path
 
# (1) Mask tokens that always change mechanically
VOLATILE = [
    (re.compile(r"\d{4}-\d{2}-\d{2}T[\d:]+(?:Z|[+\-]\d{2}:?\d{2})?"), "<TS>"),
    (re.compile(r"\brun[-_][0-9a-f]{6,}\b", re.I), "<RUN_ID>"),
    (re.compile(r"\b[0-9a-f]{40}\b"), "<SHA>"),
    (re.compile(r"\b\d+(?:\.\d+)?\s*(ms|s|sec)\b"), "<DUR>"),
]
 
def normalize_text(text: str) -> str:
    for pat, repl in VOLATILE:
        text = pat.sub(repl, text)
    # (2) Trim trailing whitespace and collapse blank-line runs
    text = "\n".join(line.rstrip() for line in text.splitlines())
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
 
def structural_signature(md: str) -> dict:
    # (3) Extract the skeleton, not the prose; wording can drift, skeleton holds
    # Frontmatter counts only when it opens the file; a "---" rule or table
    # separator further down must not be mistaken for it
    fm = re.match(r"---\n(.*?)\n---\n", md, flags=re.DOTALL)
    body = md[fm.end():] if fm else md
    return {
        "h2": len(re.findall(r"^##\s", body, re.M)),
        "h3": len(re.findall(r"^###\s", body, re.M)),
        "code_blocks": body.count("```") // 2,
        "links": len(re.findall(r"\]\(", body)),
        "frontmatter_keys": sorted(
            re.findall(r"^(\w+):", fm.group(1), re.M)
        ) if fm else [],
    }
 
if __name__ == "__main__":
    import sys, hashlib
    p = Path(sys.argv[1])
    raw = p.read_text(encoding="utf-8")
    print(json.dumps({
        "normalized_sha": hashlib.sha256(
            normalize_text(raw).encode()).hexdigest()[:16],
        "structure": structural_signature(raw),
    }, ensure_ascii=False, indent=2))

The mask list (VOLATILE) is best grown while watching real diffs. Aiming for perfection up front erases genuine regressions along with the noise. I started minimal and added only what I was certain "always changes."
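One way to watch the masks while growing the list is to count how often each one fires on a real artifact. The sketch below is an assumption-level helper, not part of normalize.py: mask_report is a hypothetical name, and the patterns are copied from VOLATILE so the block runs on its own.

```python
import re

# Same patterns as VOLATILE in normalize.py, repeated so the sketch is self-contained
VOLATILE = [
    (re.compile(r"\d{4}-\d{2}-\d{2}T[\d:]+(?:Z|[+\-]\d{2}:?\d{2})?"), "<TS>"),
    (re.compile(r"\brun[-_][0-9a-f]{6,}\b", re.I), "<RUN_ID>"),
    (re.compile(r"\b[0-9a-f]{40}\b"), "<SHA>"),
    (re.compile(r"\b\d+(?:\.\d+)?\s*(?:ms|s|sec)\b"), "<DUR>"),
]


def mask_report(text: str) -> dict[str, int]:
    """Count how many substitutions each mask makes, applied in list order."""
    counts = {}
    for pattern, placeholder in VOLATILE:
        text, n = pattern.subn(placeholder, text)
        counts[placeholder] = n
    return counts


sample = "run-3fa9c1d2 finished at 2025-06-10T09:15:02Z in 840 ms; retry after 5 s"
print(mask_report(sample))
```

A mask that fires far more often than expected on a draft is a hint that it is eating real content, which is the failure mode to watch for before adding anything new to the list.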

Separate mechanical diffs from semantic ones

Even after normalization, old and new will not match exactly, because Gemini 3.5 Flash phrases things differently from the old model. So I handle diffs in two layers.

If the skeleton (structural_signature) matches, I treat the artifact as practically equivalent even when the prose differs. Conversely, if heading counts, code-block counts, link counts, or the frontmatter key set change, that is a structural regression and a reason to stop.
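When a skeleton does differ, it helps to see which part moved rather than two whole dictionaries. A small sketch, assuming signatures shaped like the output of structural_signature; signature_diff and the sample values are hypothetical.

```python
def signature_diff(golden: dict, candidate: dict) -> dict:
    """Return {key: (golden_value, candidate_value)} for every key that differs."""
    keys = sorted(golden.keys() | candidate.keys())
    return {k: (golden.get(k), candidate.get(k))
            for k in keys if golden.get(k) != candidate.get(k)}


# Illustrative signatures, not real measurements
golden = {"h2": 5, "h3": 2, "code_blocks": 3, "links": 4,
          "frontmatter_keys": ["date", "tags", "title"]}
candidate = {"h2": 5, "h3": 0, "code_blocks": 3, "links": 6,
             "frontmatter_keys": ["date", "title"]}

for key, (old, new) in signature_diff(golden, candidate).items():
    print(f"{key}: {old} -> {new}")
```

Reporting per key also makes it easier to decide later whether a noisy field, such as the link count, deserves to stay in the blocking set.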

# parity_gate.py — decide go / no-go
import json, sys, subprocess
from pathlib import Path
 
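NORMALIZE = Path(__file__).with_name("normalize.py")

def sig(path):
    # Use the current interpreter and a script path relative to this file,
    # so the gate works from any working directory
    out = subprocess.check_output([sys.executable, str(NORMALIZE), str(path)])
    return json.loads(out)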
 
def compare(golden_dir: Path, candidate_dir: Path) -> list[str]:
    findings = []
    for gfile in (golden_dir / "files").rglob("*"):
        if not gfile.is_file():
            continue
        rel = gfile.relative_to(golden_dir / "files")
        cfile = candidate_dir / "files" / rel
        if not cfile.exists():
            findings.append(f"BLOCK missing: {rel} was not generated by the new CLI")
            continue
        g, c = sig(gfile), sig(cfile)
        if g["structure"] != c["structure"]:
            findings.append(f"BLOCK structure: {rel} {g['structure']} -> {c['structure']}")
        elif g["normalized_sha"] != c["normalized_sha"]:
            findings.append(f"REVIEW prose: {rel} (skeleton matches; human check advised)")
    return findings
 
if __name__ == "__main__":
    findings = compare(Path(sys.argv[1]), Path(sys.argv[2]))
    blocks = [f for f in findings if f.startswith("BLOCK")]
    for f in findings:
        print(f)
    # Any BLOCK stops the cutover
    sys.exit(1 if blocks else 0)

I limit BLOCK to structural regressions and missing artifacts, and route prose drift to REVIEW for a human. If everything blocks, the gate stays red forever on cutover day and you end up waving it through anyway. Strictly narrowing what should stop you is the condition for keeping the gate useful.
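For the REVIEW items, a readable diff makes the human check faster than opening both files side by side. A minimal sketch using the standard library's difflib; review_diff is a hypothetical helper, and its inner clean step is a simplified stand-in for normalize_text that only handles whitespace.

```python
import difflib
import re


def review_diff(golden_text: str, candidate_text: str, name: str) -> str:
    """Render a unified diff of two texts after light whitespace cleanup."""
    def clean(text: str) -> list[str]:
        # Stand-in for normalize_text: strip trailing spaces, collapse blank runs
        text = "\n".join(line.rstrip() for line in text.splitlines())
        return re.sub(r"\n{3,}", "\n\n", text).strip().splitlines()

    return "\n".join(difflib.unified_diff(
        clean(golden_text), clean(candidate_text),
        fromfile=f"golden/{name}", tofile=f"candidate/{name}",
        lineterm="", n=1))


print(review_diff("# Title\n\nold wording\n", "# Title\n\nnew wording\n", "draft.md"))
```

An empty string means the two texts are identical after cleanup, so nothing needs a human look.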

On cutover day, start with a canary

Once the gate is in place, stage the day's work.

Canary: run the single most painful-to-break task on the new CLI and put it through the gate.
Full parity run: if the canary is green, run the remaining representative tasks and pass them through parity_gate.py.
Production switch: eyeball the REVIEW diffs, and if acceptable, point the automation at the new CLI.
Keep a rollback line: keep the old CLI binary and config around for a while after shutdown. Once requests stop after June 18 it can no longer re-capture a golden baseline, but it remains valuable as a reference for "how it used to behave."
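The canary-then-full-run ordering above can be sketched as a small driver. This assumes a gate callable that takes a task name and returns findings in the same "BLOCK ..." / "REVIEW ..." string format as parity_gate.py; staged_run and has_block are hypothetical names.

```python
from typing import Callable, Iterable


def has_block(findings: list[str]) -> bool:
    """True if any finding is a blocking one."""
    return any(f.startswith("BLOCK") for f in findings)


def staged_run(canary: str, remaining: Iterable[str],
               gate: Callable[[str], list[str]]) -> tuple[bool, dict[str, list[str]]]:
    """Gate the canary first; run the remaining tasks only if it has no BLOCK."""
    results = {canary: gate(canary)}
    if not has_block(results[canary]):
        for task in remaining:
            results[task] = gate(task)
    go = not any(has_block(f) for f in results.values())
    return go, results
```

A red canary returns immediately with only its own findings, so no time is spent generating the rest of the artifacts on cutover day.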

I tried this on low-traffic material first, then widened it to release prep tied to AdMob revenue and to App Store submission artifacts. The canary puts the most painful task first because it only runs through the gate; the production switch goes the other way. Widening in order of "least damage if it breaks" is the safe path.

A one-off task that becomes an asset

This harness looks like it exists only for the June 18 migration, but it stays. Agent CLIs and models will keep updating. Having a mechanism that asks "is it doing the same job as before" on hand each time lets you accept updates without fear.

As a next step, drop just two or three of your real tasks into parity_tasks/ and run capture_golden.sh once while the old CLI still works. As long as the golden is captured, you can step into the cutover with verification at any time. Rather than waving things through under deadline pressure, compare quietly and then proceed — that margin is the insurance for anyone juggling several automations in indie development.
