Articles/Agents & Manager

◈ Agents & Manager/2026-06-25Advanced

Before a Major Update Silently Breaks Your Overnight Automation — Designing a Staged-Adoption Canary Gate

After a major update dropped my unattended run success rate from about 98% to 63% overnight, I built a staged-adoption gate that freezes the working setup, verifies a new version against a golden output in an isolated profile, and only then adopts it. Here is the design with bash and Python.

antigravity³⁹⁰ agents¹⁰³ automation⁶¹ version-pinning canary

✦ Premium Article

The morning after I pulled in the recent major update, the logs for runs that normally finish quietly were full of unfamiliar failures. About a third of the tasks meant to run overnight had stopped partway, and I spent the morning chasing why.

It was not a bug in my code. The setup that had worked fine the day before had been quietly swapped out by the update. I am an indie developer; I ship wallpaper and relaxation apps to Google Play and the App Store, and I hand a lot of the update work and content operations to Antigravity's agents. So when the ground itself shifts overnight, the results that should have accumulated while I slept fall apart instead.

The wish to use new features quickly and the wish not to break a foundation that runs unattended are always pulling against each other. This is the mechanism I built so I would not have to agonize over that tug-of-war every time: a staged-adoption canary gate, shared in the exact form I run it.

The Morning Only Half the Runs Passed

Let me record what happened precisely. I pulled the update the previous evening, tried a command or two by hand, saw them work, and went to sleep reassured. By morning the first-attempt success rate of unattended runs had fallen from its usual ~98% to about 63%.

What made it tricky was that the failures were not uniform. One task broke because the output format had changed and the downstream step could no longer read it; another was rejected by a quality gate because the response tendencies had shifted. The handful of commands I ran before bed happened to hit paths that were not broken, so they slipped past my check.

The lesson I took away is that a major update is not a single change. One update rewrites several assumptions at once.

The Three Variables an Update Moves at Once

I now think about what breaks in terms of three variables. Each breaks in its own way, so lumping them together as "the update" makes diagnosis slow.

Variable the update moves	Typical way it breaks	What freezing protects
The CLI version itself	Subcommands or output formats change, and downstream parsing breaks	The antigravity version number
Extensions and plugins	Auto-update makes an API incompatible, and behavior shifts silently	A hash of the extension list
The default model	The default is swapped, so the same instruction yields different output	The model.default value

My failure came from the third row (a swapped default model) and the first row (a changed output format) happening together. Either one alone might have been obvious, but overlapped, the symptoms mixed and the diagnosis got hard.

That is exactly why it is worth keeping the before-and-after of an update in a form a machine can explain, instead of leaving recovery to luck.

✦

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN

✦From a case where unattended success dropped from about 98% to 63% after an update, learn to separate what breaks into three distinct variables

✦A bash and Python implementation that freezes the working CLI, extensions, and default model into an env.lock.json so any update can be rolled back in one step

✦A canary verification runner that compares against a golden output and uses an exit-code contract to gate adoption into your automated pipeline

Secure payment via Stripe · Cancel anytime

✦

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

Unlock all articles with Membership →

The Core Idea Is "Freeze, Then Update"

The design philosophy here is plain. Freeze the working setup first, make that frozen point restorable at any moment, and only then try the new version somewhere else. Adopt it into production's unattended runs only once you have confirmed it behaves as it did before.

Put differently: turn an update from an irreversible one-shot into a reversible procedure that has a pass/fail result. As an indie developer I cannot pour staff hours into a verification environment. So having two things ready first — reversibility, and a pass/fail a machine can decide — is what lets me trust the nightly runs.

The Procedure — A Staged-Adoption Canary Gate

The actual flow has four stages. Walk them in order and there is nothing to agonize over at each decision.

Freeze the working setup into a manifest, keeping one prior generation as a fallback.
Install the new version into a profile isolated from production. Do not touch the production settings.
Run a representative task once in the isolated profile and compare it against a golden output.
If the similarity score clears the threshold, adopt; if not, roll back automatically to the frozen version.

The heart of these four stages is the third, the golden comparison. We do not ask whether the new version got faster or smarter, only whether it behaves the way it used to. For unattended runs, getting the same result today as yesterday matters more than sharp brilliance.

Implementation 1: Freeze the Working Setup

First, lock the currently working setup into a single file. The point is to record not just the version number but the hash of the extension list and the default model alongside it. Capturing all three variables together lets you later explain what moved as a diff.

#!/usr/bin/env bash
# capture-env.sh — freeze a verified, working setup
set -euo pipefail
OUT="env.lock.json"
ag_version="$(antigravity --version 2>/dev/null | head -1)"
node_version="$(node -v)"
default_model="$(antigravity config get model.default 2>/dev/null || echo unknown)"
ext_hash="$(antigravity ext list --json | sha256sum | cut -c1-16)"
cat > "$OUT" <<JSON
{
  "captured_at": "$(date -u +%FT%TZ)",
  "antigravity": "${ag_version}",
  "node": "${node_version}",
  "default_model": "${default_model}",
  "extensions_sha": "${ext_hash}"
}
JSON
echo "frozen -> $OUT"

Keep this file in your repository and the history of updates lives on as a git diff. Once I started committing env.lock.json, I could trace "when did what change," and the diagnosis mornings got much shorter.

Implementation 2: The Canary Verification Runner

Next, run the new version once in the isolated profile and compare it against the previous output. Demanding an exact match would fail you on every tiny fluctuation, so judge "did the behavior change" with a similarity threshold.

# canary.py — try the new version in an isolated profile, return pass/fail via exit code
import json, subprocess, sys, difflib
 
GOLDEN = "canary/golden_output.txt"
TASK   = "canary/task_prompt.txt"
 
def run_on(profile: str) -> str:
    out = subprocess.run(
        ["antigravity", "run", "--profile", profile, "--prompt-file", TASK],
        capture_output=True, text=True, timeout=900,
    )
    if out.returncode != 0:
        raise RuntimeError(f"run failed: {out.stderr[:200]}")
    return out.stdout.strip()
 
def similarity(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a, b).ratio()
 
def main() -> int:
    golden = open(GOLDEN, encoding="utf-8").read().strip()
    actual = run_on("canary-next")
    score = similarity(golden, actual)
    print(json.dumps({"similarity": round(score, 3)}))
    # below 0.92, treat behavior as changed and block adoption
    return 0 if score >= 0.92 else 1
 
if __name__ == "__main__":
    sys.exit(main())

The 0.92 threshold is my own value; tune it to how noisy your tasks are. If the output mixes in dates or random values, mask those parts before comparing to cut false positives. Returning pass/fail through the exit code alone means a downstream shell can branch on it directly.

Implementation 3: Watch Health and Pin Back Automatically

Finally, split adoption and rollback by the canary result. Self-update only on a pass, then re-freeze the new setup. On a fail, simply re-pin to the frozen version and you are back where you started.

#!/usr/bin/env bash
# adopt-or-rollback.sh — adopt only on a canary pass, otherwise return to the frozen version
set -euo pipefail
if python3 canary.py; then
  cp env.lock.json env.lock.prev.json     # keep one prior generation
  antigravity self-update --channel stable
  ./capture-env.sh                         # re-freeze the new setup
  echo "adopted"
else
  pinned="$(jq -r .antigravity env.lock.json)"
  antigravity self-update --pin "$pinned"
  echo "rolled back to ${pinned}"
fi

I run these three scripts once, in the window before the unattended runs begin. Whether or not there is an update, every cycle goes through "freeze, canary, adopt or roll back," so the foundation can be swapped safely even while no one is awake. I recommend placing this about two hours before the nightly tasks start, so that even if verification surfaces a problem, there is slack before production begins.

The Numbers and Pitfalls I Saw in Practice

Here is the result after running this for a while. The numbers are measured in my own environment and will vary with the number and kind of tasks you run.

Metric	Right after an update, no measures	After the canary gate
First-attempt unattended success rate	about 63%	about 98%
Time to notice something is wrong	about 8 hours, not until next morning	0 minutes, caught before adoption
Time to return to the frozen version	about 40 minutes, by hand	about 12 minutes, automatic

More than the success rate itself, what mattered to me was that the time to notice trouble went from "next morning" to zero. It cut off the worst path: spending a whole night with a broken setup.

A few pitfalls worth recording. First, if you leave extension auto-update on, freezing the CLI still lets the extensions move behind your back. I turned extension auto-update off and now update them explicitly, together with updates to env.lock.json. Second, if you create a golden output once and forget it, it will keep failing even legitimate spec changes. Put a step on the adopt side that always refreshes the golden from the newly adopted output, so verification does not go stale. Third, an isolated profile can accidentally read production's settings file. Separate the profile's config directory fully via environment variables and avoid handing it production credentials.

The more a task feeds directly into the next day's numbers — like AdMob revenue aggregation or release prep for Google Play — the more these two things, reversibility and a clear pass/fail, pay off.

Where to Start

You do not need all three scripts from day one. The first step I recommend is to run only capture-env.sh starting today and commit env.lock.json to git. That alone lets you pull out "what was it before" as a diff when the next major update arrives. The canary and the automatic rollback can be added once that foundation exists.

A stable foundation is not a flashy result, but at the scale of running things unattended, it matters more than anything. If it brings a little peace of mind to your nights as someone running several things alone, I would be glad.

Thank You for Reading

Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.