"The nightly batch that ran fine yesterday came up empty this morning." During the June week when Antigravity shipped v2.2.1, v2.1.4, and v2.0.11 back to back, I ran straight into exactly this. Interactively, nothing looked different — but my scheduled run, which calls agy headless, was failing downstream because a subtle change in the output format broke the parser.
Reading the official changelog every morning isn't realistic, and even when you do, it won't tell you whether a change affects your script. What you need is a mechanism that decides, mechanically and quickly, whether the output of your own use case changed across an update. Below, we build a lightweight smoke test that detects regressions against a golden output and automatically reverts to the previous version when something breaks.
Why the breakage is so quiet
Point releases are dangerous precisely because you don't brace for them, unlike a major upgrade. They're mostly bug fixes and performance work, so it never occurs to you that your automation might stop. Yet what matters in unattended operation isn't the presence of features but the fine details of the output.
Here's what actually bit me: a progress line appeared at the top of the headless output, JSON formatting shifted so the trailing newline came and went, and the exit code stayed 0 while the body came back empty. None of these are noticeable in the interactive UI. But for a downstream stage consuming the output with jq or a regex, they're fatal. If you judge success by exit code alone, you'll pass an empty body forward as a "success."
So the boundary to defend is not "did the command succeed" but "did output come back in the usual shape." Frame it that way and the fix becomes far more concrete.
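As a minimal sketch of that framing, the check below accepts output only when it has the expected outer shape, regardless of exit code. The helper name looks_like_json_object and the sample progress line are hypothetical; it is not a JSON parser, only a cheap first gate that assumes the usual output is a single JSON object.

```shell
#!/usr/bin/env bash
# looks_like_json_object: hypothetical helper, not part of agy.
# Accepts output only if, after trimming surrounding whitespace,
# it starts with "{" and ends with "}".
looks_like_json_object() {
  local out="$1" lead trail
  lead="${out%%[![:space:]]*}"    # leading whitespace, if any
  out="${out#"$lead"}"
  trail="${out##*[![:space:]]}"   # trailing whitespace, if any
  out="${out%"$trail"}"
  [[ "$out" == \{*\} ]]
}

# Made-up samples mirroring the failure modes described above
for sample in \
  '{"text":"hi"}' \
  $'{"text":"hi"}\n' \
  $'Thinking...\n{"text":"hi"}' \
  ''; do
  if looks_like_json_object "$sample"; then echo "accept"; else echo "reject"; fi
done
```

The loop prints accept, accept, reject, reject: a trailing newline is tolerated, while a progress line at the top or an empty body is rejected even though the command that produced it may have exited 0.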
Capture a golden output once
The first thing to do is record, just once, the output of a representative input on the "current version" that runs stably. An exact-match comparison would fail every time on dates and random values, so normalize the volatile parts before comparing.
#!/usr/bin/env bash
# capture_golden.sh — save a baseline from the version that works today
set -euo pipefail
GOLDEN_DIR="${HOME}/.agy_smoke"
mkdir -p "$GOLDEN_DIR"
# reproduce the same headless call as production (swap to match your batch)
run_case() {
  agy run --headless --no-tty \
    --prompt-file "$1" \
    --output json 2>/dev/null
}
# mask volatile values (dates, UUIDs, elapsed time) so comparison is stable
# elapsed_ms is replaced with 0, not a bare placeholder, so the result stays valid JSON for jq
normalize() {
  sed -E \
    -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z?/<TS>/g' \
    -e 's/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/<UUID>/g' \
    -e 's/"elapsed_ms": *[0-9]+/"elapsed_ms":0/g'
}
agy --version > "$GOLDEN_DIR/version.txt"
for f in cases/*.prompt; do
  name="$(basename "$f" .prompt)"
  run_case "$f" | normalize > "$GOLDEN_DIR/${name}.golden"
  echo "captured: $name"
done
echo "golden version: $(cat "$GOLDEN_DIR/version.txt")"
The important thing is that the smoke inputs in cases/*.prompt are the same kind of input your production batch sends. I keep only three representative cases: article generation, summarization, and JSON extraction. I don't aim for coverage; the policy is to thinly defend only the paths that hurt most if they break. Too many cases make each check heavy, and you end up never running it.
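To see what the normalization step buys you, here is a small sketch that runs two made-up outputs through a copy of normalize. The sample IDs, timestamps, and field names are invented for illustration, and this copy keeps the elapsed_ms replacement numeric so the normalized text is still valid JSON.

```shell
#!/usr/bin/env bash
# Copy of normalize for a standalone demo (elapsed_ms kept numeric)
normalize() {
  sed -E \
    -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z?/<TS>/g' \
    -e 's/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/<UUID>/g' \
    -e 's/"elapsed_ms": *[0-9]+/"elapsed_ms":0/g'
}

# Two hypothetical runs that differ only in volatile fields
run_a='{"id":"3f2b8c1e-9a4d-4e2b-8f11-0c5d7a9e1b22","at":"2025-06-10T03:00:01.123Z","elapsed_ms":812,"text":"ok"}'
run_b='{"id":"a81c0f77-1b2c-4d3e-9f00-112233445566","at":"2025-06-11T03:00:04.456Z","elapsed_ms":1290,"text":"ok"}'

norm_a="$(printf '%s\n' "$run_a" | normalize)"
norm_b="$(printf '%s\n' "$run_b" | normalize)"

echo "$norm_a"
if [ "$norm_a" = "$norm_b" ]; then
  echo "stable across runs"
else
  echo "still volatile: add another mask"
fi
```

Both runs normalize to {"id":"<UUID>","at":"<TS>","elapsed_ms":0,"text":"ok"}, so the comparison reports them as stable. If your real output carries other volatile fields, the second branch is the cue to add one more sed expression.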
Smoke before the update, roll back if it fails
With a baseline in place, run the same cases again right before an update and compare against the golden. If a structural difference appears, treat it as a likely sign that the update breaks your use case, and stop.
#!/usr/bin/env bash
# smoke_check.sh — diff current output against the golden. exit 1 on drift
set -euo pipefail
GOLDEN_DIR="${HOME}/.agy_smoke"
run_case() { agy run --headless --no-tty --prompt-file "$1" --output json 2>/dev/null; }
# mask volatile values; elapsed_ms stays numeric so jq can still parse the result
normalize() {
  sed -E \
    -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z?/<TS>/g' \
    -e 's/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/<UUID>/g' \
    -e 's/"elapsed_ms": *[0-9]+/"elapsed_ms":0/g'
}
fail=0
for g in "$GOLDEN_DIR"/*.golden; do
  name="$(basename "$g" .golden)"
  # "|| true" keeps set -e from aborting the loop when agy exits non-zero;
  # a crashed run then surfaces below as an empty body
  current="$(run_case "cases/${name}.prompt" | normalize || true)"
  # 1) body must be non-empty (reject empty returns even on exit code 0)
  if [ -z "$(printf '%s\n' "$current" | jq -r '.text // empty' 2>/dev/null)" ]; then
    echo "✗ ${name}: empty body"
    fail=1; continue
  fi
  # 2) structural diff (detect key-set changes; ignore value contents)
  ref_keys="$(jq -S 'keys' "$g" 2>/dev/null || echo '[]')"
  cur_keys="$(printf '%s\n' "$current" | jq -S 'keys' 2>/dev/null || echo '[]')"
  if [ "$ref_keys" != "$cur_keys" ]; then
    echo "✗ ${name}: schema drift"
    diff <(echo "$ref_keys") <(echo "$cur_keys") || true
    fail=1; continue
  fi
  echo "✓ ${name}"
done
exit "$fail"
There's a reason I narrow the comparison to "key-set changes." The body text naturally drifts with each update, so checking for an exact match turns the smoke red every time and erodes trust in it. What the downstream parser actually depends on is the output's structure, meaning which keys exist, so I'm strict only there. The empty-body check sits separately to catch the quietest failure mode of all: exit code 0 with empty content.
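The keys filter used above looks only at the top level of the JSON. If your downstream parser also reads nested fields, one possible extension is to compare every key path instead. The sketch below is an assumption layered on the original approach, not part of it: key_paths is a hypothetical helper, and the field names in the usage comments are only illustrative.

```shell
#!/usr/bin/env bash
# key_paths: hypothetical helper (not in the original scripts).
# Reads one JSON document on stdin and prints a sorted, de-duplicated list of
# every key path in it. Array indices collapse to "[]" so list length is ignored.
key_paths() {
  jq -c '[paths | map(if type == "number" then "[]" else . end) | join(".")] | unique'
}

# Usage sketch inside the smoke loop, in place of the top-level keys comparison:
#   ref_keys="$(key_paths < "$g" 2>/dev/null || echo '[]')"
#   cur_keys="$(printf '%s\n' "$current" | key_paths 2>/dev/null || echo '[]')"
```

Values are still ignored, so a changed body text does not trip the check, while a renamed nested key does. One caveat: an array that is empty in one run and populated in the other shows up as drift, because the "[]" path only appears when the array has at least one element.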
Make update and verification one flow
Finally, fold version pinning, the update, the smoke, and automatic rollback into one script. You're turning the rule of thumb "phased migration is safer" into a fixed procedure.
#!/usr/bin/env bash
# guarded_upgrade.sh — commit the update only if smoke passes, else revert
set -euo pipefail
GOLDEN_DIR="${HOME}/.agy_smoke"
# fail fast on a missing argument, before spending time on a smoke run
target="${1:?usage: guarded_upgrade.sh <version>}"
prev="$(agy --version | awk '{print $NF}')"
echo "current: $prev"
# 1) pre-update smoke (prove the current environment is healthy first)
if ! ./smoke_check.sh; then
  echo "Already red before updating. Don't update until you re-baseline."; exit 1
fi
# 2) update (pin the version explicitly; avoid auto-tracking latest)
agy self update --version "$target"
# 3) post-update smoke. roll back immediately if it fails
if ./smoke_check.sh; then
  echo "✅ $target passed smoke. Committing."
  agy --version > "$GOLDEN_DIR/version.txt"
else
  echo "🚨 $target failed smoke. Reverting to $prev."
  agy self update --version "$prev"
  # confirm the rollback actually restored a healthy state
  if ./smoke_check.sh; then
    echo "rollback complete (healthy on $prev)"
  else
    echo "rollback finished, but smoke is still red on $prev. Investigate by hand."
  fi
  exit 1
fi
The option names and subcommands for agy self update vary by environment and version, so confirm the actual syntax with agy self update --help before wiring it in. The core idea is "don't auto-track latest; explicitly pin only the version that passed verification"; once you adapt the command details to your install, the rest works as is.
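The pinning idea can also be enforced from the other side, at the top of the production batch. The following is a sketch under assumptions: version_matches_pin is a hypothetical helper, and it relies only on the version.txt file that the scripts above write with agy --version.

```shell
#!/usr/bin/env bash
# version_matches_pin: hypothetical guard for the start of the nightly batch.
# Returns 0 if the installed version string equals the last verified one,
# 1 if it differs, and 2 if no verified version has been recorded yet.
version_matches_pin() {
  local pin_file="$1" installed="$2"
  [ -r "$pin_file" ] || return 2
  [ "$installed" = "$(cat "$pin_file")" ]
}

# Usage sketch in the batch script:
#   version_matches_pin "${HOME}/.agy_smoke/version.txt" "$(agy --version)" || {
#     echo "agy changed since the last verified smoke; refusing to run" >&2
#     exit 1
#   }
```

With this guard in place, an update that arrived outside guarded_upgrade.sh stops the batch before it produces output, instead of passing an unverified result downstream.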
Since putting this in place, my attitude toward point releases has changed. Instead of postponing updates out of fear or jumping to the latest unguarded, I take an update in if a 30-second smoke passes and revert if it fails. I run several Dolice Labs sites on my own and can't spend time deliberating over each update, so this "proceed if it passes, revert if it fails" automation has paid off.
Shift to defending small
Against the uncertainty of point releases, reading the changelog and bracing yourself isn't sustainable. Instead, baseline a single point, the output of your own use case, and compare it mechanically across the update. Just shifting what you defend from "did the command succeed" to "the usual output shape" catches most quiet regressions.
As a next step, pick the one path that hurts most if it breaks and capture a baseline with capture_golden.sh. Even a single case gives you a foothold to decide, in seconds, "should I take this in?" when the next point release lands. I hope it helps anyone caught between unattended operation and a steady stream of updates.