When Your Agent Automation Breaks: How Many Minutes to Recovery?
As Antigravity 2.0 adds desktop, CLI, and SDK surfaces, the things you must restore after a failure multiply too. As an indie developer running several sites on autopilot, I lay out a three-layer recovery design covering credentials, definitions, and state, plus a monthly restore drill.
One morning I realized my auto-publishing had stopped because the usual notification never arrived. I had wiped my working machine the night before, expecting everything to come back the next morning. What came back was only the generation pipeline's code. The credentials, the schedule, and the half-finished generation state were nowhere to be found.
When you run four sites on autopilot as an indie developer, stopping is not an "if" but a "when." The question is never whether it stops, but how many minutes it takes to return to the previous state. Now that Antigravity 2.0 has spread operations across a desktop app, a CLI, and an SDK, the surface you have to restore has grown by the same amount. Here I want to describe a design that turns recovery from "remember it under pressure" into "follow a procedure."
Treating recovery as one blob guarantees gaps
My first mistake was treating the automation as a single lump. "The code is in git, so we're fine" — that comfort quietly erased credentials and schedule definitions from view.
Recovery targets differ in nature. When you try to protect things of different natures in the same place at the same cadence, the whole thing gets dragged down to whatever is hardest to protect. So I split what I protect into three layers:
Credential layer: API keys, CLI tokens, Google Play and AdMob credentials
State layer: in-progress checkpoints, run logs, records of what has already been published
These three differ both in how much it hurts to lose them and in how you bring them back. Only after splitting them could I choose a protection that fits each.
Give each layer its own recovery point objective
You do not need to back everything up at the same frequency. I give each layer its own recovery point objective (RPO) — how far back in time you can tolerate losing data to.
Layer
Target RPO
Where it lives
Credentials
0 (always current)
Secret manager + an offline copy
Definitions
24 hours
git repository
State
1 hour
Auto-synced to object storage
Credentials do not "roll back" usefully. Restoring an old token still forces re-authentication, so instead of an RPO, the right idea is to keep exactly one source of truth somewhere you can always reach. I treat a secret manager as that source of truth, but I also keep the login path to the management console — the entry point for everything else — in an offline copy. If I cannot reach that, I cannot reach anything else.
Definitions belong in git, history and all. Keeping the schedule and the agent instructions under version control lets me return not to the broken moment but to the last known-good state before it broke.
State changes fast, so it needs a short sync interval. I copy only the in-progress checkpoints and the publish records to object storage every hour. Syncing everything hourly would be heavy, so the trick is to carve out only what changes quickly.
✦
Thank you for reading this far.
Continue Reading
What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.
WHAT YOU'LL LEARN
✦A concrete design that splits recovery into credentials, definitions, and state, each with its own recovery point objective (RPO)
✦The three classic ways a backup exists but cannot be restored, and a monthly restore drill that surfaces them first
✦The breakdown of how my own restore time dropped from 2 hours to 25 minutes after rebuilding from an empty machine
Secure payment via Stripe · Cancel anytime
✦
Unlock This Article
Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.
Here is the heart of it. I have grouped my "the backup existed but would not restore" experiences into three types.
Never restored: the capture is automated, but the restore path has never been walked once. In a real incident it becomes first-time work, and you burn time.
Credentials escape into manual steps: you can restore code and definitions, but the re-auth path lives in human memory and stalls at night or on the move.
Partial recovery diverges: code is current, definitions are from yesterday, state is from three days ago — versions drift and the "recovered" pipeline runs on a stale schedule.
What all three share is that the missing piece is not "capturing the backup" but "verifying the restore." Capture succeeds quietly; a restore does not count as successful until you actually perform it.
Make the restore drill a monthly habit
So I turned restoring into a monthly routine rather than a once-a-year emergency. From an empty directory, I rebuild the automation and carry it through to the point where the first article is correctly generated and ready to publish.
# restore-drill.sh — rebuild the automation from an empty environmentset -euo pipefailDRILL_DIR="$HOME/restore-drill/$(TZ=Asia/Tokyo date +%Y%m%d)"mkdir -p "$DRILL_DIR" && cd "$DRILL_DIR"# 1) Pull the definition layer from gitgit clone --depth 1 "$DEFINITION_REPO" definitions# 2) Pull the state layer from storage (latest checkpoint)mkdir -p staterclone copy "remote:agent-state/latest" state --checkers 8# 3) Inject the credential layer from the secret manager (export to env only; never to a file)eval "$(secret-cli export --scope agent-automation)"# 4) Actually generate the first article, up to just before publishingagy run definitions/pipeline.yaml --input state/next.json --dry-runecho "✅ Restore verified: $(TZ=Asia/Tokyo date +%H:%M)"
The --dry-run flag exists so the drill never double-publishes to production. It runs generation and assembly at production parity and stops only the final publish. Forget to stop there, and the drill becomes the incident.
If you record the elapsed time each drill, the quality of your recovery design shows up as a number. My first run took two hours: 50 minutes hunting for the re-auth path, 40 minutes deciding which definition was authoritative, and the rest on real work. After fixing the path into a runbook and consolidating the authoritative definition into a single repository, the third run dropped to 25 minutes. Almost all of the savings came not from working faster but from removing hesitation.
What to automate and what to leave to a human
I understand the urge to fully automate recovery, yet I deliberately keep re-authentication as a human step. A setup that can restore credentials with zero human action is, flipped around, a setup that can be hijacked with zero human action if someone seizes the mechanism.
My rule is simple. Things that hurt little when lost and change often (the state layer) go fully automatic. Things that are fatal when lost and rarely change (the credential layer) keep one last manual step. I draw this line not by leaning toward convenience or safety as a whole, but from the premise that the optimum differs per layer.
Running apps for years as an indie developer, I have several places — the AdMob console, App Store delivery settings — where a stoppage hits revenue directly. The more a place is like that, the more it is worth touching its restore path once a month.
How to judge that a restore actually worked
When you run the drill, ending on "it feels like it's back" drains the value. I now judge a restore as successful by three checks rather than by feel.
First, the first article must pass the quality gate. Second, the schedule's next run time must be computed correctly. Third, the publish record must not diverge from the state layer. Only when all three hold do I call the restore a success.
The third is the easy one to miss. If you restore the state layer from an old point, the agent can mistake an already-published article for unpublished and head toward a double post. I once nearly shipped the same article twice this way and was saved by an idempotency check right before publishing. Ever since, the last step of every restore reconciles the publish record against reality.
The effect showed up in numbers too. Before adding these checks, even a completed drill left about 20% small mismatches in production. After making the three checks mandatory, post-deploy fixes dropped to nearly zero. The stricter the judgment, the more it reduces after-the-fact rework rather than the restore itself. My recommendation is to decide, before you optimize for speed, on the bar that lets you say with confidence that the restore is done.
Your next step
Start by writing your own automation out into the three layers. Just laying credentials, definitions, and state into three columns and noting where each thing currently lives will surface the layer you are not protecting. In my case the state layer was missing entirely, and that is what led to the morning above. Once you have written it out, set aside one weekend and walk a single restore from an empty directory. That first elapsed time is the current position of your recovery design.
Share
Thank You for Reading
Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.