Handing Dependency Updates to Antigravity Agents — Risk Tiers, Verification, and Rollback
How far can you trust Antigravity agents with dependency updates? A four-tier risk model that corrects semver optimism, worktree-isolated lots, a fixed verification script, and a rollback-first ledger — the operations design I settled on while maintaining multiple apps.
On a Monday morning I opened my repositories to find 47 pending dependency update notifications. When you maintain several apps and sites in parallel, that is not an unusual number. Reading every release note costs half a day. Ignoring them is worse: security fixes slip past you while the diff quietly grows, and the eventual upgrade becomes more dangerous the longer you wait.
I once let a minor Firebase SDK update sit for three months because everything was working. By the time I finally upgraded, the chained dependency changes had grown so large that the repair consumed an entire weekend. That experience changed how I think about dependency maintenance: it is not a chore you batch up and attack once a quarter; it is a small stream you keep flowing. And small, continuous, well-defined work happens to be exactly the kind of work agents are best suited to take over.
Antigravity 2.0 made that handover much more practical, with proper worktree support and scheduled runs. What follows is the operations design I settled on, built from four parts — risk tiers, lot batching, verification, and rollback — with the configuration files and scripts included in a form you can reuse directly.
Why updates are easy to delegate, and why you still cannot delegate all of them
Dependency updates suit agents for three reasons. First, the procedure is highly repetitive. Second, success is machine-checkable: build, type check, tests, and smoke checks form a ready-made verdict. Third, the work recurs weekly, so any automation investment pays itself back quickly.
The failure profile, however, is asymmetric. Nobody notices fifty clean updates, but a single bad one blocks a release. And the adopt-or-defer decision mixes in context that lives outside the repository: you might skip updates for a module you plan to rewrite this year, or avoid touching an ads SDK in the same window as a policy change. No agent can infer those constraints from the codebase alone.
So I split the job into two halves: proposal, verification, and record-keeping go to the agent; adoption decisions go to either automation or a human, depending on the risk tier. Instead of granting uniform discretion, the discretion scales with how dangerous the change is. Everything below exists to implement that graduated autonomy.
Do not take semver at face value — a four-tier risk model
Semver is a useful promise, not a guarantee. Breaking changes shipped as patches are real; in 0.x ecosystems a minor is effectively a major; and native SDKs do not even promise to follow the same convention. So I start from the formal version bump and then correct it with track record, ending up with four tiers.
Tier 0 — auto-merge: patch updates of transitive dependencies, fully contained in the lockfile. Security-advisory fixes belong here too
Tier 1 — auto-merge after overnight verification: patches of direct dependencies that do not touch public APIs, type definitions, or build configuration
Tier 2 — agent proposes, human approves: minors of direct dependencies. Anything that moves peer dependency requirements always lands here
Tier 3 — human-led, agent researches only: majors, native SDKs, build toolchain (Gradle, Xcode, bundlers), and everything in 0.x
Two escalation rules sit on top. Any package that has shipped a breaking change inside a patch or minor within the last twelve months gets bumped one tier, unconditionally. So does any package that publishes empty release notes — a change that does not explain itself is dangerous by definition.
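A minimal sketch of how the starting tier and the one-step escalation could be computed from a version bump. The function names and argument order are assumptions for illustration. Treating a transitive minor as Tier 2 is also an assumption, since the tiers above only place transitive patches. The conditions that cannot be read off a version number (public API changes, native SDKs, build toolchain) are left to the policy file and are not modeled here.

```bash
# Hypothetical helpers; names and argument order are illustrative assumptions.

# base_tier <direct|transitive> <from-version> <to-version>
# Prints the starting tier implied by the version bump alone.
base_tier() {
  local kind="$1" from="$2" to="$3"
  local from_major="${from%%.*}" to_major="${to%%.*}"
  local from_rest="${from#*.}" to_rest="${to#*.}"
  local from_minor="${from_rest%%.*}" to_minor="${to_rest%%.*}"

  if [ "$from_major" = "0" ] || [ "$from_major" != "$to_major" ]; then
    echo 3   # everything in 0.x, and every major bump
  elif [ "$from_minor" != "$to_minor" ]; then
    echo 2   # minor bump (assumption: transitive minors are treated the same)
  elif [ "$kind" = "transitive" ]; then
    echo 0   # transitive patch, contained in the lockfile
  else
    echo 1   # direct patch
  fi
}

# escalate <tier>
# Bumps a tier by one, capped at Tier 3.
escalate() {
  local tier="$1"
  if [ "$tier" -lt 3 ]; then
    tier=$((tier + 1))
  fi
  echo "$tier"
}
```

For a direct patch of a package that broke something inside a patch last year, `escalate "$(base_tier direct 4.1.0 4.1.2)"` prints `2`, so the update waits for human approval instead of merging overnight.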
The classification lives in a JSON file at the repository root, deps-policy.json, and the agent is told to treat that file as its only source of judgment.
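As a rough illustration of what such a file might hold, here is a hypothetical layout written from a shell heredoc so it can live in a setup script. Only the `escalations` block and the `minTier` field are names that appear in this article. Every other key, the action labels, the package names, and the floor values are assumptions made up for the sketch.

```bash
# write_example_policy <path>
# Writes a HYPOTHETICAL deps-policy.json. Only "escalations" and "minTier"
# come from the article; all other keys and values are illustrative.
write_example_policy() {
  cat > "$1" <<'EOF'
{
  "tiers": {
    "0": { "action": "auto-merge" },
    "1": { "action": "auto-merge-after-overnight-verify" },
    "2": { "action": "propose-and-wait" },
    "3": { "action": "research-only" }
  },
  "maxLotSize": 5,
  "escalations": {
    "example-date-lib": { "minTier": 2, "reason": "patch changed default timezone handling" },
    "example-util-lib": { "minTier": 2, "reason": "patch inflated the bundle" }
  }
}
EOF
}

# effective_tier <computed-tier> <min-tier>
# Applies a per-package floor: the result is never lower than <min-tier>.
effective_tier() {
  if [ "$1" -lt "$2" ]; then
    echo "$2"
  else
    echo "$1"
  fi
}
```

With a floor of 2, `effective_tier 1 2` prints `2`: a direct patch that would normally merge overnight is held for approval.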
The escalations block is the part that matters. A package the wider world considers safe still gets a minTier floor if it has burned me once in my own environment. This table is institutional memory in file form; I expect to grow it after every incident, not to get it right up front.
Lot batching and worktree isolation — how you group updates decides your incident rate
One branch per update is the ideal, but it collapses under a few dozen notifications per week. The opposite extreme — all 47 in one branch — is the worst possible move: you cannot tell which update broke the build, and the rollback unit is destroyed.
Between the two I use a unit I call a lot, assembled with three rules.
A lot contains updates of a single tier only, so the rollback unit stays uniform
A lot stays within one ecosystem — npm and CocoaPods never mix
A lot holds at most five updates, but packages sharing a peer dependency deliberately ride in the same lot
The third rule came from experience. Upgrade two packages that share a peer in separate lots and the resolver oscillates, rewriting the lockfile back and forth. Upgrade them together and it converges in one pass.
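The first two rules and the size cap are mechanical enough to check before an agent ever touches a lot. A minimal sketch, assuming each update is described by one line of `<package> <tier> <ecosystem>`. That input format and the function name are hypothetical, and the peer-sharing rule is not checked here because it needs resolver data.

```bash
# validate_lot
# Reads one update per line on stdin: "<package> <tier> <ecosystem>".
# Hypothetical input format; checks single tier, single ecosystem, max five.
validate_lot() {
  local max=5 count=0 lot_tier="" lot_eco="" pkg tier eco
  while read -r pkg tier eco; do
    if [ -z "$pkg" ]; then
      continue   # skip blank lines
    fi
    count=$((count + 1))
    if [ "$count" -eq 1 ]; then
      lot_tier="$tier"
      lot_eco="$eco"
    fi
    if [ "$tier" != "$lot_tier" ]; then
      echo "FAIL: $pkg is tier $tier, lot is tier $lot_tier"
      return 1
    fi
    if [ "$eco" != "$lot_eco" ]; then
      echo "FAIL: $pkg is $eco, lot is $lot_eco"
      return 1
    fi
  done
  if [ "$count" -gt "$max" ]; then
    echo "FAIL: $count updates, limit is $max"
    return 1
  fi
  echo "OK: $count updates, tier $lot_tier, $lot_eco"
}
```

Feeding it `printf 'date-fns 1 npm\nzod 1 npm\n' | validate_lot` prints `OK: 2 updates, tier 1, npm`; mixing in a Tier 2 package or a CocoaPods entry fails on the first offending line.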
Each lot gets its own isolated workspace using Antigravity 2.0's worktree support. With one worktree per agent, parallel runs never collide in the working tree.
```bash
# Create one worktree per lot and assign an agent to each
git worktree add ../app-deps-lot-01 -b deps/lot-01-tier0-npm
git worktree add ../app-deps-lot-02 -b deps/lot-02-tier1-npm

# In Antigravity, open each worktree as a project and instruct simply:
# "Process lot-01 according to deps-policy.json"
```
A quiet bonus of worktree isolation: discarding a failed lot means deleting a directory. No half-applied state ever touches the main working tree, which makes it much easier to let an agent experiment.
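Discarding a lot can be wrapped in a small helper. This is a sketch using standard `git worktree remove` and `git branch -D`; the function name is hypothetical, and `--force` is used on the assumption that a failed lot may contain uncommitted changes you do not want to keep.

```bash
# discard_lot <worktree-path> <branch>
# Throws away a failed lot: removes the worktree directory, then its branch.
# Hypothetical helper; run it from the main working tree.
discard_lot() {
  local path="$1" branch="$2"
  git worktree remove --force "$path" && git branch -D "$branch"
}
```

For the first lot created earlier, that would be `discard_lot ../app-deps-lot-01 deps/lot-01-tier0-npm`.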
The update playbook in AGENTS.md — the actual instructions
Instead of re-writing instructions in every prompt, the procedure lives permanently in AGENTS.md. Here is the dependency-update section verbatim.
```markdown
# Dependency Update Playbook

### Procedure
1. Read deps-policy.json and confirm the lot's tier and permitted action
2. Before updating, fetch and read each package's CHANGELOG / release notes
3. Apply the updates. The only files you may change are package.json and the lockfile
4. Run scripts/verify-deps.sh and record the result of every step
5. Whatever the outcome, write an update summary to reports/deps/ as Markdown
6. Commit only if the lot is Tier 0-1 and all checks passed (one lot = one commit)
7. Otherwise, stop with the changes intact and wait for a human decision

### Forbidden
- Applying major updates or anything Tier 3 (research and report only)
- Changing any file other than package.json / the lockfile (if code changes are needed, stop and report)
- Skipping verification steps or loosening thresholds
- Applying a package whose CHANGELOG cannot be retrieved (stop and report)

### Required fields in the update summary
- Package name and old → new version
- Summary of changes (three lines or fewer)
- Whether it affects this codebase, with the reasoning
- Verification results, with per-step timing
```
The single most effective line is the one that makes a CHANGELOG summary a mandatory deliverable. Human review changes from reading diffs to checking summaries, and a lot can be judged in a few minutes. Skip that requirement and review degenerates back into doing the whole job yourself, which defeats the handover.
The other critical line is "if code changes are needed, stop and report." If the agent is allowed to modify code to chase an update, the blast radius of the change becomes unreadable. An update that requires code changes is, by that very fact, Tier 2 work or above.
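The file restriction does not have to rest on the agent's good behavior; it can be checked mechanically before a commit is accepted. A minimal sketch, assuming the lockfile is `pnpm-lock.yaml` (an assumption based on the pnpm commands used in this setup) and that changed paths arrive on stdin. The function name is hypothetical.

```bash
# only_manifest_changes
# Reads changed paths on stdin, one per line, and fails if anything other
# than package.json or the lockfile was touched.
# Assumes the lockfile is pnpm-lock.yaml.
only_manifest_changes() {
  local path rc=0
  while read -r path; do
    case "$path" in
      ""|package.json|pnpm-lock.yaml)
        ;;   # allowed
      *)
        echo "FAIL: unexpected change in $path"
        rc=1
        ;;
    esac
  done
  return "$rc"
}
```

In a lot's worktree this could run as `git diff --name-only main...HEAD | only_manifest_changes`, where `main` stands in for whatever the base branch is called.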
The verification pipeline — a passing build is only half the answer
Verification is a fixed shell script, not something left to the agent's judgment. Asking an agent to "please run the tests and check" is nothing like gating on the exit code of a known script; the difference in reliability is enormous.
```bash
#!/usr/bin/env bash
# scripts/verify-deps.sh: verification for a dependency update lot
set -euo pipefail

echo "[1/5] install"
pnpm install --no-frozen-lockfile

echo "[2/5] typecheck"
pnpm tsc --noEmit

echo "[3/5] unit tests"
pnpm vitest run --reporter=dot

echo "[4/5] build + bundle size"
pnpm next build
NEW_KB=$(du -sk .next/static | cut -f1)
BASE_KB=$(cat .deps-baseline/bundle-kb 2>/dev/null || echo "$NEW_KB")
LIMIT=$((BASE_KB * 103 / 100)) # stop at +3% vs. baseline
if [ "$NEW_KB" -gt "$LIMIT" ]; then
  echo "FAIL: bundle size ${BASE_KB}KB -> ${NEW_KB}KB (over +3%)"
  exit 1
fi

echo "[5/5] smoke"
pnpm playwright test e2e/smoke --reporter=line

mkdir -p .deps-baseline && echo "$NEW_KB" > .deps-baseline/bundle-kb
echo "OK: all checks passed"
```
The bundle-size gate is there because size regressions are the classic failure that tests never catch. A patch update of a utility library once pulled in extra polyfills and inflated the build by about seven percent. Every test passed; without the threshold it would have shipped unnoticed.
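The threshold arithmetic is easy to try in isolation. This sketch pulls the same +3% integer calculation into a function; the function name and the sample sizes are made up for illustration.

```bash
# bundle_within_limit <baseline-kb> <new-kb>
# Same integer arithmetic as the gate in verify-deps.sh: fail above +3%.
bundle_within_limit() {
  local base_kb="$1" new_kb="$2"
  local limit_kb=$((base_kb * 103 / 100))   # integer division rounds down
  if [ "$new_kb" -gt "$limit_kb" ]; then
    echo "FAIL: ${base_kb}KB -> ${new_kb}KB (limit ${limit_kb}KB)"
    return 1
  fi
  echo "OK: ${base_kb}KB -> ${new_kb}KB (limit ${limit_kb}KB)"
}
```

With a 1000 KB baseline the limit is 1030 KB, so a seven percent jump to 1070 KB fails while 1030 KB still passes. One property of the script worth keeping in mind: the baseline is rewritten after every passing run, so several consecutive increases of just under three percent would each pass. If that drift matters, the baseline could be refreshed only on a deliberate schedule instead.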
For the same reason, the smoke suite includes one check that asserts output has not changed, not merely that features run. That one was added after a date library patch changed its default timezone behavior — tests green, yet every rendered date was off by a day. In my environment the five steps take about seven minutes per lot, so even three lots in parallel fit comfortably inside a thirty-minute overnight window.
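The "output has not changed" idea can be shown without a browser. This is a shell stand-in for the Playwright check, not the check itself: it compares a freshly produced file against a stored golden copy. The function name and the file paths are hypothetical.

```bash
# assert_unchanged <golden-file> <actual-file>
# Passes only if the two files are byte-for-byte identical.
# On failure, the diff goes to stderr so the verdict line stays clean.
assert_unchanged() {
  if cmp -s "$1" "$2"; then
    echo "OK: output matches $1"
  else
    echo "FAIL: output differs from $1"
    diff -u "$1" "$2" >&2 || true
    return 1
  fi
}
```

A golden file holding a handful of rendered dates would have caught the timezone incident: the features still ran, but the rendered text no longer matched.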
There is also a fixed response when verification fails: split the lot in half, re-run, identify the offending package, and record just that package as deferred. It is bisection in miniature — with a five-update cap, three splits are always enough.
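The halving procedure can be sketched as a small loop. It assumes exactly one package in the lot is at fault and that a verify command exists which takes a list of package names, applies only those updates, and returns non-zero on failure. Both the function name and that verify interface are hypothetical.

```bash
# find_offender <verify-command> <package>...
# Halves the list until one package remains. Assumes exactly one offender
# and that <verify-command> fails when given a list containing it.
find_offender() {
  local verify="$1" half left right i pkg
  shift
  while [ "$#" -gt 1 ]; do
    half=$(( $# / 2 ))
    left=""
    right=""
    i=0
    for pkg in "$@"; do
      i=$((i + 1))
      if [ "$i" -le "$half" ]; then
        left="$left $pkg"
      else
        right="$right $pkg"
      fi
    done
    # Unquoted on purpose: the lists are whitespace-separated package names.
    if "$verify" $left; then
      set -- $right   # left half is clean, so the offender is on the right
    else
      set -- $left
    fi
  done
  echo "$1"
}
```

For five packages the worst case is three verification runs (5 to 3 to 2 to 1), which matches the cap. If two packages fail independently, one pass finds only one of them, and the remainder would need another run.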
Rollback and the update ledger — move in units you can undo
The property that matters most in the long run is how cheap rollback is. As long as one lot equals one commit, recovery from a bad update is a single git revert. The moment several lots or a manual fix share one commit, rollback stops being mechanical and becomes thinking work again.
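A sketch of what "a single git revert" can look like when wrapped in a helper. It assumes each lot's commit subject follows a `deps(<lot>): ...` convention, which is an assumption for this example and not something the playbook prescribes; the function name is also hypothetical.

```bash
# rollback_lot <lot>
# Reverts the most recent commit whose message contains "deps(<lot>)".
# Assumes a "deps(<lot>): ..." commit subject convention.
rollback_lot() {
  local lot="$1" sha
  sha=$(git log -1 --format=%H -F --grep="deps(${lot})")
  if [ -z "$sha" ]; then
    echo "FAIL: no commit found for ${lot}"
    return 1
  fi
  git revert --no-edit "$sha"
}
```

One caveat with this lookup: the revert commit quotes the original subject, so calling `rollback_lot` twice for the same lot would revert the revert. Recording the reverted SHA in the ledger avoids relying on the message search a second time.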
The second quietly powerful habit is the update ledger. Adopted updates, deferred updates, and reverted updates all get a line of NDJSON.
```bash
# Appending to reports/deps/ledger.ndjson (part of the agent's deliverables)
echo '{"date":"2026-06-12","lot":"lot-01","pkg":"date-fns","from":"4.1.0","to":"4.1.2","tier":1,"decision":"adopted","verify":"pass","note":"confirmed no TZ default change"}' >> reports/deps/ledger.ndjson

# Monthly aggregation is one line of jq
jq -s 'group_by(.decision) | map({decision: .[0].decision, count: length})' reports/deps/ledger.ndjson
```
The reason to record deferrals is simple: without a record, you re-investigate the same update every single week. Write down once that "this minor moves peer requirements; handle it in the next Tier 2 slot," and the next lot assembly is just reading that line. The ledger is a memo for the human and, at the same time, input for the agent's next run.
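Reading the ledger back during lot assembly can be as small as the sketch below. The helper names, the `skipped` verify value, and the `example-ui-kit` package are hypothetical. The lookup uses plain `grep` because every record is one compact line in a fixed key order; a `jq` filter on `.decision` would be sturdier if the formatting ever varies. Notes containing double quotes would need escaping, which this sketch does not do.

```bash
# ledger_append <file> <date> <lot> <pkg> <from> <to> <tier> <decision> <verify> <note>
# Appends one NDJSON record in the same field order as the ledger lines above.
ledger_append() {
  local file="$1"
  shift
  printf '{"date":"%s","lot":"%s","pkg":"%s","from":"%s","to":"%s","tier":%s,"decision":"%s","verify":"%s","note":"%s"}\n' "$@" >> "$file"
}

# is_deferred <file> <pkg> <to-version>
# Succeeds if the ledger already holds a deferral for that exact update.
is_deferred() {
  grep -F "\"pkg\":\"$2\"" "$1" \
    | grep -F "\"to\":\"$3\"" \
    | grep -F '"decision":"deferred"' > /dev/null
}
```

During assembly, `is_deferred reports/deps/ledger.ndjson example-ui-kit 2.4.0` returning success means the update was already investigated, and the note on that line says why it is waiting.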
One month of real numbers, and what I still refuse to delegate
After a month of running this setup, here are the numbers from my environment. Roughly 60 update notifications per month; just under seventy percent were Tier 0-1 and processed automatically overnight. My own involvement is one weekly review of about forty minutes — the same work used to take half a day. Two rollbacks occurred during the month, and both were resolved with a single revert. More than any number, though, the biggest change is that the background fatigue of always being behind on updates simply disappeared.
On cost: the default Flash-class model is entirely sufficient for Tier 0-1 routine work, and I reserve the higher-end model for Tier 3 impact research. Antigravity 2.0's quota visibility has drawn fair criticism, but since splitting cheap models onto low-risk work and premium models onto research, parallel runs have stopped hitting limits altogether.
Some territory remains off-limits. Adoption decisions for major updates, final calls on updates that change license terms, and anything touching native app permissions or privacy manifests stay with me. As an indie developer, a wrong call in those areas lands directly on the business, and the recovery cost dwarfs whatever time the delegation saves. The agent's research reports make those decisions faster; the decisions themselves are not up for handover any time soon.
As a first step, write a deps-policy.json for your own repository and hand the agent nothing but Tier 0 for two weeks. Widen its discretion only after the ledger shows a clean track record — there is no rush. If you are facing the same mountain of update notifications, I hope this design saves you some weekends.