Designing Antigravity Retry for Production — Idempotency, Backoff, Cost Ceilings, and Failure Triage
A production blueprint for Antigravity agent retry: idempotency, exponential backoff, token-budget ceilings, failure triage, and resumable checkpoints.
If you have been using Antigravity for any length of time, you probably know the feeling of watching an agent fail, hitting Retry, and watching it fail at exactly the same spot. For a personal experiment, that is fine. For a client project, a shipped product, or an overnight scheduled task, that naive Retry can quietly turn into a real problem. Twenty failed attempts in a row means twenty times the token spend. Side-effecting actions such as sending email, charging cards, or writing to your database cost far more than the API bill if Retry fires them twice.
I have been on the painful side of this myself: I ran an agent job overnight and woke up to find the API spend was five times what I had budgeted. That experience taught me something I now believe is fundamental: Retry is not a feature you use, it is a system you design. This article lays out how I take Antigravity retry from the default button press to a production-grade architecture, built around four pillars: idempotency, exponential backoff with jitter, cost ceilings, and failure triage. Every section ships with code you can lift into your own project today.
When Retry Stops Being a Button and Starts Being a System
For most of us, Retry begins life as a UI affordance — the button on the Manager Surface, the CLI flag. That is the right tool while you are driving Antigravity interactively. The problem starts when you put that same Retry behavior behind an agent that runs while nobody is watching.
Production-grade Retry has four non-negotiable properties:
Idempotency: calling the same action with the same input any number of times produces exactly one side effect
Exponential backoff with jitter: retry intervals that do not make the outage worse
Cost ceilings: an explicit upper bound on tokens, API calls, or dollars per task
Failure triage: a clear rule for which errors should be retried, which should halt immediately, and which need a human
If any of these is missing, you should not be running the agent on a schedule. Think of it as driving onto the freeway without brakes. Once you have all four, overnight runs become boring, and boring is exactly what you want.
Idempotency — The Part Everyone Gets Wrong First
This is the most common blind spot in agent Retry design. "Just call the function again" sounds reasonable until you remember the agent writes to the outside world: files, APIs, payments, emails, database rows. The question we need to answer before every retry is not "can we call this again?" but "is there a cheap way to know this action has already been done?"
The pattern is straightforward. For each action that has a side effect, generate a unique idempotency key, check whether that key has already completed, and — if it has — skip the execution and return the cached result.
// src/retry/idempotency.ts
// Guards side-effecting agent actions with an idempotency key.
// Before executing, check if the key has already completed and reuse the cached result.
// This in-memory version never expires entries; the 24h TTL belongs to the durable store.
import { createHash } from "node:crypto";

type ExecutionResult<T> = {
  status: "completed" | "failed";
  data?: T;
  error?: string;
  completedAt: string;
};

export class IdempotencyGuard {
  // Use Redis or a KV store in production. In-memory is for learning only.
  private store = new Map<string, ExecutionResult<unknown>>();

  private makeKey(action: string, payload: unknown): string {
    const body = JSON.stringify({ action, payload });
    return createHash("sha256").update(body).digest("hex").slice(0, 32);
  }

  async run<T>(
    action: string,
    payload: unknown,
    execute: () => Promise<T>,
  ): Promise<T> {
    const key = this.makeKey(action, payload);
    const cached = this.store.get(key);
    if (cached?.status === "completed") {
      // Repeat calls return the cached result without re-running side effects.
      return cached.data as T;
    }
    try {
      const data = await execute();
      this.store.set(key, {
        status: "completed",
        data,
        completedAt: new Date().toISOString(),
      });
      return data;
    } catch (err) {
      // Record failures but do not pin the key, so the next call retries.
      this.store.set(key, {
        status: "failed",
        error: err instanceof Error ? err.message : String(err),
        completedAt: new Date().toISOString(),
      });
      throw err;
    }
  }
}

// Usage (`mailer` stands in for your own mail client)
const guard = new IdempotencyGuard();
await guard.run("send_welcome_email", { userId: "u_123" }, async () => {
  await mailer.send({ to: "user@example.com", template: "welcome" });
  return { queued: true };
});
// The first call sends the email; every subsequent call returns { queued: true } immediately.
The test that matters here: call the wrapped function twice with identical input and assert that the inner body runs exactly once. Without that test in place, idempotency is just a story you tell yourself.
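The call-twice test can be sketched in a few lines. This is a minimal illustration, not the article's test suite: `runOnce` is a hypothetical stand-in for `IdempotencyGuard.run`, and `countExecutions` is a helper invented here so the check is self-contained.

```typescript
// Hypothetical minimal guard, standing in for IdempotencyGuard from the article.
const completed = new Map<string, unknown>();

async function runOnce<T>(key: string, execute: () => Promise<T>): Promise<T> {
  if (completed.has(key)) return completed.get(key) as T;
  const data = await execute();
  completed.set(key, data);
  return data;
}

// Calls the guarded action twice and reports how often the inner body actually ran.
async function countExecutions(): Promise<number> {
  let calls = 0;
  const body = async () => {
    calls += 1; // stands in for the real side effect, e.g. sending an email
    return { queued: true };
  };
  await runOnce("send_welcome_email:u_123", body);
  await runOnce("send_welcome_email:u_123", body);
  return calls;
}
```

In a real test you would assert that `countExecutions()` resolves to 1; anything higher means the side effect fired twice.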
Why the Key Is "Action Name + Input Hash"
Using a task ID or a fresh UUID as the idempotency key is a trap. The agent will cheerfully re-submit the same intent under different IDs and your guard will treat each one as a new job. Hashing action name together with the essential input fields catches duplicate work at the semantic level, not the identifier level. This is the same model as Stripe's idempotency keys.
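One subtlety with hashing the input: plain `JSON.stringify` is sensitive to property order, so `{ a, b }` and `{ b, a }` would hash differently. The sketch below is one possible way to canonicalize the payload first. `stableStringify` and `idempotencyKey` are names invented for this example, and it assumes payloads are plain JSON-like values.

```typescript
import { createHash } from "node:crypto";

// Serialize with object keys sorted so property order does not change the output.
function stableStringify(value: unknown): string {
  if (value === undefined) return "null";
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  const obj = value as Record<string, unknown>;
  const entries = Object.keys(obj)
    .filter((k) => obj[k] !== undefined) // mirror JSON.stringify, which drops undefined
    .sort()
    .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`);
  return `{${entries.join(",")}}`;
}

// Action name plus canonical payload, hashed and truncated like makeKey in the guard.
function idempotencyKey(action: string, payload: unknown): string {
  const body = stableStringify({ action, payload });
  return createHash("sha256").update(body).digest("hex").slice(0, 32);
}
```

Pass only the fields that define the intent (recipient, template, amount), not timestamps or request IDs, or every retry will look like new work.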
Picking a Durable Store
In-memory stores evaporate on restart, which is fine while you are developing and catastrophic in production. On Cloudflare Workers, reach for KV. On Firebase, a single Firestore document with a TTL works well. On Redis, SET key value NX EX 86400 is the one-liner you want. Keep the TTL at 24 to 72 hours: any shorter and you reopen the duplicate window on retry, any longer and the store bloats with no benefit.
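To make the semantics of SET with NX and EX concrete without needing a running Redis, here is a small in-memory model of "claim the key only if nobody holds it, and let the claim expire". `ClaimStore` is a hypothetical class for illustration only; it is not a Redis client and does not survive restarts.

```typescript
// In-memory model of "set if absent, with expiry"; a teaching stand-in for SET NX EX.
class ClaimStore {
  private expiresAt = new Map<string, number>();

  // Returns true if the caller won the claim, false if a live claim already exists.
  // `now` is injectable so the expiry behavior can be tested without waiting.
  claim(key: string, ttlMs: number, now: number = Date.now()): boolean {
    const expiry = this.expiresAt.get(key);
    if (expiry !== undefined && expiry > now) return false;
    this.expiresAt.set(key, now + ttlMs);
    return true;
  }
}

const store = new ClaimStore();
const first = store.claim("send_welcome_email:u_123", 86_400_000, 0); // wins the claim
const second = store.claim("send_welcome_email:u_123", 86_400_000, 1_000); // rejected
```

The point of the single-command form in Redis is that the check and the write happen atomically; a separate GET followed by SET would let two workers both believe they won.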
Exponential Backoff with Jitter — Don't Retry Into the Storm
"Failed? Retry immediately." is a recipe for attacking your own dependencies during an outage. LLM APIs in particular respond to pressure with rate limits, and an immediate retry just widens the wound. The well-known fix is exponential backoff, ideally combined with jitter to prevent synchronized reconnect storms. AWS's classic write-up on this pattern, Exponential Backoff and Jitter, is still the clearest summary.
// src/retry/backoff.ts
// Exponential backoff + full jitter + max attempts + max delay.
// Apply this only to errors we have classified as retry-eligible.
import { shouldRetryFromTriage } from "./triage"; // used by the usage example; see "Failure Triage" below

export type RetryOptions = {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  shouldRetry: (err: unknown, attempt: number) => boolean;
  onRetry?: (err: unknown, attempt: number, delayMs: number) => void;
};

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

export async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === opts.maxAttempts || !opts.shouldRetry(err, attempt)) {
        throw err;
      }
      // Full jitter: random in [0, base * 2^attempt), capped by maxDelayMs.
      const expo = Math.min(opts.baseDelayMs * 2 ** attempt, opts.maxDelayMs);
      const delay = Math.floor(Math.random() * expo);
      opts.onRetry?.(err, attempt, delay);
      await sleep(delay);
    }
  }
  throw lastError;
}

// Usage
await retryWithBackoff(
  async () => {
    const r = await fetch("https://api.example.com/data");
    // fetch resolves on HTTP error statuses, so turn them into exceptions the loop can classify.
    if (!r.ok) throw new Error(`HTTP ${r.status} ${r.statusText}`);
    return r.json();
  },
  {
    maxAttempts: 5,
    baseDelayMs: 500,
    maxDelayMs: 30_000,
    shouldRetry: shouldRetryFromTriage,
    onRetry: (err, attempt, delay) => {
      console.log(`retry attempt=${attempt} delay=${delay}ms error=${String(err)}`);
    },
  },
);
The expected behavior is straightforward: success returns immediately, retry-eligible failures wait a jittered exponential delay and try again, and we throw once attempts are exhausted or the error becomes non-retryable. The onRetry hook is where you should wire logging and metrics — leaving it empty in production is how incidents become invisible.
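It helps to know the worst-case wait before you ship a set of options. The sketch below computes the upper bound of each jittered delay using the same formula as `retryWithBackoff`, for the example settings above. `backoffCeilings` is a helper name made up for this illustration.

```typescript
// Upper bound of the jittered delay before each retry, mirroring
// min(baseDelayMs * 2 ** attempt, maxDelayMs) from retryWithBackoff.
function backoffCeilings(maxAttempts: number, baseDelayMs: number, maxDelayMs: number): number[] {
  const ceilings: number[] = [];
  // The final attempt throws instead of sleeping, so there are maxAttempts - 1 waits.
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    ceilings.push(Math.min(baseDelayMs * 2 ** attempt, maxDelayMs));
  }
  return ceilings;
}

const ceilings = backoffCeilings(5, 500, 30_000); // [1000, 2000, 4000, 8000]
const worstCaseWaitMs = ceilings.reduce((sum, ms) => sum + ms, 0); // 15000
```

With full jitter the actual delay is drawn uniformly below each ceiling, so five attempts at these settings spend at most about 15 seconds sleeping, and on average about half that.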
Why I Reach for Full Jitter First
There are fancier jitter strategies, such as equal jitter and decorrelated jitter, and each has a fan base. For a personal Antigravity agent, full jitter (uniform random in [0, max]) is the right default because it is trivial to implement, spreads simultaneous retries across the widest window (which is what defuses a thundering herd), and is easy to reason about. Start there, measure, and only move to something more complex if you see a concrete problem the simpler approach cannot solve.
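For comparison, here are the three strategies side by side as pure functions. This is a sketch following the formulas in the AWS write-up as I understand them; the function names and the injectable `rng` parameter are choices made for this example so the behavior can be checked deterministically.

```typescript
type Rng = () => number; // returns a value in [0, 1), like Math.random

// Exponential ceiling shared by the attempt-based strategies.
const ceilingFor = (base: number, cap: number, attempt: number): number =>
  Math.min(cap, base * 2 ** attempt);

// Full jitter: uniform in [0, ceiling).
function fullJitter(base: number, cap: number, attempt: number, rng: Rng = Math.random): number {
  return Math.floor(rng() * ceilingFor(base, cap, attempt));
}

// Equal jitter: half the ceiling is guaranteed, the other half is random.
function equalJitter(base: number, cap: number, attempt: number, rng: Rng = Math.random): number {
  const half = ceilingFor(base, cap, attempt) / 2;
  return Math.floor(half + rng() * half);
}

// Decorrelated jitter: the next delay is drawn from [base, previousDelay * 3), then capped.
function decorrelatedJitter(base: number, cap: number, previousDelay: number, rng: Rng = Math.random): number {
  const upper = Math.max(base, previousDelay * 3);
  return Math.floor(Math.min(cap, base + rng() * (upper - base)));
}
```

Full jitter is the only one of the three that can return a delay near zero, which is exactly why it spreads load the most and also why some teams prefer equal jitter when a minimum pause matters.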
Cost Ceilings — Catch the Runaway Before It Drains Your Account
A runaway agent burning tokens overnight is not a hypothetical. I have lived through a "5× expected spend" morning and it was entirely my fault for not putting a ceiling in place. Because Antigravity agents call external LLM APIs, the right unit for the ceiling is not "number of retries" — it is "tokens spent."
The design is three ideas:
Declare a token budget per task up front
Track cumulative usage on every retry and abort on overrun
Trigger alerts and drop back to the last checkpoint when the ceiling hits
// src/retry/budget.ts
// Per-task token budget wired into the agent loop.
// Extract `usage` from each LLM response and abort when the total crosses the ceiling.
export class BudgetExceeded extends Error {
  constructor(
    public readonly used: number,
    public readonly budget: number,
  ) {
    super(`budget exceeded: used=${used} budget=${budget}`);
  }
}

export class TokenBudget {
  private used = 0;

  constructor(
    private readonly budget: number,
    private readonly onWarn?: (used: number, budget: number) => void,
  ) {}

  track(usage: { prompt_tokens: number; completion_tokens: number }) {
    this.used += usage.prompt_tokens + usage.completion_tokens;
    if (this.used > this.budget) {
      throw new BudgetExceeded(this.used, this.budget);
    }
    // Soft alert at 80%: hook into Slack, PagerDuty, or a metrics counter.
    if (this.used > this.budget * 0.8) {
      this.onWarn?.(this.used, this.budget);
    }
  }

  remaining(): number {
    return Math.max(0, this.budget - this.used);
  }
}

// Usage inside the agent loop
const budget = new TokenBudget(200_000, (used, total) => {
  console.warn(`budget at 80%: ${used}/${total}`);
});

for (const step of plannedSteps) {
  try {
    const response = await callLLM(step.prompt);
    budget.track(response.usage);
    applyResult(response);
  } catch (err) {
    if (err instanceof BudgetExceeded) {
      await notifyHuman("budget exceeded, human review required", { err });
      await saveCheckpoint(step);
      throw err; // stop the whole loop
    }
    throw err;
  }
}
Expected behavior: once usage crosses 80% of the ceiling a warning fires, once it crosses 100% the loop aborts, and a checkpoint plus alert reaches a human. Because usage is only known after a call returns, the worst-case blast radius is the budget plus the one call that crossed the line, rather than "as much as the credit card allows."
How to Size a Budget Sanely
If you are doing this for the first time, these rules of thumb work well:
Small tasks (single-file edits, doc updates): 10,000 to 30,000 tokens
Medium tasks (multi-file refactors): 50,000 to 150,000 tokens
Large tasks (feature scaffolds, migrations): 300,000 to 500,000 tokens
If you are tempted to go higher, that is usually a signal that the task is too big to hand to an agent in one shot. Either split it with Planning Mode or take it yourself. The splitting heuristic is covered in more depth in my piece on Antigravity Planning Mode strategies for large projects, which pairs nicely with the budgets above.
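These tiers are easy to encode so that nobody picks a ceiling by feel. The sketch below uses the upper end of each range above as the hard limit; the type and function names are invented for this example, and the numbers are the article's rules of thumb, not measurements.

```typescript
type TaskSize = "small" | "medium" | "large";

// Upper ends of the rule-of-thumb ranges, used as hard ceilings.
const TOKEN_BUDGETS: Record<TaskSize, number> = {
  small: 30_000,
  medium: 150_000,
  large: 500_000,
};

function budgetFor(size: TaskSize): number {
  return TOKEN_BUDGETS[size];
}

// True when an estimate exceeds even the "large" tier: a hint to split the task.
function shouldSplit(estimatedTokens: number): boolean {
  return estimatedTokens > TOKEN_BUDGETS.large;
}
```

A call such as `new TokenBudget(budgetFor("medium"))` then ties each task to a named tier instead of a magic number.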
Failure Triage — Not All Errors Deserve a Retry
The biggest cause of runaway retries is treating every failure identically. In reality, agent errors fall into three buckets:
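Transient: network blips, rate limits, timeouts. These almost always recover on retry
Permanent: auth failures, permission errors, validation failures, unknown model names. The same result every time
Ambiguous: internal server errors, malformed JSON, parse failures, and anything unrecognized. These may or may not recover, so they get exactly one retry before a human looks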
Encoding this triage into code makes the "should we retry" decision unambiguous.
// src/retry/triage.ts
// Classify an error as transient, permanent, or ambiguous.
// Grow this function over time: every new error shape you see should add a rule.
export type ErrorClass = "transient" | "permanent" | "ambiguous";

export function classifyError(err: unknown): ErrorClass {
  if (!(err instanceof Error)) return "ambiguous";
  const msg = err.message.toLowerCase();

  // --- Auth, permission, and input errors are permanent ---
  if (/unauthorized|forbidden|invalid_api_key|authentication/.test(msg)) {
    return "permanent";
  }
  if (/validation|invalid request|bad request|schema/.test(msg)) {
    return "permanent";
  }
  if (/model_not_found|unknown model/.test(msg)) {
    return "permanent";
  }

  // --- Rate limits and transient network errors are retryable ---
  if (/rate.?limit|too many requests|429/.test(msg)) {
    return "transient";
  }
  if (/timeout|econnreset|etimedout|network/.test(msg)) {
    return "transient";
  }
  if (/503|service unavailable|overloaded/.test(msg)) {
    return "transient";
  }

  // --- Ambiguous ones get exactly one retry ---
  if (/500|internal server|json|parse/.test(msg)) {
    return "ambiguous";
  }
  return "ambiguous";
}

// Plugging the triage into the retry loop's shouldRetry
export function shouldRetryFromTriage(err: unknown, attempt: number): boolean {
  const klass = classifyError(err);
  if (klass === "permanent") return false;
  if (klass === "transient") return attempt < 5;
  if (klass === "ambiguous") return attempt < 2; // single retry
  return false;
}
This function is never done. Every new failure mode is an excuse to add another regex or two. The reward for keeping it up to date is a triage tuned to your real production traffic — no generic AI-written classifier will match that.
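Message regexes are a pragmatic starting point, but when the client exposes an HTTP status code, classifying on that is usually less brittle than matching digits inside a message string. The following is a sketch of that alternative, not a replacement the article prescribes; `classifyStatus` is a hypothetical helper, and the mapping keeps 500 in the ambiguous bucket to match the rules above.

```typescript
type ErrorClass = "transient" | "permanent" | "ambiguous";

// Classify by HTTP status code instead of by message text.
function classifyStatus(status: number): ErrorClass {
  // Request timeout and rate limiting: the server is asking us to come back later.
  if (status === 408 || status === 429) return "transient";
  // Gateway and availability errors usually clear on their own.
  if (status === 502 || status === 503 || status === 504) return "transient";
  // A bare 500 could be a blip or a bug in our request: allow one retry.
  if (status === 500) return "ambiguous";
  // Every other 4xx means the request itself is wrong; retrying changes nothing.
  if (status >= 400 && status < 500) return "permanent";
  return "ambiguous";
}
```

A reasonable combination is to try the status code first and fall back to the message-based `classifyError` only when no status is available, such as for socket-level failures.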
Checkpoints — Retry Without Redoing the First 95 Steps
Even with good triage and a hard budget, losing a 100-step agent task on step 95 is a waste of time and money. Checkpoints fix this by persisting "we got this far" so retry can resume rather than restart. Antigravity's Manager Surface does some of this automatically, but any agent you build yourself needs explicit support.
Three design rules:
Persist "already-completed step IDs" after each success
On retry, start from the first incomplete step — never from zero
Side-effecting steps must still run through the idempotency guard above
// src/retry/checkpoint.ts
// Smallest useful "resumable agent" loop.
// Swap fs for KV, Firestore, or any durable store in production.
import fs from "node:fs/promises";
import path from "node:path";

type Step = { id: string; run: () => Promise<void> };

type Checkpoint = {
  completed: string[];
  updatedAt: string;
};

export async function runResumable(
  steps: Step[],
  checkpointPath: string,
): Promise<void> {
  // Make sure the checkpoint directory exists before the first write.
  await fs.mkdir(path.dirname(checkpointPath), { recursive: true });

  let cp: Checkpoint;
  try {
    cp = JSON.parse(await fs.readFile(checkpointPath, "utf-8"));
  } catch {
    // No readable checkpoint yet: start from the first step.
    cp = { completed: [], updatedAt: new Date().toISOString() };
  }

  for (const step of steps) {
    if (cp.completed.includes(step.id)) {
      console.log(`skip already-completed step: ${step.id}`);
      continue;
    }
    try {
      await step.run();
      cp.completed.push(step.id);
      cp.updatedAt = new Date().toISOString();
      await fs.writeFile(checkpointPath, JSON.stringify(cp, null, 2));
    } catch (err) {
      // Flush the checkpoint on failure so the next run resumes here.
      await fs.writeFile(checkpointPath, JSON.stringify(cp, null, 2));
      throw err;
    }
  }
}
After a failure, the next invocation should skip every completed step and pick up where things went wrong. For larger agent plans this is the single most valuable change you can make — retry cost approaches zero as the checkpoint matures. For a deeper walkthrough of the pattern inside Antigravity itself, see Antigravity Checkpoints and Rollback Mastery.
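One weakness of writing the checkpoint file in place is that a crash mid-write can leave truncated JSON, which the loader then treats as "no checkpoint" and restarts from zero. A common mitigation is to write to a temporary file and rename it over the target. The sketch below assumes a local filesystem where rename within the same directory is atomic (true on POSIX systems); the function names are invented for this example and it uses the synchronous fs API for brevity.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

type Checkpoint = { completed: string[]; updatedAt: string };

// Write to a sibling temp file, then rename over the target so readers
// see either the old checkpoint or the new one, never a half-written file.
function saveCheckpointAtomic(checkpointPath: string, cp: Checkpoint): void {
  fs.mkdirSync(path.dirname(checkpointPath), { recursive: true });
  const tmpPath = `${checkpointPath}.${process.pid}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(cp, null, 2));
  fs.renameSync(tmpPath, checkpointPath);
}

// Missing or unreadable files fall back to an empty checkpoint.
function loadCheckpoint(checkpointPath: string): Checkpoint {
  try {
    return JSON.parse(fs.readFileSync(checkpointPath, "utf-8")) as Checkpoint;
  } catch {
    return { completed: [], updatedAt: new Date().toISOString() };
  }
}
```

If you move checkpoints to KV or Firestore, the store's own write semantics replace this trick, so it only matters for the file-backed version.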
Observability — Retry in the Dark Is Retry You Cannot Trust
Retries run where nobody is watching, which makes observability non-optional. The minimum viable kit:
Alerts: budget exceeded, consecutive failures, any permanent classification
Distributed traces: one span per attempt, linked to the outer Antigravity run
You can get most of this from Langfuse, Helicone, or your own OpenTelemetry pipeline. For small personal projects I run Cloudflare Workers Analytics Engine plus a Slack Webhook and call it done. The goal is never a beautiful dashboard — it is being able to notice a runaway within five minutes of it starting. For a hands-on walkthrough of wiring this into Antigravity specifically, see Antigravity × Langfuse Observability Runbook.
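The "consecutive failures" alert is the easiest of these to build yourself. Here is a minimal sketch of a streak counter that fires once when a threshold is reached. `ConsecutiveFailureAlarm` and its callback are hypothetical; wire `onAlert` to whatever notification channel you already use.

```typescript
// Fires an alert when failures happen back to back without a success in between.
class ConsecutiveFailureAlarm {
  private streak = 0;

  constructor(
    private readonly threshold: number,
    private readonly onAlert: (streak: number) => void,
  ) {}

  recordSuccess(): void {
    this.streak = 0;
  }

  recordFailure(): void {
    this.streak += 1;
    // Fire once when the streak first reaches the threshold, not on every later failure.
    if (this.streak === this.threshold) this.onAlert(this.streak);
  }
}

// Usage: call recordFailure from onRetry and recordSuccess after each completed step.
const alerts: number[] = [];
const alarm = new ConsecutiveFailureAlarm(3, (streak) => alerts.push(streak));
```

Firing only on the transition keeps a long outage from flooding the channel, which is often what causes people to mute it.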
Team Hygiene — Write a RETRY_POLICY.md Before You Write the Retry
Even if you are a solo developer, "Retry by vibes" is a mistake. Drop a RETRY_POLICY.md at the root of the project and fill in five sections:
maxAttempts per task category
baseDelayMs, maxDelayMs, and jitter strategy
Token budgets per task size (small, medium, large)
Triage rules — which error messages count as permanent
Notification routing and human runbook for budget-exceeded or permanent errors
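If you want the policy document and the code to stay in sync, one option is to mirror the five sections in a typed config that the retry modules import. The sketch below is an assumption about how that could look: the task categories, channel name, and runbook path are placeholders, while the numeric values and patterns are the ones used elsewhere in this article.

```typescript
type JitterStrategy = "full" | "equal" | "decorrelated";

type RetryPolicy = {
  maxAttempts: Record<string, number>; // per task category
  backoff: { baseDelayMs: number; maxDelayMs: number; jitter: JitterStrategy };
  tokenBudgets: { small: number; medium: number; large: number };
  permanentPatterns: RegExp[]; // error messages that must never be retried
  notify: { channel: string; runbook: string };
};

const retryPolicy: RetryPolicy = {
  maxAttempts: { "doc-update": 3, refactor: 5 }, // placeholder categories
  backoff: { baseDelayMs: 500, maxDelayMs: 30_000, jitter: "full" },
  tokenBudgets: { small: 30_000, medium: 150_000, large: 500_000 },
  permanentPatterns: [
    /unauthorized|forbidden|invalid_api_key|authentication/,
    /validation|invalid request|bad request|schema/,
    /model_not_found|unknown model/,
  ],
  notify: { channel: "#agent-alerts", runbook: "docs/runbooks/retry.md" }, // placeholders
};

// True when the message matches any pattern the policy marks as permanent.
function isPermanent(message: string): boolean {
  const msg = message.toLowerCase();
  return retryPolicy.permanentPatterns.some((re) => re.test(msg));
}
```

The Markdown file stays the human-readable source of intent; the config is simply the part of it that the code can enforce.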
Four Failure Modes You Will Probably Hit
Even with all four pillars in place, there are four recurring traps I keep falling into. Calling them out may save you a night.
1. Forgetting the Idempotency Guard on a New Side Effect
The most common production incident. Adding a new side-effecting action — send_email, create_payment, post_to_slack — without wrapping it in the guard leads to duplicate events the moment a retry fires. A useful habit: every new action needs a unit test that calls it twice and asserts the inner body ran exactly once.
2. Retrying Permanent Errors Until the Budget Is Gone
Auth failures, permission errors, and invalid model names never recover from retries. If they are classified as transient, the agent happily keeps looping and you are billed for every attempt until the attempt limit or the budget stops it. Make sure classifyError has CI coverage on unauthorized, forbidden, and invalid_api_key and that shouldRetry returns false for them.
3. A Budget Breach That Never Pages Anyone
The subtle one. You wired onWarn to a Slack webhook that has since expired, or to a dev channel nobody watches. Every alert pipeline needs a heartbeat. A simple "always-warn" dummy task that fires every 24 hours is the cheapest way to verify the notification chain is still alive.
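The receiving side of that heartbeat is a simple staleness check: if the last confirmed delivery is older than the expected interval plus some slack, treat the alert chain as broken. The sketch below is illustrative only; `isHeartbeatStale` is a made-up helper, and the 25-hour default is an assumption chosen to tolerate a late daily run.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// True when the last confirmed alert delivery is older than the allowed gap.
// The default of 25 hours gives a daily heartbeat one hour of scheduling slack.
function isHeartbeatStale(
  lastDeliveredAt: Date,
  now: Date,
  maxGapMs: number = 25 * HOUR_MS,
): boolean {
  return now.getTime() - lastDeliveredAt.getTime() > maxGapMs;
}
```

Run the check from somewhere independent of the alert pipeline itself, otherwise a dead pipeline also silences the check that was meant to catch it.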
4. Forgetting That Partial Completions Are a Triage Category Too
This one took me embarrassingly long to notice. The agent runs, returns something that looks like a valid response, but actually only solved half the task — say, it wrote the new function but forgot to update the tests, or it deleted the old file without creating the new one. If you treat "valid-looking response" as "success," the retry loop never kicks in, and the failure propagates silently downstream. The fix is to add a lightweight post-condition check for every step — even a single assert on expected output shape — and treat a failed check as an ambiguous-class error. One extra line of validation per step is cheap insurance against this shape of mistake.
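A post-condition check can be as small as a predicate plus a dedicated error. The following sketch shows one possible shape; `PostconditionFailed` and `checkPostcondition` are names invented here. The error message is worded so that a message-based classifier like the one above would not match any permanent or transient rule and would fall through to ambiguous, assuming the step ID itself contains none of the matched keywords.

```typescript
// Thrown when a step returned normally but its output fails a sanity check.
class PostconditionFailed extends Error {
  constructor(public readonly stepId: string) {
    super(`postcondition failed for step ${stepId}`);
    this.name = "PostconditionFailed";
  }
}

// Returns the result unchanged when the check passes, throws otherwise.
function checkPostcondition<T>(stepId: string, result: T, check: (r: T) => boolean): T {
  if (!check(result)) throw new PostconditionFailed(stepId);
  return result;
}

// Usage: a step that claims to have updated tests must report at least one test file.
const ok = checkPostcondition("update-tests", { testFiles: 2 }, (r) => r.testFiles > 0);
```

Because the failure surfaces as a thrown error, the existing retry loop handles it without any special casing: one more attempt, then a human.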
A Walkthrough: Wiring the Pieces Together
It helps to see the whole system in one place, because the individual snippets above can feel disconnected. Here is a minimal end-to-end agent loop that uses all four pillars at once. I have deliberately kept it to what fits on a single screen.
// src/agents/resilient-loop.ts
// The four pillars composed: idempotency, backoff, budget, triage, plus checkpoints.
// `callLLM` stands in for your own LLM client wrapper.
import { IdempotencyGuard } from "../retry/idempotency";
import { retryWithBackoff } from "../retry/backoff";
import { TokenBudget, BudgetExceeded } from "../retry/budget";
import { shouldRetryFromTriage, classifyError } from "../retry/triage";
import { runResumable } from "../retry/checkpoint";

const guard = new IdempotencyGuard();
const budget = new TokenBudget(200_000, (used, total) => {
  console.warn(`budget at 80%: ${used}/${total}`);
});

type TaskStep = { id: string; action: string; payload: unknown };

async function runStep(step: TaskStep) {
  return retryWithBackoff(
    () =>
      guard.run(step.action, step.payload, async () => {
        const response = await callLLM({ action: step.action, payload: step.payload });
        budget.track(response.usage);
        return response;
      }),
    {
      maxAttempts: 5,
      baseDelayMs: 500,
      maxDelayMs: 30_000,
      shouldRetry: (err, attempt) => {
        // Permanent errors and budget breaches escape the retry loop.
        if (err instanceof BudgetExceeded) return false;
        if (classifyError(err) === "permanent") return false;
        return shouldRetryFromTriage(err, attempt);
      },
      onRetry: (err, attempt, delay) => {
        console.log(`[${step.id}] retry ${attempt} in ${delay}ms: ${String(err)}`);
      },
    },
  );
}

export async function runTask(taskId: string, steps: TaskStep[]) {
  const adapted = steps.map((s) => ({ id: s.id, run: () => runStep(s).then(() => {}) }));
  // The path must be stable across runs; a timestamped name would start a fresh
  // checkpoint every time and the task would never resume.
  await runResumable(adapted, `.agent/checkpoint-${taskId}.json`);
}
Reading this top to bottom, you can see each pillar in its own layer: the guard prevents double side effects, the backoff handles transient failures, the budget prevents cost runaway, the triage decides what is worth retrying, and the resumable wrapper lets us pick up exactly where we left off. No single piece is exotic. What makes it production-grade is that they are all present, and that a team reading the file can trace each decision back to an explicit rule.
How the System Behaves Under Real Failures
Spend a few minutes imagining the loop under concrete failure scenarios. A 429 rate limit arrives mid-task: triage says transient, backoff waits with jitter, the next attempt succeeds, the budget ticks up. An auth token has rotated: triage says permanent, the loop throws immediately, nothing is retried, a human wakes up to a single clean error. A runaway chain of "the model keeps producing slightly different malformed JSON": triage says ambiguous, one retry is allowed, then we give up — the budget never gets anywhere near the ceiling. A truly massive prompt accidentally gets committed: the budget ceiling trips, a checkpoint drops, a Slack alert fires, and the loop halts at the end of the step we were in. Every one of these scenarios is boring, predictable, and — most importantly — bounded.
What Changes Once Retry Is Designed, Not Improvised
There is a quiet benefit to all of this that I did not expect when I first started treating retry as a system. The act of writing the policy and the code forces you to think about your agent's blast radius. You start asking: what is the worst this action can do? Is my ceiling actually sized correctly for the job I am asking it to do? Do I actually trust this prompt enough to let it loose overnight? Those questions surface design issues long before they become incidents. In a real sense, the best bug you can find is the one you find while writing the retry policy, because that is the cheapest possible moment to fix it.
The second shift is psychological. Once I had the four pillars in place across my own projects, overnight agent runs stopped being a source of anxiety. I still check the logs in the morning, but I check them the way you check a thermostat — not the way you check a suspected leak. That change is worth the weekend it takes to implement, several times over.
Four pillars, a handful of code snippets, and a short policy document. That is the whole retry system. If you only adopt one of them today, pick the idempotency guard and wrap a single side-effecting action in your Antigravity agent with it. The duplicate-email incident — the one that hurts the most when it happens — becomes dramatically less likely. From there, the rest of the architecture layers on naturally: backoff, budget, triage, checkpoint. In a week or two of incremental work, your agent reliability will not feel the same, and the retry policy you wrote on day one becomes the living document your future self and your teammates rely on every time you extend the agent's responsibilities.
Antigravity, designed properly, is an agent you can genuinely leave running overnight. The steering is still our responsibility, but once the retry architecture is in place it becomes quiet, patient, and — most importantly — boring in the best way. Boring is what reliability looks like. Boring is what sleep looks like. And in a field where "the agent did something unexpected" still makes headlines, boring is a competitive advantage worth engineering for.
If you take away nothing else, take away this: treat every Retry button press as a signal that something in your design is missing, and fix the design rather than the symptom. The button is for debugging; the four pillars are for production. Once you have internalized that distinction, the rest of this article will feel less like a recipe and more like a checklist you already knew you needed.