Agents & Manager / 2026-06-17 / Advanced

Making Managed Agent Batches Safe to Re-run: Idempotency and Checkpoints

Running overnight batches on the Antigravity 2.0 Managed Agents API makes recovery from partial failure unavoidable. Starting from a duplicate-post incident, I share the implementation of idempotency keys, a checkpoint store, and resume logic, with real numbers from solo operations.

Tags: antigravity, multi-agent, idempotency, batch, operations, managed-agents, reliability


At 2 a.m., my article-generation batch died midway through its third job when the Managed Agent session timed out. I noticed the next morning and re-ran the same batch without thinking. The first two jobs were posted twice, and the site hit a slug collision. The cause was trivial: my batch did not record anywhere how far it had gotten.

The Antigravity 2.0 Managed Agents API lets an agent plan, run code, and manipulate files autonomously inside a sandbox. Running long tasks without tying up my own machine is a real advantage. But because the work runs out of my hands, I have to assume partial failure as an everyday event. As an indie developer running several apps and several sites alone, I cannot do without overnight unattended batches. That is exactly why a design that is safe to re-run mattered more to me than any flashy feature.

Thinking in "success or failure" guarantees breakage

While you drive an agent interactively in local mode, you are watching when it stops. You can see where it halted and tell it to continue. An unattended Managed Agent run has no human watching it.

Unattended batches do not fail only because of agent bugs. Session timeouts, rate limits, transient network drops, and sandbox restarts are interruptions that are not your fault, and they happen routinely. A batch does not converge to "all succeeded" or "all failed." The normal stopping state is "succeeded through job 3 of N, then interrupted."

If you design the re-run as "start over from scratch," already-completed work runs again. For side effects that reach the outside world — posting, billing, sending email — that is an incident waiting to happen.
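
The failure mode is easy to reproduce in a few lines. The following is a minimal simulation, not the actual batch: `post`, `runNaive`, and the slugs are hypothetical stand-ins, and the "external side effect" is just an in-memory array.

```typescript
// Hypothetical stand-in for an external side effect such as posting an article.
const posted: string[] = [];

function post(slug: string): void {
  posted.push(slug);
}

// Naive batch: always starts from the first job and keeps no record of progress.
// `failAt` simulates an interruption (timeout, rate limit) before that index runs.
function runNaive(slugs: string[], failAt?: number): void {
  slugs.forEach((slug, i) => {
    if (i === failAt) throw new Error(`interrupted at job ${i + 1}`);
    post(slug);
  });
}

const slugs = ["idempotency-keys", "checkpoints", "resume-logic", "alerting"];

try {
  runNaive(slugs, 2); // dies on the third job
} catch {
  // nobody is watching an unattended run, and nothing records where it stopped
}

runNaive(slugs); // the re-run "from scratch" the next morning

// Count how many times each slug reached the outside world.
const counts = new Map<string, number>();
for (const slug of posted) counts.set(slug, (counts.get(slug) ?? 0) + 1);

console.log(Object.fromEntries(counts));
// → { 'idempotency-keys': 2, checkpoints: 2, 'resume-logic': 1, alerting: 1 }
```

The first two jobs reach the outside world twice, which is the same shape as the duplicate-post incident described at the start.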

Derive the idempotency key from a natural key

The first step is to give every unit of work an idempotency key: an identifier that guarantees "the same input converges to one result no matter how many times you run it."

My first mistake was using a random UUID as the key. A new UUID on each re-run meant no idempotency at all. The key must be a natural key derived deterministically from the input.

```typescript
import { createHash } from "node:crypto";

// The minimal input that identifies the work
interface Job {
  site: string;       // "antigravitylab"
  category: string;   // "agents"
  slug: string;       // article slug
}

// Idempotency key derived deterministically from input.
// The same Job always yields the same key.
function idempotencyKey(job: Job): string {
  const canonical = `${job.site}:${job.category}:${job.slug}`;
  return createHash("sha256").update(canonical).digest("hex").slice(0, 32);
}
```

The point is to exclude "when it ran" and "which attempt" from the key material entirely. The moment you mix in a timestamp or retry count, a re-run produces a different key and idempotency collapses. The key is built only from "what to process."
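
To make the warning about timestamps and retry counts concrete, here is a minimal sketch that contrasts the two kinds of key and then uses the stable one to skip finished work on a re-run. It repeats the key derivation so the block runs on its own. `attemptScopedKey`, `runBatch`, and the in-memory `done` set are hypothetical names for illustration only; a real checkpoint store would have to persist across process restarts.

```typescript
import { createHash } from "node:crypto";

interface Job {
  site: string;
  category: string;
  slug: string;
}

// Same derivation as above: only "what to process" is hashed.
function idempotencyKey(job: Job): string {
  const canonical = `${job.site}:${job.category}:${job.slug}`;
  return createHash("sha256").update(canonical).digest("hex").slice(0, 32);
}

// Anti-pattern, shown only for contrast: the attempt number leaks into the key.
function attemptScopedKey(job: Job, attempt: number): string {
  const canonical = `${job.site}:${job.category}:${job.slug}:${attempt}`;
  return createHash("sha256").update(canonical).digest("hex").slice(0, 32);
}

// Hypothetical resume loop. `done` stands in for a persistent checkpoint store.
function runBatch(jobs: Job[], done: Set<string>, post: (job: Job) => void): number {
  let processed = 0;
  for (const job of jobs) {
    const key = idempotencyKey(job);
    if (done.has(key)) continue; // completed in an earlier run: skip the side effect
    post(job);                   // may throw and abort the batch
    done.add(key);               // recorded only after the side effect succeeded
    processed++;
  }
  return processed;
}

const first: Job = { site: "antigravitylab", category: "agents", slug: "first-post" };
const jobs: Job[] = [
  first,
  { ...first, slug: "second-post" },
  { ...first, slug: "third-post" },
];

const stableAcrossRuns = idempotencyKey(first) === idempotencyKey({ ...first });
const brokenByAttempt = attemptScopedKey(first, 1) !== attemptScopedKey(first, 2);

const done = new Set<string>();
const posts: string[] = [];
let interruptOnce = true; // simulate a single timeout on the third job

const post = (job: Job): void => {
  if (job.slug === "third-post" && interruptOnce) {
    interruptOnce = false;
    throw new Error("session timed out");
  }
  posts.push(job.slug);
};

try {
  runBatch(jobs, done, post); // first run: two jobs complete, then the interruption
} catch {
  // an unattended run simply stops here
}
const resumed = runBatch(jobs, done, post); // re-run: only the unfinished job is posted

console.log({ stableAcrossRuns, brokenByAttempt, resumed, posts });
```

In this sketch the re-run processes one job, and each slug appears in `posts` exactly once. One gap remains: a crash between `post(job)` and `done.add(key)` would still repeat that single job, so closing it would likely require the receiving side to deduplicate on the same key as well.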

WHAT YOU'LL LEARN
Get the full TypeScript implementation of an idempotency key and checkpoint store that never double-posts on re-run
Learn the resume logic that reprocesses only unfinished jobs, and the locking pitfalls that prevent duplicate batch starts
See the retry and alerting rules that cut overnight batch failure rate from about 12% to 0.4%