Making Managed Agent Batches Safe to Re-run: Idempotency and Checkpoints

Running overnight batches on the Antigravity 2.0 Managed Agents API makes recovery from partial failure unavoidable. Starting from a duplicate-post incident, I share the implementation of idempotency keys, a checkpoint store, and resume logic, with real numbers from solo operations.



At 2 a.m. my article-generation batch died midway through the third job when the Managed Agent session timed out. I noticed the next morning and re-ran the same batch without thinking. The first two jobs were posted twice, and the site hit a slug collision. The cause was trivial: the batch kept no record of how far it had gotten.

The Antigravity 2.0 Managed Agents API lets an agent plan, run code, and manipulate files autonomously inside a sandbox. Running long tasks without tying up my own machine is a real advantage. But because the work runs out of your sight, partial failure has to be assumed as an everyday event. I am an indie developer running several apps and several sites alone, so unattended overnight batches are essential to me. That is exactly why a design that is safe to re-run mattered more to me than any flashy feature.

Thinking in "success or failure" guarantees breakage

While you drive an agent interactively in local mode, you are watching when it stops. You can see where it halted and tell it to continue. Unattended Managed Agent runs have no such watching human.

Unattended batches do not fail only because of agent bugs. Session timeouts, rate limits, transient network drops, sandbox restarts — interruptions that are not your fault happen routinely. A batch does not converge to "all succeeded" or "all failed." The normal stopping state is "succeeded through job 3 of N, then interrupted."

If you design the re-run as "start over from scratch," already-completed work runs again. For side effects that reach the outside world — posting, billing, sending email — that is an incident waiting to happen.

Derive the idempotency key from a natural key

The first step is to give every unit of work an idempotency key: an identifier that guarantees "the same input converges to one result no matter how many times you run it."

My first mistake was using a random UUID as the key. A new UUID on each re-run meant no idempotency at all. The key must be a natural key derived deterministically from the input.

import { createHash } from "node:crypto";
 
// The minimal input that identifies the work
interface Job {
  site: string;       // "antigravitylab"
  category: string;   // "agents"
  slug: string;       // article slug
}
 
// Idempotency key derived deterministically from input.
// The same Job always yields the same key.
function idempotencyKey(job: Job): string {
  const canonical = `${job.site}:${job.category}:${job.slug}`;
  return createHash("sha256").update(canonical).digest("hex").slice(0, 32);
}

The point is to exclude "when it ran" and "which attempt" from the key material entirely. The moment you mix in a timestamp or retry count, a re-run produces a different key and idempotency collapses. The key is built only from "what to process."
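
To make the difference concrete, here is a minimal sketch that compares the key above with a deliberately broken variant. The function `unstableKey` and the sample job values are hypothetical, added only for contrast.

```typescript
import { createHash } from "node:crypto";

interface Job {
  site: string;
  category: string;
  slug: string;
}

// Same derivation as above: only "what to process" goes into the hash.
function idempotencyKey(job: Job): string {
  const canonical = `${job.site}:${job.category}:${job.slug}`;
  return createHash("sha256").update(canonical).digest("hex").slice(0, 32);
}

// Anti-pattern (hypothetical, for contrast): the start time leaks into the key.
function unstableKey(job: Job, startedAtMs: number): string {
  const canonical = `${job.site}:${job.category}:${job.slug}:${startedAtMs}`;
  return createHash("sha256").update(canonical).digest("hex").slice(0, 32);
}

const job: Job = { site: "antigravitylab", category: "agents", slug: "safe-reruns" };

// Two "runs" of the same job, one minute apart.
const stableAcrossRuns = idempotencyKey(job) === idempotencyKey({ ...job });
const unstableAcrossRuns =
  unstableKey(job, 1_700_000_000_000) === unstableKey(job, 1_700_000_060_000);

console.log({ stableAcrossRuns, unstableAcrossRuns });
```

The first comparison holds and the second does not, which is the whole argument in two lines. One assumption worth checking in your own data: the `:` delimiter is only unambiguous if no field can itself contain a colon.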

Push progress outside the process with a checkpoint store

Once the key is fixed, record whether each job is pending, running, done, or failed — outside the batch process. Held in process memory, that state vanishes the instant the process dies.

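I use Cloudflare KV, but SQLite or Redis work just as well. What matters is that each state write lands atomically and that completion is recorded only after the side effect has happened. One caveat with KV: it is eventually consistent, so a write can take a short while to become visible to readers in other locations.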

type JobState = "pending" | "running" | "done" | "failed";
 
interface CheckpointStore {
  get(key: string): Promise<JobState | null>;
  set(key: string, state: JobState, ttlSec?: number): Promise<void>;
}
 
class KvCheckpointStore implements CheckpointStore {
  constructor(private kv: KVNamespace) {}
 
  async get(key: string): Promise<JobState | null> {
    return (await this.kv.get(`ckpt:${key}`)) as JobState | null;
  }
 
  async set(key: string, state: JobState, ttlSec = 60 * 60 * 24 * 30): Promise<void> {
    await this.kv.put(`ckpt:${key}`, state, { expirationTtl: ttlSec });
  }
}

The ordering of the completion write is a small detail, but a decisive one. If you write done first and then post, a crash before posting leaves only done, and that job is skipped forever. If you post first and then write done, a crash after posting but before done just re-attempts the post on the next run. Choose the latter, and make the post itself tolerate duplicates via the idempotency key.
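
What "tolerate duplicates" means on the receiving side can be sketched with an in-memory stand-in. `FakePublisher` and its `publish` method are hypothetical; a real target would more likely enforce the same rule with a unique constraint on the key or slug.

```typescript
interface Article {
  slug: string;
  body: string;
}

// Hypothetical in-memory stand-in for the publishing target.
class FakePublisher {
  private byKey = new Map<string, Article>();

  // Returns "created" the first time a key is seen, "duplicate" afterwards.
  // A duplicate is treated as success, not as an error.
  publish(key: string, article: Article): "created" | "duplicate" {
    if (this.byKey.has(key)) return "duplicate";
    this.byKey.set(key, article);
    return "created";
  }

  count(): number {
    return this.byKey.size;
  }
}

const publisher = new FakePublisher();
const article: Article = { slug: "safe-reruns", body: "..." };

// Run 1: the post succeeds, then the process dies before writing "done".
const firstAttempt = publisher.publish("key-abc", article);
// Run 2: the checkpoint is not "done", so the post is attempted again.
const secondAttempt = publisher.publish("key-abc", article);
```

After both attempts the publisher holds exactly one article, so the crash window between posting and writing done costs a wasted call rather than a double post.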

Resume logic that reprocesses only unfinished work

With this in place, a re-run is no longer "start over" but "pick up only the jobs that didn't finish."

async function runBatch(
  jobs: Job[],
  store: CheckpointStore,
  process: (job: Job, key: string) => Promise<void>,
): Promise<{ done: number; skipped: number; failed: number }> {
  let done = 0, skipped = 0, failed = 0;
 
  for (const job of jobs) {
    const key = idempotencyKey(job);
    const state = await store.get(key);
 
    // Silently skip completed work — this is what makes re-runs safe
    if (state === "done") { skipped++; continue; }
 
    await store.set(key, "running");
    try {
      await process(job, key);        // the real side-effecting work
      await store.set(key, "done");   // commit completion AFTER the side effect
      done++;
    } catch (err) {
      await store.set(key, "failed");
      failed++;
      // Don't stop the whole batch on one failure; the next resume picks up what's left
      console.error(`job failed: ${key}`, err);
    }
  }
  return { done, skipped, failed };
}

No matter how often the batch dies, each re-run shrinks the remainder. Die after finishing job 3 of 10, and the next run picks up only the remaining 7; die again, and it picks up whatever is left. As long as every job can eventually succeed, the batch converges to all done.
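
The shrinking remainder can be checked at your desk with a small simulation. This is a synchronous simplification of the resume loop, not the production code: the checkpoint store is a plain `Map`, and `dieAtCall` is a hypothetical knob that kills the run while it is processing its N-th unfinished job.

```typescript
type JobState = "pending" | "running" | "done" | "failed";

// In-memory stand-in for the checkpoint store; it outlives each simulated run.
const checkpoints = new Map<string, JobState>();
// Every side effect that actually happened, in order.
const sideEffects: string[] = [];

// One simulated run. Returns how many jobs are still not "done" afterwards.
function simulatedRun(keys: string[], dieAtCall: number | null): number {
  let calls = 0;
  for (const key of keys) {
    if (checkpoints.get(key) === "done") continue; // skip completed work
    checkpoints.set(key, "running");
    calls++;
    if (dieAtCall !== null && calls === dieAtCall) break; // process killed here
    sideEffects.push(key);        // the side effect
    checkpoints.set(key, "done"); // commit completion after it
  }
  return keys.filter((k) => checkpoints.get(k) !== "done").length;
}

const keys = Array.from({ length: 10 }, (_, i) => `job-${i + 1}`);

const remaining = [
  simulatedRun(keys, 4),    // finishes jobs 1-3, dies during job 4
  simulatedRun(keys, 5),    // finishes four more, dies during the fifth
  simulatedRun(keys, null), // runs to the end
];

console.log(remaining, sideEffects.length);
```

With these inputs the remainder goes 7, then 3, then 0, and each of the ten side effects happens exactly once. The simulation kills the run before the side effect; a crash between the side effect and the done write is the case the duplicate-tolerant post has to absorb.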

Stop concurrent starts with a lock

The next trap in unattended operation is a scheduler misconfiguration or a manual re-run overlapping, so two copies of the same batch run at once. Checkpoints alone leave a window where both grab the same pending state and double-process.

Declare "someone is running" with a short-lived lock key.

async function withBatchLock<T>(
  kv: KVNamespace,
  lockName: string,
  fn: () => Promise<T>,
): Promise<T | null> {
  const token = crypto.randomUUID();
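  // Best-effort check: KV has no compare-and-set, so two starts that read at the
  // same moment can both see "no lock". This guards against ordinary overlap,
  // not against a true simultaneous start.
  const existing = await kv.get(`lock:${lockName}`);
  if (existing) return null;   // someone is already running -> stand down this time
 
  // Auto-expires in 10 minutes so a crash doesn't leave the lock forever.
  // Keep the TTL longer than the longest normal run, or it expires mid-batch.
  await kv.put(`lock:${lockName}`, token, { expirationTtl: 600 });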
  try {
    return await fn();
  } finally {
    const cur = await kv.get(`lock:${lockName}`);
    if (cur === token) await kv.delete(`lock:${lockName}`);  // release only our own lock
  }
}

Always give the lock a TTL. Without one, a lock left behind when the process dies before reaching finally persists forever and silently blocks every later batch. I once failed to notice for a full day that a batch hadn't run because of this, and have used auto-expiring locks ever since.
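
The expiry and ownership rules are easier to reason about in isolation. Below is a minimal in-memory sketch with an injected clock; `ExpiringLocks` is a hypothetical class, not part of any KV API. Because it runs in a single thread its check-and-set is atomic, which a get-then-put against a remote store is not.

```typescript
// In-memory stand-in for a key-value store with per-key expiry.
// `now` (milliseconds) is injected so expiry can be exercised without waiting.
class ExpiringLocks {
  private held = new Map<string, { token: string; expiresAt: number }>();

  // Returns a token if the lock was acquired, or null if someone still holds it.
  tryAcquire(name: string, ttlMs: number, now: number): string | null {
    const cur = this.held.get(name);
    if (cur && cur.expiresAt > now) return null;
    const token = `${name}-${now}-${Math.random().toString(36).slice(2)}`;
    this.held.set(name, { token, expiresAt: now + ttlMs });
    return token;
  }

  // Releases only if the caller still owns the lock.
  release(name: string, token: string): boolean {
    const cur = this.held.get(name);
    if (!cur || cur.token !== token) return false;
    this.held.delete(name);
    return true;
  }
}

const locks = new ExpiringLocks();
const TEN_MINUTES = 10 * 60 * 1000;

// The first start acquires the lock.
const first = locks.tryAcquire("nightly", TEN_MINUTES, 0);
// An overlapping start one minute later is turned away.
const overlap = locks.tryAcquire("nightly", TEN_MINUTES, 60_000);
// The first holder crashed and never released; after the TTL the lock is free again.
const afterExpiry = locks.tryAcquire("nightly", TEN_MINUTES, TEN_MINUTES + 1);
// A late release by the crashed holder must not remove the new owner's lock.
const staleRelease = first !== null && locks.release("nightly", first);
```

The last line is why the release compares tokens: without that check, a slow or revived process could delete a lock that now belongs to someone else.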

Pitfalls from production and the operating rules

Even after the design settled, unattended operation surfaced a few more problems. Here are the rules that actually helped.

First, don't leave failed jobs unattended. If a failure is stuck, every re-run fails on the same job and only the logs grow. I added a cap: if the same key fails three times in a row, take it out of the retry loop (recorded as given up, not as a success) and notify a human. Infinite retries are harmful in unattended operation.

Second, set the checkpoint TTL comfortably longer than the processing cycle. I use 30 days; shorter than that, and a month-end bulk re-run finds old done entries expired and double-processes.

Third, emit a one-line summary of done/skipped/failed counts every run, and alert only on runs where failed is non-zero. Quiet on success, loud only on anomaly — that is what lets you trust an unattended pipeline.
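
The first and third rules reduce to a few small functions. This is a sketch under assumptions: the counter lives in a `Map` here, whereas in practice it would sit next to the checkpoint, and the names `recordFailure`, `recordSuccess`, `summaryLine`, and `shouldAlert` are hypothetical.

```typescript
type RetryDecision = "retry" | "give_up";

const MAX_CONSECUTIVE_FAILURES = 3;

// In-memory stand-in for a persisted per-key failure counter.
const failureCounts = new Map<string, number>();

// Call after a job fails. Says whether the next run should try it again.
function recordFailure(key: string): RetryDecision {
  const count = (failureCounts.get(key) ?? 0) + 1;
  failureCounts.set(key, count);
  return count >= MAX_CONSECUTIVE_FAILURES ? "give_up" : "retry";
}

// Call after a job succeeds, so that only consecutive failures count.
function recordSuccess(key: string): void {
  failureCounts.delete(key);
}

interface BatchResult {
  done: number;
  skipped: number;
  failed: number;
}

// One line per run, stable enough to grep.
function summaryLine(batch: string, r: BatchResult): string {
  return `batch=${batch} done=${r.done} skipped=${r.skipped} failed=${r.failed}`;
}

// Quiet on success, loud only on anomaly.
function shouldAlert(r: BatchResult): boolean {
  return r.failed > 0;
}
```

What the caller does on `give_up` (which state to write, which channel to notify) is left open here, since that depends on the store and the alerting tool in use.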

After these changes, the overnight batch failure rate (counting restarts from interruption) dropped from about 12% to 0.4%. The remaining 0.4% does almost no harm, because re-running is safe.

Trusting an agent that runs unattended takes a "won't break when it falls" foundation before it takes a clever agent. Add one idempotency key and one checkpoint to your own batch, and confirm at your desk that re-running causes no double-processing. The confidence in your operations changes from there.
