Making My Managed Agents Batch Survive a Crash Without Redoing Everything
Running a 200-item batch on the Managed Agents API kept torching tokens, because every mid-run failure restarted from item one. Here is the checkpoint-and-idempotency design I added so the batch resumes from where it died.
The first thing that bit me after I started running real volume through the Managed Agents API was this: when a single item failed midway, it dragged the already-finished items down with it and made me redo everything.
I had a batch that asked an agent to reshape 200 article-metadata records one by one. On item 137, the API returned a 503. The script threw, stopped, and I reflexively re-ran python batch.py. It started again from item one. The inference cost for the first 136 items was simply paid twice.
Because Managed Agents finish execution on the cloud side, the in-flight state lives in the service rather than in your local process. Unlike a hands-on CLI agent, you have to design the resume path yourself, and if you do not, these little losses quietly turn into cost. This is the record of adding that resume path piece by piece. The code reflects public-preview behavior as of June 14, 2026.
What was actually causing the redo
A naive batch usually looks like this.
    import os
    from google import genai

    client = genai.Client(api_key=os.environ["YOUR_GEMINI_API_KEY"])

    def run_batch(items):
        results = []
        for item in items:
            op = client.agents.run(
                agent="managed-default",
                input=item["payload"],
            )
            result = poll_until_done(op)  # poll until the run completes
            results.append(result)
        return results
From a resume standpoint, this code has three holes.
First, progress lives only in the in-memory results. If the process dies, the record of how far you got dies with it.
Second, the client.agents.run() call carries no identifier. Throw the same item twice and the service obediently runs it twice. This is where cloud execution differs decisively from local.
Third, it does not distinguish failure types. A transient 503 and a permanent 400 from a malformed input both surface as the same exception and stop the whole run. The former wants wait-and-retry; the latter wants skip-and-record. They deserve different handling.
Move the checkpoint outside the process
First, push progress out of the process. You do not need anything elaborate; as an indie developer running this on a single machine, a lone SQLite file was plenty.
    import json
    import sqlite3
    import time

    class Checkpoint:
        def __init__(self, path="batch_state.db"):
            self.db = sqlite3.connect(path)
            self.db.execute("""
                CREATE TABLE IF NOT EXISTS items (
                    key TEXT PRIMARY KEY,
                    status TEXT NOT NULL,  -- pending / claimed / done / failed
                    op_name TEXT,          -- service-side run ID
                    result TEXT,
                    updated_at REAL
                )
            """)
            self.db.commit()

        def seed(self, items):
            # INSERT OR IGNORE keeps existing rows, so re-seeding never resets a done item.
            for it in items:
                self.db.execute(
                    "INSERT OR IGNORE INTO items(key, status, updated_at) VALUES (?, 'pending', ?)",
                    (it["key"], time.time()),
                )
            self.db.commit()

        def pending_keys(self):
            # claimed items are included so that a crash mid-item is picked up again.
            cur = self.db.execute(
                "SELECT key FROM items WHERE status IN ('pending', 'claimed')"
            )
            return [row[0] for row in cur.fetchall()]

        def set(self, key, status, op_name=None, result=None):
            # COALESCE keeps the stored op_name / result when the argument is None.
            # "is not None" so that an empty but valid result is still saved.
            self.db.execute(
                "UPDATE items SET status=?, op_name=COALESCE(?, op_name), "
                "result=COALESCE(?, result), updated_at=? WHERE key=?",
                (status, op_name,
                 json.dumps(result) if result is not None else None,
                 time.time(), key),
            )
            self.db.commit()
The key move is the four status values: pending / claimed / done / failed. A done item is always skipped on re-run, so double payment stops right there. On resume you only process what pending_keys() returns, so a crash on item 137 means the next run starts from item 137.
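To make the skip behavior concrete, here is a minimal, self-contained sketch that mirrors the items table in an in-memory SQLite database. The helper names make_db, seed, and pending are illustrative stand-ins for the Checkpoint methods, and the three keys are made-up sample data.

```python
import sqlite3
import time

def make_db():
    # In-memory stand-in for batch_state.db, trimmed to the columns this demo needs.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE items (key TEXT PRIMARY KEY, status TEXT NOT NULL, updated_at REAL)"
    )
    return db

def seed(db, keys):
    # INSERT OR IGNORE leaves existing rows, and their status, untouched.
    for key in keys:
        db.execute(
            "INSERT OR IGNORE INTO items(key, status, updated_at) VALUES (?, 'pending', ?)",
            (key, time.time()),
        )
    db.commit()

def pending(db):
    cur = db.execute(
        "SELECT key FROM items WHERE status IN ('pending', 'claimed') ORDER BY key"
    )
    return [row[0] for row in cur.fetchall()]

db = make_db()
keys = ["a", "b", "c"]
seed(db, keys)

# First run: item "a" finishes, then the process "crashes".
db.execute("UPDATE items SET status='done' WHERE key='a'")
db.commit()

# Second run: seeding again does not reset "a", so only "b" and "c" remain.
seed(db, keys)
remaining = pending(db)
print(remaining)
```

Because seeding is itself idempotent, the re-run can call it unconditionally at startup without checking whether a previous run exists.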
For key, use a stable value derived from the input (an article slug, for example). If you assign a per-run UUID, the resume path cannot recognize "the same job" and you end up redoing it anyway. A stable key is the foundation of resumability.
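As a small illustration of the difference, the sketch below derives a key from a record's slug and compares it with per-run UUIDs. The record shape and the stable_key helper are assumptions made for this example, not part of the batch code above.

```python
import uuid

def stable_key(record):
    # Assumes every record carries a unique "slug" field (hypothetical schema).
    return f"article:{record['slug']}"

record = {"slug": "hello-world", "title": "Hello"}

# The same input yields the same key on every run, so a re-run can find its row.
first_run = stable_key(record)
second_run = stable_key(record)

# A per-run UUID differs each time, so the checkpoint sees two unrelated jobs.
random_a = str(uuid.uuid4())
random_b = str(uuid.uuid4())

print(first_run == second_run, random_a == random_b)
```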
Checkpoints alone still leave a hole. If you mark an item claimed, send it to the service, and the process dies right after, the next run might send that same claimed item again. If a run is already executing on the cloud, that is a double execution.
This is where the Managed Agents client_request_id (idempotency key) earns its keep. Call run twice with the same key and the service simply returns the first execution rather than creating a new one. Build the key deterministically from the input.
    import hashlib

    def idempotency_key(item):
        raw = f'{item["key"]}:{item["payload_version"]}'
        return hashlib.sha256(raw.encode()).hexdigest()[:32]

    def process_one(cp, client, item):
        cp.set(item["key"], "claimed")                      # 1. claim
        op = client.agents.run(
            agent="managed-default",
            input=item["payload"],
            client_request_id=idempotency_key(item),        # 2. run (idempotent)
        )
        cp.set(item["key"], "claimed", op_name=op.name)     # record the run ID
        result = poll_until_done(op)
        cp.set(item["key"], "done", result=result)          # 3. commit
        return result
Mixing payload_version into the key keeps a stale run from coming back after you have fixed the input. Adopt the rule "bump the version when you edit the payload," and an intended re-run becomes a genuinely new execution.
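The effect of the version bump can be checked locally without calling the service. This sketch repeats the same hashing scheme on made-up sample items. How the service treats a repeated key is as described above and is not exercised here.

```python
import hashlib

def idempotency_key(item):
    # Same scheme as in the batch code: stable key plus payload version, hashed.
    raw = f'{item["key"]}:{item["payload_version"]}'
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

v1 = {"key": "hello-world", "payload_version": 1}
v1_again = dict(v1)                                   # same item on a re-run
v2 = {"key": "hello-world", "payload_version": 2}     # payload edited, version bumped

same = idempotency_key(v1) == idempotency_key(v1_again)
changed = idempotency_key(v1) != idempotency_key(v2)
print(same, changed)
```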
Recording op_name (the service-side run ID) at the claimed stage lets you resume an item that was "sent but crashed before commit" by polling the existing run rather than launching a new one. That is a second layer of protection alongside the idempotency key.
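One way that resume decision could look is sketched below. How an existing run is fetched by its ID depends on the SDK, so both the launch and the poll step are passed in as callables. resume_or_launch, fake_launch, and fake_poll are hypothetical names for this illustration only.

```python
def resume_or_launch(row, launch, poll):
    """Return the result for one item, reusing an in-flight run when one is recorded.

    row:    dict with "status" and "op_name", as stored in the checkpoint.
    launch: hypothetical callable that starts a new run and returns its run ID.
    poll:   hypothetical callable that waits on a run ID and returns the result.
    """
    if row["status"] == "claimed" and row.get("op_name"):
        # Sent before the crash: poll the existing run instead of launching again.
        return poll(row["op_name"])
    op_name = launch()
    row["status"] = "claimed"
    row["op_name"] = op_name  # in a real batch, persist this before polling
    return poll(op_name)

# Fake callables so the sketch runs without any service.
launched = []

def fake_launch():
    launched.append("op-new")
    return "op-new"

def fake_poll(op_name):
    return {"op": op_name, "ok": True}

crashed_row = {"status": "claimed", "op_name": "op-123"}
fresh_row = {"status": "pending", "op_name": None}

resumed = resume_or_launch(crashed_row, fake_launch, fake_poll)
started = resume_or_launch(fresh_row, fake_launch, fake_poll)
print(resumed, started, launched)
```

Only the fresh row triggers a launch. The crashed row is resolved purely by polling the run ID that was saved at the claimed stage.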
Split failures into transient and permanent
Finally, shape how the batch stops. Halting everything on a transient failure halves the value of the resume design.
    import time
    from google.genai import errors  # APIError lives in the errors module

    TRANSIENT = {429, 500, 502, 503, 504}

    def run_resumable(cp, client, items, max_passes=5):
        cp.seed(items)
        by_key = {it["key"]: it for it in items}
        for attempt in range(max_passes):
            todo = cp.pending_keys()
            if not todo:
                break
            for key in todo:
                try:
                    process_one(cp, client, by_key[key])
                except errors.APIError as e:
                    if e.code in TRANSIENT:
                        cp.set(key, "pending")  # retry on the next pass
                    else:
                        cp.set(key, "failed", result={"error": str(e)})
            time.sleep(min(2 ** attempt, 30))  # backoff between passes
A transient failure like 503 goes back to pending to be picked up on the next pass; a permanent failure like 400 is recorded as failed, and the rest keeps going. Now a human only has to look at the items that stayed failed to the end. The picture where one malformed record stalls the entire batch disappears here.
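For that final human review, a short query over the checkpoint table is enough. The sketch below assumes the same items schema, trimmed to three columns, and uses made-up rows in an in-memory database. failed_report is an illustrative helper name.

```python
import json
import sqlite3

def failed_report(db):
    # List every item that ended as failed, with its recorded error payload.
    cur = db.execute(
        "SELECT key, result FROM items WHERE status = 'failed' ORDER BY key"
    )
    return [(key, json.loads(result) if result else None) for key, result in cur.fetchall()]

# Made-up sample state: one done, one failed, one still pending.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (key TEXT PRIMARY KEY, status TEXT NOT NULL, result TEXT)")
db.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("a", "done", json.dumps({"ok": True})),
        ("b", "failed", json.dumps({"error": "400 malformed input"})),
        ("c", "pending", None),
    ],
)
db.commit()

report = failed_report(db)
print(report)
```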
Capping total retries with max_passes keeps the batch from spiraling into an infinite loop when the network is genuinely down. I settled on 5 passes with exponential backoff between them, capped at 30 seconds.
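To see what those settings amount to, this small sketch lists the delays produced by min(2 ** attempt, 30). backoff_schedule is a helper written for this illustration.

```python
def backoff_schedule(max_passes=5, cap=30):
    # Delay in seconds after each pass: 2 ** attempt, limited to the cap.
    return [min(2 ** attempt, cap) for attempt in range(max_passes)]

schedule = backoff_schedule()
total = sum(schedule)
print(schedule, total)

# With more passes the cap starts to matter.
longer = backoff_schedule(max_passes=7)
print(longer)
```

With 5 passes the delays are 1, 2, 4, 8, and 16 seconds, 31 seconds in total, so the 30-second cap only takes effect if max_passes is raised to 6 or more.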
What the numbers looked like
Before and after adding this resume design, I ran the same 200-item batch several times and compared the "wasted tokens" per failure.
Before, each mid-run failure re-ran, on average, somewhere between 60 and 130 items that had already completed. After, because done items are skipped and only the unfinished items run, wasted tokens per failure fell by roughly 60 percent. Most of that came from no longer paying twice for completed work; the double-launch prevention from the idempotency key affected fewer items, but it reliably erased the occasional double charge.
What helped most, beyond cost, was that re-running stopped being scary. Knowing that a crash just means hitting python batch.py again to continue from where it stopped, I could schedule overnight batches with a calm mind. Cloud execution is convenient, but since the state is out of your sight, owning your own resumability makes operating it feel lighter.
Where to draw the line
One caveat: this design suits the "dozens to a few hundred items, do not want to redo on a crash" range. Once you push past a few thousand items and want more parallelism, moving from SQLite to a queue (or a real job store) is the cleaner path. Conversely, for a dozen items there are times when re-running by hand is just faster than any of this.
My own rule is to add a checkpoint only once I have felt "I just redid this same batch for the second time." Adding the minimum after the pain shows up has, in practice, kept me lighter than building it all up front.
If you want a first step, drop just the Checkpoint class into your longest existing batch and confirm the done skip works. The idempotency key and failure classification can follow afterward without any rush. Thank you for reading; I hope this is a small handhold for anyone who has also melted tokens re-running a batch.