Agents & Manager / 2026-06-14 / Advanced

Making My Managed Agents Batch Survive a Crash Without Redoing Everything

Running a 200-item batch on the Managed Agents API kept torching tokens, because every mid-run failure restarted from item one. Here is the checkpoint-and-idempotency design I added so the batch resumes from where it died.

The first thing that bit me after I started running real volume through the Managed Agents API was this: when a single item failed midway, it dragged the already-finished items down with it and made me redo everything.

I had a batch that asked an agent to reshape 200 article-metadata records one by one. On item 137, the API returned a 503. The script threw, stopped, and I reflexively re-ran python batch.py. It started again from item one. The inference cost for the first 136 items was simply paid twice.

Because Managed Agents finish execution on the cloud side, the in-flight state lives in the service rather than in your local process. Unlike a hands-on CLI agent, you have to design the resume path yourself, and if you do not, these little losses quietly turn into cost. This is the record of adding that resume path piece by piece. The code reflects public-preview behavior as of June 14, 2026.

What was actually causing the redo

A naive batch usually looks like this.

import os
from google import genai
 
client = genai.Client(api_key=os.environ["YOUR_GEMINI_API_KEY"])
 
def run_batch(items):
    results = []
    for item in items:
        op = client.agents.run(
            agent="managed-default",
            input=item["payload"],
        )
        result = poll_until_done(op)  # helper (not shown here) that polls until the run completes
        results.append(result)
    return results

From a resume standpoint, this code has three holes.

First, progress lives only in the in-memory results. If the process dies, the record of how far you got dies with it.

Second, the client.agents.run() call carries no identifier. Throw the same item twice and the service obediently runs it twice. This is where cloud execution differs decisively from local.

Third, it does not distinguish failure types. A transient 503 and a permanent 400 from a malformed input both surface as the same exception and stop the whole run. The former wants wait-and-retry; the latter wants skip-and-record. They deserve different handling.
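To make the third point concrete, here is a minimal sketch of splitting failures into "retry" and "record and skip". The ApiError class, its status_code attribute, and the set of retryable statuses are all assumptions made for illustration; the real client's exception types are not shown in this article, so map them onto whatever your client actually raises.

```python
import time


# Hypothetical error type: stands in for whatever exception the real client
# raises. The status_code attribute is an assumption, not a documented API.
class ApiError(Exception):
    def __init__(self, status_code, message=""):
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


# Assumed split: these statuses are usually worth retrying; other 4xx are not.
TRANSIENT_STATUSES = {429, 500, 502, 503, 504}


def is_transient(exc):
    return isinstance(exc, ApiError) and exc.status_code in TRANSIENT_STATUSES


def call_with_retry(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run fn(); retry transient failures with exponential backoff.

    Returns ("done", value) on success, or ("failed", exc) once the error is
    permanent or the attempts are used up, so the caller can record the
    failure and move on to the next item instead of stopping the batch.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return "done", fn()
        except ApiError as exc:
            if not is_transient(exc) or attempt == max_attempts:
                return "failed", exc
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

The sleep parameter is injectable so the backoff can be tested without waiting. Note that retrying a launch is only safe once the second hole is closed as well; otherwise a retry can start the same run twice.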

Move the checkpoint outside the process

First, push progress out of the process. You do not need anything elaborate; as an indie developer running this on a single machine, a lone SQLite file was plenty.

import sqlite3, json, time
 
class Checkpoint:
    def __init__(self, path="batch_state.db"):
        self.db = sqlite3.connect(path)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS items (
                key TEXT PRIMARY KEY,
                status TEXT NOT NULL,        -- pending / claimed / done / failed
                op_name TEXT,                -- service-side run ID
                result TEXT,
                updated_at REAL
            )
        """)
        self.db.commit()
 
    def seed(self, items):
        for it in items:
            self.db.execute(
                "INSERT OR IGNORE INTO items(key, status, updated_at) VALUES (?, 'pending', ?)",
                (it["key"], time.time()),
            )
        self.db.commit()
 
    def pending_keys(self):
        cur = self.db.execute(
            "SELECT key FROM items WHERE status IN ('pending', 'claimed')"
        )
        return [row[0] for row in cur.fetchall()]
 
    def set(self, key, status, op_name=None, result=None):
        self.db.execute(
            "UPDATE items SET status=?, op_name=COALESCE(?, op_name), "
            "result=COALESCE(?, result), updated_at=? WHERE key=?",
            (status, op_name, json.dumps(result) if result is not None else None, time.time(), key),
        )
        self.db.commit()

The key move is the four status values: pending / claimed / done / failed. A done item is always skipped on re-run, so double payment stops right there. On resume you only process what pending_keys() returns, so a crash on item 137 means the next run starts from item 137.
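The resume behavior described above can be sketched as a small loop. This is a trimmed-down, self-contained stand-in for the store (fewer columns, plain functions), not the full class; the function names and the process callback are hypothetical, and the callback takes the place of the real agent call.

```python
import sqlite3


def open_state(path=":memory:"):
    # Same idea as the items table above, reduced to the columns the loop needs.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "key TEXT PRIMARY KEY, status TEXT NOT NULL, result TEXT)"
    )
    return db


def seed(db, keys):
    # INSERT OR IGNORE leaves existing rows alone, so re-seeding on a re-run
    # never resets an item that is already 'done'.
    db.executemany(
        "INSERT OR IGNORE INTO items(key, status) VALUES (?, 'pending')",
        [(k,) for k in keys],
    )
    db.commit()


def resume(db, process):
    """Process every unfinished key in insertion order, committing per item."""
    rows = db.execute(
        "SELECT key FROM items WHERE status IN ('pending', 'claimed') "
        "ORDER BY rowid"
    ).fetchall()
    for (key,) in rows:
        result = process(key)  # may raise; rows committed earlier stay 'done'
        db.execute(
            "UPDATE items SET status='done', result=? WHERE key=?",
            (result, key),
        )
        db.commit()


def statuses(db):
    # Map each key to its current status, for inspection.
    return dict(db.execute("SELECT key, status FROM items"))
```

If process raises on the third of five keys, the first two are already committed as done, so calling seed and resume again only visits the third, fourth, and fifth. Committing after every item is what makes this work; batching commits would widen the window of lost progress.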

For key, use a stable value derived from the input (an article slug, for example). If you assign a per-run UUID, the resume path cannot recognize "the same job" and you end up redoing it anyway. A stable key is the foundation of resumability.
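One way to derive such a key is sketched below. The slug field name and the item shape are assumptions for illustration; the fallback hashes a canonical JSON form of the input, so the same record yields the same key on every run regardless of dict ordering.

```python
import hashlib
import json


def stable_key(item, id_field="slug"):
    """Return the same key for the same input on every run.

    id_field is a hypothetical field name; use whatever natural identifier
    your records already carry.
    """
    if item.get(id_field):
        return str(item[id_field])
    # Fallback: hash a canonical JSON form so key order cannot change the key.
    canonical = json.dumps(
        item, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

A natural identifier is preferable when one exists, because a content hash changes whenever the payload is edited, and an edited record would then be treated as a new job.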
