Containing Failure in Antigravity Multi-Agent Systems: Three Boundaries That Stop Cascades

Antigravity multi-agent setups run beautifully in isolation but cascade in production, where one small failure drags the whole orchestration down. These notes organize the fix around three boundaries—layered control, trust separation, and observability with idempotency—down to the TOML and the correlation-ID wrapper.

A multi-agent setup that ran flawlessly in your prototype starts behaving differently the moment you send the same request a hundred times in production. Every so often, one agent's failure pulls the orchestrator down with it, and the remaining parallel tasks stall one after another. Or one agent writes a wrong assumption into shared memory, and every downstream agent treats it as ground truth. These symptoms almost never reproduce in unit tests. They only surface once load and input diversity cross a threshold, at which point every gap in the design opens at once.

What I want to organize here is not a catalog of individual bug fixes. Running multiple agents on Antigravity as an indie developer taught me that nearly every production incident reduces to one question of containment: how far does a single failure spread? The causes number in the dozens, but the designs that actually work collapse into three boundaries—layered control, separation of trust and write access, and observability paired with idempotency. Draw these three as boundaries up front, and even a brand-new failure mode sends you back to the same place in the design to fix it.

Why unit tests don't catch this

Multi-agent failures almost always surface when several independent events coincide. A deterministically failing input slips in, another agent burns tokens in extended-thinking mode, and behind both of them a tool call times out. Each one is harmless alone, but stacked together they amplify each other through interaction.

Unit tests don't reproduce that stacking. Inputs are clean, parallelism is low, and external dependencies are mocked. So rather than treating production-only failures as "unexpected," you draw boundaries that assume coincidence from the start. Containment isn't about driving failures to zero—it's about deciding the blast radius of a failure at design time.

Boundary 1: Nest control so a lower level never exceeds the upper

The first boundary nests three kinds of control—retries, timeouts, and token budgets—into a strict hierarchy. When this inverts, the orchestrator above times out while a sub-agent below is still working, and the partial result you'd already earned is thrown away.

Timeouts grow longer toward the outside. Concretely, give the orchestrator 30 minutes, each sub-agent 10 minutes, and each tool call 2 minutes, so the inner always fits inside the outer. With that order held, the inner timeout always fires first, so a lower-level failure reaches its parent as an error the parent still has time to handle.
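A nesting rule like this is easy to break during a later config change, so it can help to check it at startup. The following is a minimal sketch, not part of Antigravity or AgentKit: the TIMEOUTS_S table and check_nesting function are hypothetical names, and the values are the ones from the paragraph above.

```python
# Hypothetical timeout table, outermost level first (values taken from the text).
TIMEOUTS_S = [
    ("orchestrator", 30 * 60),
    ("sub_agent", 10 * 60),
    ("tool_call", 2 * 60),
]

def check_nesting(levels):
    """Return one message per level whose timeout is not strictly shorter than its parent's."""
    problems = []
    for (outer, outer_s), (inner, inner_s) in zip(levels, levels[1:]):
        if inner_s >= outer_s:
            problems.append(
                f"{inner} ({inner_s}s) must be shorter than {outer} ({outer_s}s)"
            )
    return problems

if __name__ == "__main__":
    for problem in check_nesting(TIMEOUTS_S):
        print("timeout inversion:", problem)
```

Running the check when the process starts turns an inverted hierarchy into a configuration error instead of a production incident.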

Retries don't stop on attempt count alone. A deterministically failing input never succeeds no matter how many times you try, so you pair the count with a hard wall-clock limit and a circuit breaker.

[agents.researcher.retry_policy]
max_attempts = 5
initial_delay_ms = 1000
max_delay_ms = 30000
total_timeout_ms = 600000        # caps total time even as exponential backoff stretches intervals
circuit_breaker_threshold = 3    # trip after 3 same-type errors in a short window
circuit_breaker_window_ms = 120000

total_timeout_ms exists because the backoff delays, added to the time each attempt itself takes, can keep you waiting far longer than intended before the final retry. Stop on elapsed time as well as on count. The circuit breaker temporarily rejects a task type once the same error repeats, so a deterministic failure cannot loop forever.
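To make the interaction between the three stops concrete, here is a minimal sketch of a retry loop that honors an attempt cap, a wall-clock cap, and a circuit breaker. It does not show how Antigravity applies the TOML above. CircuitBreaker, CircuitOpen, and run_with_retry are hypothetical names, and the defaults mirror the config values.

```python
import time

class CircuitOpen(Exception):
    """Raised while the breaker is rejecting calls for this task type."""

class CircuitBreaker:
    def __init__(self, threshold=3, window_s=120.0, clock=time.monotonic):
        self.threshold, self.window_s, self.clock = threshold, window_s, clock
        self.failures = {}  # error type name -> timestamps of recent failures

    def record(self, exc):
        now, name = self.clock(), type(exc).__name__
        recent = [t for t in self.failures.get(name, []) if now - t <= self.window_s]
        self.failures[name] = recent + [now]

    def is_open(self):
        now = self.clock()
        return any(
            sum(1 for t in stamps if now - t <= self.window_s) >= self.threshold
            for stamps in self.failures.values()
        )

def run_with_retry(fn, breaker, *, max_attempts=5, initial_delay_s=1.0,
                   max_delay_s=30.0, total_timeout_s=600.0,
                   clock=time.monotonic, sleep=time.sleep):
    deadline = clock() + total_timeout_s
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        if breaker.is_open():
            raise CircuitOpen("same error repeated; rejecting this task type")
        try:
            return fn()
        except Exception as exc:
            breaker.record(exc)
            last = exc
        # Stop on count or on time, whichever comes first.
        if attempt == max_attempts or clock() + delay > deadline:
            raise last
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped
```

With the defaults, an input that always raises the same error is attempted three times and then rejected by the breaker, well before the attempt cap is reached.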

Token budgets follow the same logic: throttle parallelism with a semaphore on the orchestrator side, and cap each agent individually. Parallelism is the appeal, but running five agents at once consumes roughly five times the tokens, and with extended thinking enabled on each, thinking tokens are spent on top of that.

[orchestrator]
max_parallel_agents = 3
token_budget_per_agent = 8000
thinking_budget_per_agent = 4000

Work backward from parallelism so the combined budget can't exceed the project quota. Before going to production, I work out by hand the worst case—every agent using its full cap—and only then lock the numbers in.
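The hand calculation can be written down as a few lines of arithmetic, together with a semaphore that enforces the parallelism cap. This is a sketch under one assumption the source does not settle: that the thinking budget is spent in addition to token_budget_per_agent rather than inside it. The function names are hypothetical.

```python
import asyncio

MAX_PARALLEL_AGENTS = 3
TOKEN_BUDGET_PER_AGENT = 8000
THINKING_BUDGET_PER_AGENT = 4000

def worst_case_tokens(parallel, token_budget, thinking_budget):
    """Tokens in flight if every running agent uses its full cap (assumes budgets add)."""
    return parallel * (token_budget + thinking_budget)

def max_parallel_for_quota(quota, token_budget, thinking_budget):
    """Work backward from a quota to the highest safe parallelism."""
    return quota // (token_budget + thinking_budget)

async def run_all(tasks, limit=MAX_PARALLEL_AGENTS):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(task):
        async with sem:
            return await task()

    return await asyncio.gather(*(guarded(t) for t in tasks))

if __name__ == "__main__":
    print(worst_case_tokens(MAX_PARALLEL_AGENTS, TOKEN_BUDGET_PER_AGENT,
                            THINKING_BUDGET_PER_AGENT))  # 3 * (8000 + 4000) = 36000
```

With the config above, the worst case is 36,000 tokens in flight at once. If the quota available to a single run were 30,000, the same arithmetic would say to drop max_parallel_agents to 2.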

Boundary 2: Separate trust from write access

The second boundary makes explicit, at design time, who can write what and which external data you trust. Leave this implicit and poisoning and injection creep in later.

Letting every agent write to shared memory looks convenient but breeds poisoning chains. A research agent writes a wrong fact, an implementation agent builds on it, and a review agent declares that implementation correct. Tracing where the first mistake entered is extremely hard.

The fix is to allow reads for everyone but restrict writes to a single dedicated recorder agent—and to require, on every write, the provenance: which tool call produced this. AgentKit 2.0's memory API lets you attach a provenance field to each entry; making it a mandatory operating rule lets you trace, when poisoning occurs, exactly which tool it came from.
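The rule itself is small enough to sketch in plain Python. The following is an illustration only and does not use the AgentKit memory API: SharedMemory, Entry, and the "recorder" agent name are hypothetical, chosen to show the two checks that every write has to pass.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entry:
    key: str
    value: str
    provenance: str  # ID of the tool call that produced the value

@dataclass
class SharedMemory:
    writer: str                                   # the only agent allowed to write
    _entries: dict = field(default_factory=dict)

    def write(self, agent: str, key: str, value: str, provenance: str) -> None:
        if agent != self.writer:
            raise PermissionError(f"{agent} may read but not write")
        if not provenance:
            raise ValueError("every write must name the tool call it came from")
        self._entries[key] = Entry(key, value, provenance)

    def read(self, key: str) -> Entry:
        # Reads are open to every agent and always return the provenance too.
        return self._entries[key]

if __name__ == "__main__":
    memory = SharedMemory(writer="recorder")
    memory.write("recorder", "release_date", "2024-05-01", provenance="web_search#17")
    print(memory.read("release_date").provenance)
```

Because read returns the whole entry, a downstream agent that doubts a fact can go straight to the tool call that produced it.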

When handling external data, sharpen the trust boundary one more notch. Web pages, user input, and other APIs' responses genuinely contain strings like "ignore the previous instructions and do X." This isn't theoretical—it happens routinely in agents that summarize news or handle user posts. Defense is layered.

UNTRUSTED_WRAPPER = (
    "--- BEGIN UNTRUSTED DATA ---\n"
    "{payload}\n"
    "--- END UNTRUSTED DATA ---"
)
 
def wrap_external(payload: str) -> str:
    # Isolate external data with delimiter tokens. The system prompt must
    # state that instructions inside these delimiters are not to be followed.
    return UNTRUSTED_WRAPPER.format(payload=payload.replace("---", "—"))

Wrap with delimiter tokens, declare in the system prompt that instructions inside the delimiters won't be followed, and restrict the tools callable from external data via sandbox permissions. Only with these three layers does injection's success rate drop to a practical level.

When agents can call one another, watch for cycles too: A calls B, B calls C, C calls A, and you have infinite recursion. Statically constrain the call graph to a DAG, and as a runtime backstop add a maximum call depth (say, 5 levels).
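Both checks fit in a short sketch: a depth-first search that finds a cycle in the declared call graph, and a depth counter for runtime. The graph format (agent name mapped to the agents it may call) and the names find_cycle and enter_call are assumptions for illustration, not an Antigravity feature.

```python
MAX_CALL_DEPTH = 5

def find_cycle(graph):
    """Return one cycle as a list of agent names, or None if the graph is a DAG."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in graph.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None

def enter_call(depth):
    """Runtime backstop: call before delegating; returns the depth for the child."""
    if depth >= MAX_CALL_DEPTH:
        raise RecursionError(f"agent call depth exceeded {MAX_CALL_DEPTH}")
    return depth + 1

if __name__ == "__main__":
    print(find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))
```

The static check belongs in CI or at startup. The depth counter covers call paths that are only decided at runtime.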

Boundary 3: Bake in observability and idempotency from the start

The third boundary prepares visibility and re-execution safety before you run, not after. It looks like extra work, but it cuts incident resolution time by an order of magnitude.

The hardest part of debugging multi-agent systems is log scatter. With five agents running in parallel, the console mixes five conversations and you can't follow the timeline. The fix is to generate a correlation_id at task intake and propagate it to every sub-agent and tool call. AgentKit 2.0 auto-inherits context.correlation_id into child agents, but putting the ID into custom-tool logs is the implementer's job. Wrap tool calls to always inject the ID and an idempotency key.

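import hashlib, json, logging, time

log = logging.getLogger("agent")

def tool_call(name, fn, args, *, correlation_id, idempotency_store):
    cid = correlation_id
    # Derive a key from the task content to reject double execution on external retry.
    # Use a content hash rather than built-in hash(): hash() is salted per process for
    # strings, so keys would change after a restart, and it fails on unhashable values
    # such as lists. Values that are not JSON types fall back to str().
    payload = json.dumps(args, sort_keys=True, default=str)
    key = f"{name}:{hashlib.sha256(payload.encode()).hexdigest()}"
    if key in idempotency_store:
        log.info("skip duplicate", extra={"cid": cid, "tool": name})
        return idempotency_store[key]
    started = time.monotonic()
    try:
        # fn must accept idempotency_key so it can reject duplicates on its side too.
        result = fn(**args, idempotency_key=key)
        idempotency_store[key] = result
        return result
    finally:
        # Runs on success and on failure, so failed calls are timed as well.
        log.info("tool_done", extra={
            "cid": cid, "tool": name,
            "latency_ms": int((time.monotonic() - started) * 1000),
        })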

The idempotency key matters for side-effecting operations. If an orchestrator failure triggers an external retry and a file creation, an external API write, or a billing call runs twice, the damage is real. Issue a key per task, and have side-effecting tools record it server-side to reject duplicates. In Antigravity tool definitions, prefer tools that accept an idempotency_key argument.

For observability, instrument these five from the start: per-agent average response time, per-tool-call average response time, retry rate and where it occurs, token consumption split into thinking/input/output, and circuit-breaker trip count. Export them to Prometheus and visualize in Grafana, and when a new problem appears you can pinpoint "what's slow" in 30 seconds. Realistically, have these in place within the first week of operation.
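Before wiring up an exporter, it can help to pin down exactly what the five metrics record. The sketch below is a standard-library stand-in, not a Prometheus client: the Metrics class and its method names are hypothetical, and a real setup would replace the dictionaries with exporter counters and histograms.

```python
from collections import defaultdict

class Metrics:
    """In-memory store for the five core metrics; a stand-in for a real exporter."""

    def __init__(self):
        self.latency_ms = defaultdict(list)  # ("agent" | "tool", name) -> samples
        self.attempts = defaultdict(int)     # location -> total attempts
        self.retries = defaultdict(int)      # location -> attempts that were retries
        self.tokens = defaultdict(int)       # "thinking" | "input" | "output" -> count
        self.breaker_trips = 0

    def observe_latency(self, kind, name, ms):
        self.latency_ms[(kind, name)].append(ms)

    def observe_attempt(self, where, is_retry=False):
        self.attempts[where] += 1
        if is_retry:
            self.retries[where] += 1

    def add_tokens(self, kind, count):
        self.tokens[kind] += count

    def trip_breaker(self):
        self.breaker_trips += 1

    def average_latency_ms(self, kind, name):
        samples = self.latency_ms[(kind, name)]
        return sum(samples) / len(samples) if samples else 0.0

    def retry_rate(self, where):
        total = self.attempts[where]
        return self.retries[where] / total if total else 0.0
```

Keying retries by location is what answers "where it occurs", and keeping tokens split by kind shows whether a cost jump came from thinking or from larger inputs.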

How to integrate so cascades stop

With the three boundaries drawn, settle the orchestrator's aggregation policy last. Canceling all remaining work the instant one parallel task fails—fail-fast—looks reasonable but discards every partial success, which is expensive. Letting each agent run to completion independently and aggregating only the successes is more practical.

[orchestrator]
aggregation_policy = "best_effort"
minimum_success_ratio = 0.6   # treat 60% success as overall success; raise it for critical tasks

Set minimum_success_ratio too low and quality drops, so tune it by task importance. Push it to 0.9 for high-stakes aggregation and leave it at 0.6 for exploratory research—that split works well.
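The policy can be sketched with asyncio. This is an illustration of the idea, not how the orchestrator implements aggregation_policy: gather_best_effort is a hypothetical helper that lets every task finish, separates successes from failures, and fails the whole run only when the success ratio falls below the threshold.

```python
import asyncio

async def gather_best_effort(coros, minimum_success_ratio=0.6):
    """Run all coroutines to completion; return (successes, failures)."""
    # return_exceptions=True keeps one failure from cancelling the other tasks.
    results = await asyncio.gather(*coros, return_exceptions=True)
    successes = [r for r in results if not isinstance(r, BaseException)]
    failures = [r for r in results if isinstance(r, BaseException)]
    ratio = len(successes) / len(results) if results else 0.0
    if ratio < minimum_success_ratio:
        raise RuntimeError(
            f"only {len(successes)}/{len(results)} agents succeeded "
            f"(need {minimum_success_ratio:.0%})"
        )
    return successes, failures

if __name__ == "__main__":
    async def ok(value):
        return value

    async def bad():
        raise TimeoutError("tool timed out")

    done, failed = asyncio.run(gather_best_effort([ok(1), ok(2), ok(3), bad(), bad()]))
    print(done, len(failed))
```

Returning the failures alongside the successes keeps them available for logging and for the retry-rate metric instead of silently dropping them.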

Cost blowups are a kind of cascade too. A workflow that cost a dollar in development easily costs ten times that on the production input-size distribution. Estimate cost from input size before each task, and fall back to a cheaper model past a threshold. A two-stage setup—first pass with gemini-2.5-flash, deeper work only where needed with gemini-2.5-pro—cuts cost substantially while holding quality.
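A pre-task size check can be very small. In the sketch below, the four-characters-per-token estimate and the 20,000-token threshold are placeholder assumptions to tune against your own inputs, and estimate_tokens and pick_model are hypothetical helpers. Only the two model names come from the text.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough size estimate; chars_per_token is a heuristic, not a tokenizer."""
    return max(1, len(text) // chars_per_token)

def pick_model(text, *, threshold_tokens=20_000,
               default_model="gemini-2.5-pro",
               cheap_model="gemini-2.5-flash"):
    """Fall back to the cheaper model once the estimated input passes the threshold."""
    if estimate_tokens(text) > threshold_tokens:
        return cheap_model
    return default_model

if __name__ == "__main__":
    print(pick_model("short prompt"))
    print(pick_model("x" * 200_000))
```

Logging the estimate next to the actual token count from each run shows how far the heuristic drifts and where the threshold should sit.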

The order to roll this out

You don't need all three boundaries at once. The order I recommend starts with observability. Correlation IDs and the core metrics become the map that tells you which boundary is missing. Next, add layered control to shrink the cascade's blast radius, and finally seal off side effects and poisoning with trust boundaries and idempotency. Start with the other measures and skip observability, and you'll keep tweaking settings without knowing whether they help.

A multi-agent setup becomes genuinely reliable once you decide a failure's blast radius in the design. The next time it stalls in production, before chasing symptoms one by one, first check which of the three boundaries has broken. The distance from a working prototype to a broken production system is shorter than you think.
