Antigravity Lab · Agents & Manager · 2026-06-15 · Advanced

Containing Failure in Antigravity Multi-Agent Systems: Three Boundaries That Stop Cascades

Antigravity multi-agent setups run beautifully in isolation but cascade in production, where one small failure drags the whole orchestration down. These notes organize the fix around three boundaries—layered control, trust separation, and observability with idempotency—down to the TOML and the correlation-ID wrapper.



A multi-agent setup that ran flawlessly in your prototype starts behaving differently the moment you send the same request a hundred times in production. Every so often, one agent's failure pulls the orchestrator down with it, and the remaining parallel tasks stall one after another. Or one agent writes a wrong assumption into shared memory, and every downstream agent treats it as ground truth. These symptoms almost never reproduce in unit tests. They only surface once load and input diversity cross a threshold, at which point every gap in the design opens at once.

What I want to organize here is not a catalog of individual bug fixes. Running multiple agents on Antigravity as an indie developer taught me that nearly every production incident reduces to one question of containment: how far does a single failure spread? The causes number in the dozens, but the designs that actually work collapse into three boundaries—layered control, separation of trust and write access, and observability paired with idempotency. Draw these three as boundaries up front, and even a brand-new failure mode sends you back to the same place in the design to fix it.

Why unit tests don't catch this

Multi-agent failures almost always surface when several independent events coincide. A deterministically failing input slips in, another agent burns tokens in extended-thinking mode, and behind both of them a tool call times out. Each one is harmless alone, but stacked together they amplify each other through interaction.

Unit tests don't reproduce that stacking. Inputs are clean, parallelism is low, and external dependencies are mocked. So rather than treating production-only failures as "unexpected," you draw boundaries that assume coincidence from the start. Containment isn't about driving failures to zero—it's about deciding the blast radius of a failure at design time.

Boundary 1: Nest control so a lower level never exceeds the upper

The first boundary nests three kinds of control (retries, timeouts, and token budgets) into a strict hierarchy. When the hierarchy inverts, the orchestrator above times out while a sub-agent below is still working, and the partial result that sub-agent had already produced is thrown away.

Timeouts grow longer toward the outside. Concretely, give the orchestrator 30 minutes, each sub-agent 10 minutes, and each tool call 2 minutes, so the inner always fits inside the outer. With this ordering the innermost timeout always fires first, so a lower-level failure reaches the layer above while that layer still has time left to handle it.
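The nesting rule is easy to check mechanically before a config ships. The following is a minimal sketch, not an Antigravity API: the layer names and the check_nesting helper are hypothetical, and only the three durations come from the text above.

```python
from datetime import timedelta

# Hypothetical layer names; the durations are the ones quoted above.
# Layers are listed outermost first.
TIMEOUTS = [
    ("orchestrator", timedelta(minutes=30)),
    ("sub_agent", timedelta(minutes=10)),
    ("tool_call", timedelta(minutes=2)),
]


def check_nesting(layers):
    """Return one message per adjacent pair whose inner timeout is not
    strictly shorter than the timeout of the layer enclosing it."""
    violations = []
    for (outer_name, outer), (inner_name, inner) in zip(layers, layers[1:]):
        if inner >= outer:
            violations.append(
                f"{inner_name} ({inner}) must be shorter than {outer_name} ({outer})"
            )
    return violations


if __name__ == "__main__":
    problems = check_nesting(TIMEOUTS)
    print("nesting ok" if not problems else "\n".join(problems))
```

Running a check like this at startup, or in CI against the config file, catches an inverted hierarchy before it costs a partial result in production.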

Retries should not be bounded by attempt count alone. A deterministically failing input never succeeds no matter how many times you try, so you pair the count with a hard wall-clock limit and a circuit breaker.

[agents.researcher.retry_policy]
max_attempts = 5
initial_delay_ms = 1000
max_delay_ms = 30000
total_timeout_ms = 600000        # caps total time even as exponential backoff stretches intervals
circuit_breaker_threshold = 3    # trip after 3 same-type errors in a short window
circuit_breaker_window_ms = 120000

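The reason for total_timeout_ms is that an attempt count says nothing about elapsed time: each attempt can run for as long as its own timeout allows, and the backoff waits stack on top, so the final retry can land far later than you intended. Stop on time as well as on count. The circuit breaker temporarily rejects a task type once the same error repeats, which keeps a deterministic failure from being retried over and over.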
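To make the interaction of the three stop conditions concrete, here is a minimal Python sketch of a retry loop that honors an attempt cap, a total deadline, and a breaker. It is an illustration under stated assumptions, not how Antigravity implements retry_policy: CircuitBreaker, CircuitOpen, and call_with_retry are hypothetical names, the doubling backoff factor is assumed (the TOML only gives the initial and maximum delay), and one breaker instance is assumed per task type.

```python
import time
from collections import deque


class CircuitOpen(Exception):
    """Raised when the breaker is currently rejecting calls."""


class CircuitBreaker:
    """Hypothetical breaker: opens after `threshold` errors of the same
    exception type within `window_s` seconds."""

    def __init__(self, threshold=3, window_s=120.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.clock = clock
        self.failures = {}  # exception type name -> deque of timestamps

    def record(self, error):
        now = self.clock()
        stamps = self.failures.setdefault(type(error).__name__, deque())
        stamps.append(now)
        # Drop timestamps that have aged out of the window.
        while stamps and now - stamps[0] > self.window_s:
            stamps.popleft()

    def is_open(self):
        now = self.clock()
        return any(
            sum(1 for t in stamps if now - t <= self.window_s) >= self.threshold
            for stamps in self.failures.values()
        )


def call_with_retry(fn, breaker, max_attempts=5, initial_delay_s=1.0,
                    max_delay_s=30.0, total_timeout_s=600.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Call fn() until it succeeds, the attempt cap is hit, the next wait
    would cross the total deadline, or the breaker opens."""
    deadline = clock() + total_timeout_s
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        if breaker.is_open():
            raise CircuitOpen("same error repeated; rejecting this task type")
        try:
            return fn()
        except Exception as exc:
            breaker.record(exc)
            last_error = exc
        # Stop on count OR on time: never sleep past the deadline.
        if attempt == max_attempts or clock() + delay > deadline:
            raise last_error
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # assumed doubling backoff
```

With the defaults above, which mirror the TOML values, a deterministically failing call is attempted three times and then rejected by the breaker, well before the fifth attempt. Injecting clock and sleep keeps the loop testable without real waiting.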

Token budgets follow the same logic: throttle parallelism with a semaphore on the orchestrator side, and cap each agent individually. Parallelism is the appeal, but running five agents at once simply consumes five times the tokens—and if each has extended thinking enabled, consumption grows nonlinearly.

[orchestrator]
max_parallel_agents = 3
token_budget_per_agent = 8000
thinking_budget_per_agent = 4000
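The semaphore throttle mentioned above can be sketched with Python's asyncio. This is an illustration only: run_all is a hypothetical helper, not an AgentKit function, and the limit simply mirrors max_parallel_agents from the TOML.

```python
import asyncio

MAX_PARALLEL_AGENTS = 3  # mirrors max_parallel_agents in the TOML above


async def run_all(agent_factories, limit=MAX_PARALLEL_AGENTS):
    """Run zero-argument coroutine factories with at most `limit` in flight.

    Results come back in input order. An agent that raises is returned as
    an exception object rather than cancelling its siblings.
    """
    sem = asyncio.Semaphore(limit)

    async def guarded(factory):
        async with sem:
            return await factory()

    return await asyncio.gather(
        *(guarded(f) for f in agent_factories),
        return_exceptions=True,
    )
```

Passing return_exceptions=True is a deliberate choice for containment: one failed agent shows up as a value in the result list, and the other agents run to completion.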

Work backward from parallelism so the combined budget can't exceed the project quota. Before going to production, I work out by hand the worst case—every agent using its full cap—and only then lock the numbers in.
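The by-hand worst case is short enough to write down as arithmetic. The sketch below assumes the thinking budget is counted on top of the token budget (the TOML does not say whether one includes the other), and the 50,000-token quota is a made-up figure for illustration; both function names are hypothetical.

```python
def worst_case_tokens(max_parallel_agents, token_budget_per_agent,
                      thinking_budget_per_agent):
    """Tokens consumed if every parallel agent uses its full caps.

    Assumes the two budgets are additive, which the config does not confirm.
    """
    per_agent = token_budget_per_agent + thinking_budget_per_agent
    return max_parallel_agents * per_agent


def max_parallel_for_quota(quota, token_budget_per_agent,
                           thinking_budget_per_agent):
    """Work backward: the largest parallelism that still fits the quota."""
    per_agent = token_budget_per_agent + thinking_budget_per_agent
    return quota // per_agent


# Values from the [orchestrator] block above.
print(worst_case_tokens(3, 8000, 4000))        # 36000
# Hypothetical project quota of 50,000 tokens.
print(max_parallel_for_quota(50_000, 8000, 4000))  # 4
```

If the budgets turn out to be nested rather than additive, the per-agent figure drops to 8,000 and the same two functions still apply with the thinking budget set to zero.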
