Diagnose and recover from AgentKit 2.0 multi-agent failures—deadlocks, orchestrator loops, silent sub-agent failures, and context contamination. Includes production-ready code for fault-tolerant agent system design.
Single-agent problems are straightforward. Multi-agent problems aren't. When you move to AgentKit 2.0's orchestrated systems, a new class of failures appears: one agent waiting indefinitely for another that's waiting for it back, an orchestrator issuing the same task assignment in a loop, a sub-agent completing silently without doing anything useful, or two agents writing to shared context simultaneously and corrupting each other's data.
These failures are distinct from the runtime errors covered in other guides. They emerge from coordination—from the relationships between agents rather than from any single agent's behavior.
This guide documents seven failure patterns, how to diagnose each one from logs, how to recover in production, and how to design systems that are more resilient to these failures from the start.
The Seven Coordination Failure Patterns
Naming the pattern accurately is the fastest path to fixing it.
Pattern 1: Agent Deadlock
Agent A is waiting for Agent B to complete. Agent B is waiting for Agent A's output. Both wait indefinitely. The system appears frozen.
How to spot it: Two or more agents show waiting_for_agent status at the same timestamp in the activity log.
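A deadlock is a cycle in the "who is waiting on whom" graph, so it can be confirmed mechanically once the waiting relationships are pulled from the log. The following is a minimal sketch in plain Python, not AgentKit API: the `waits_on` mapping and the `find_wait_cycle` helper are hypothetical names used for illustration, and the sketch assumes each agent waits on at most one other agent.

```python
from typing import Dict, List, Optional


def find_wait_cycle(waits_on: Dict[str, str]) -> Optional[List[str]]:
    """Return one cycle of agent names in a wait-for graph, or None.

    `waits_on` maps each waiting agent to the single agent it is blocked on.
    This is a hypothetical structure you would build from the activity log.
    """
    for start in waits_on:
        path: List[str] = []
        seen_at: Dict[str, int] = {}
        node: Optional[str] = start
        # Follow the chain of waits until it ends or revisits an agent.
        while node is not None and node not in seen_at:
            seen_at[node] = len(path)
            path.append(node)
            node = waits_on.get(node)
        if node is not None:
            # The chain came back to an agent already on the path: a cycle.
            return path[seen_at[node]:]
    return None


cycle = find_wait_cycle({"researcher": "writer", "writer": "researcher"})
print(cycle)
```

A chain that simply ends (A waits on B, B waits on nothing) returns `None`, which points to a slow agent rather than a true deadlock.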
Pattern 2: Orchestrator Decision Loop
The Manager agent keeps generating task assignments to sub-agents, but no tasks actually progress. The same assign_task action appears repeatedly in the orchestrator's log.
Pattern 3: Silent Sub-Agent Failure
A sub-agent returns completed status but its output is empty or nonsensical. The orchestrator trusts the result and continues, compounding the error downstream.
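One way to stop an orchestrator from trusting an empty result is to validate the output before passing it downstream. The sketch below is illustrative only: `SubAgentResult`, `validate_result`, and the thresholds are assumptions, not part of AgentKit, and real checks would be specific to each task.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SubAgentResult:
    """Hypothetical container for what a sub-agent hands back."""
    agent: str
    status: str
    output: str


def validate_result(
    result: SubAgentResult,
    min_chars: int = 20,
    required_terms: Sequence[str] = (),
) -> List[str]:
    """Return a list of problems; an empty list means the result looks usable."""
    problems: List[str] = []
    if result.status != "completed":
        problems.append(f"status is {result.status!r}, not 'completed'")
    text = result.output.strip()
    if len(text) < min_chars:
        problems.append(f"output has {len(text)} chars, expected at least {min_chars}")
    lowered = text.lower()
    missing = [term for term in required_terms if term.lower() not in lowered]
    if missing:
        problems.append(f"output is missing expected terms: {missing}")
    return problems


empty = SubAgentResult(agent="summarizer", status="completed", output="   ")
print(validate_result(empty))
```

Length and keyword checks only catch the crudest failures, but they are cheap enough to run on every hand-off.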
Pattern 4: Context Contamination
Multiple agents write to shared context simultaneously. One agent's data overwrites another's, producing inconsistent or corrupted state that subsequent agents reason from incorrectly.
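To make the overwrite problem concrete, here is a small sketch of a shared store that rejects stale writes using a per-key version number (optimistic concurrency). `VersionedContext` is a hypothetical class written for this illustration; it is not how AgentKit's shared context is implemented.

```python
import threading
from typing import Any, Dict, Tuple


class VersionedContext:
    """Toy shared context where every key carries a version number."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, Tuple[int, Any]] = {}

    def read(self, key: str) -> Tuple[int, Any]:
        """Return (version, value); missing keys read as version 0."""
        with self._lock:
            return self._data.get(key, (0, None))

    def write(self, key: str, value: Any, expected_version: int) -> bool:
        """Write only if nobody else has written since the caller's read."""
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False  # stale write: caller must re-read and retry
            self._data[key] = (current_version + 1, value)
            return True


ctx = VersionedContext()
version, _ = ctx.read("plan")
print(ctx.write("plan", "agent A's plan", version))  # first writer succeeds
print(ctx.write("plan", "agent B's plan", version))  # stale writer is rejected
```

The rejected writer has to re-read and decide what to do, which is the point: the conflict becomes visible instead of silently corrupting state.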
Pattern 5: Cascade Failure
One agent's failure propagates to dependent agents, which fail because their expected inputs never arrived, which causes their dependents to fail, and so on.
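When one agent fails, it helps to know up front which agents sit downstream of it so they can be paused rather than left to fail one by one. The sketch below assumes you can express dependencies as a plain dictionary; `downstream_of` and the agent names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_of(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every agent that directly or indirectly depends on `failed`.

    `depends_on` maps an agent to the agents whose output it needs.
    """
    # Invert the mapping: for each agent, who consumes its output?
    dependents: Dict[str, List[str]] = {}
    for agent, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(agent)

    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


deps = {"fetch": [], "parse": ["fetch"], "summarize": ["parse"], "audit": []}
print(sorted(downstream_of("fetch", deps)))
```

Agents outside the returned set (here, the independent `audit` agent) can keep running while the failure is handled.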
Pattern 6: Resource Contention
Multiple agents hit the same external resource—an API, file, or database—simultaneously. Rate limits are triggered, file locks conflict, or writes collide.
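A common mitigation is to put a concurrency cap in front of the shared resource. The sketch below uses the standard library's `asyncio.Semaphore`; the `LimitedResource` wrapper is a hypothetical name, and the `asyncio.sleep` call stands in for a real API, file, or database call.

```python
import asyncio


class LimitedResource:
    """Wraps a shared resource so at most `max_concurrent` agents use it at once."""

    def __init__(self, max_concurrent: int) -> None:
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self.in_flight = 0
        self.peak_in_flight = 0

    async def call(self, agent_name: str) -> str:
        async with self._semaphore:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                await asyncio.sleep(0.01)  # placeholder for the real call
                return f"{agent_name}: ok"
            finally:
                self.in_flight -= 1


async def run_agents(n_agents: int, max_concurrent: int) -> int:
    """Run n_agents against one resource and report peak concurrency."""
    resource = LimitedResource(max_concurrent)
    await asyncio.gather(*(resource.call(f"agent-{i}") for i in range(n_agents)))
    return resource.peak_in_flight


print(asyncio.run(run_agents(n_agents=6, max_concurrent=2)))
```

A semaphore limits simultaneous calls but not calls per minute; a provider with a strict rate limit would need a token bucket or similar on top.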
Pattern 7: Context Window Exhaustion
In long-running sessions, an agent's context window fills up and early instructions get truncated. Agent behavior degrades progressively as critical context is lost.
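One way to keep early instructions from being truncated is to pin them and drop the oldest conversational turns instead. The following sketch assumes messages are simple `(role, text)` pairs and uses a whitespace word count as a crude stand-in for a real tokenizer; `trim_history` and `estimate_tokens` are hypothetical helpers.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, text)


def estimate_tokens(text: str) -> int:
    """Very rough token estimate: one token per whitespace-separated word."""
    return len(text.split())


def trim_history(messages: List[Message], budget: int) -> List[Message]:
    """Keep all system messages, then as many recent messages as fit the budget."""
    pinned = [m for m in messages if m[0] == "system"]
    rest = [m for m in messages if m[0] != "system"]

    used = sum(estimate_tokens(text) for _, text in pinned)
    kept: List[Message] = []
    # Walk backwards so the most recent turns are kept first.
    for role, text in reversed(rest):
        cost = estimate_tokens(text)
        if used + cost > budget:
            break
        kept.append((role, text))
        used += cost
    return pinned + list(reversed(kept))


history = [
    ("system", "always cite sources"),
    ("user", "one two three"),
    ("assistant", "four five"),
    ("user", "six"),
]
print(trim_history(history, budget=6))
```

The pinned instructions always survive, even when that leaves little room for recent turns, so the budget should be chosen with the size of the system prompt in mind.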
Diagnosis: Identifying the Pattern from Logs
Pull the activity log
import antigravity_sdk as ag

# Enable verbose logging for development
ag.set_log_level("DEBUG")

# Load and inspect session
session = ag.Session.load("YOUR_SESSION_ID")
print(session.get_activity_log(verbose=True))

# Inspect tool call history per agent
for agent_name, agent in session.agents.items():
    print(f"\n=== {agent_name} ===")
    for call in agent.tool_calls:
        print(f"  {call.timestamp}: {call.tool} -> {call.status}")
Deadlock signature: Multiple agents in waiting_for_agent state at the same timestamp
Loop signature: The same tool call (especially assign_task) repeated many times with the same parameters
Silent failure signature: Agent shows completed but output length is zero or near-zero
Cascade signature: Errors appear at consecutive timestamps across different agents
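The first three signatures above are simple enough to check automatically. The sketch below runs them over a list of log records; the `LogEntry` shape, the `classify` function, and the thresholds are assumptions made for illustration, so map them onto whatever fields your exported activity log actually contains.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical, simplified shape of one activity-log record."""
    timestamp: int
    agent: str
    event: str            # e.g. "waiting_for_agent", "assign_task", "completed"
    params: str = ""
    output_chars: int = 0


def classify(
    entries: Iterable[LogEntry],
    loop_threshold: int = 5,
    min_output_chars: int = 10,
) -> Set[str]:
    """Return the set of failure signatures found in the log entries."""
    entries = list(entries)
    findings: Set[str] = set()

    # Deadlock: two or more agents waiting at the same timestamp.
    waiting: Dict[int, Set[str]] = defaultdict(set)
    for e in entries:
        if e.event == "waiting_for_agent":
            waiting[e.timestamp].add(e.agent)
    if any(len(agents) >= 2 for agents in waiting.values()):
        findings.add("deadlock")

    # Loop: the same assign_task call repeated with identical parameters.
    repeats = Counter(
        (e.agent, e.params) for e in entries if e.event == "assign_task"
    )
    if any(count >= loop_threshold for count in repeats.values()):
        findings.add("loop")

    # Silent failure: completed, but with little or no output.
    if any(e.event == "completed" and e.output_chars < min_output_chars for e in entries):
        findings.add("silent_failure")

    return findings


log = [LogEntry(t, "manager", "assign_task", "task=42") for t in range(5)]
print(classify(log))
```

A match is a hint, not a verdict: two agents waiting at the same timestamp may each be waiting on a third agent that is merely slow.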
Check for stuck waiting states
for agent_name, agent in session.agents.items():
    state = agent.get_state()
    print(f"{agent_name}: {state.status} (elapsed: {state.elapsed_seconds}s)")
    if state.status == "waiting" and state.elapsed_seconds > 300:
        print("  ⚠️ Possible deadlock or timeout: waiting 5+ minutes")
Resolving a deadlock
def resolve_deadlock(session_id: str, deadlocked_agents: list[str]):
    session = ag.Session.load(session_id)

    # Step 1: Force-stop deadlocked agents
    for agent_name in deadlocked_agents:
        agent = session.agents[agent_name]
        agent.force_stop(reason="deadlock_detected")
        print(f"✓ Stopped {agent_name}")

    # Step 2: Release all resource locks
    session.release_all_locks()
    print("✓ Released all locks")

    # Step 3: Restart in dependency order (non-circular)
    task_graph = session.get_task_dependency_graph()
    restart_order = task_graph.topological_sort()
    for task in restart_order:
        if task.name in deadlocked_agents:
            print(f"→ Restarting {task.name}...")
            task.restart(clear_waiting_state=True)
Fixing an orchestrator loop
from collections import Counter

def fix_orchestrator_loop(session_id: str, orchestrator_name: str):
    session = ag.Session.load(session_id)
    orchestrator = session.agents[orchestrator_name]

    # Diagnose the loop
    recent_actions = orchestrator.get_recent_actions(n=20)
    if not recent_actions:
        # Nothing logged yet, so there is no loop to detect
        print("No recent actions recorded.")
        return
    counts = Counter(a.type for a in recent_actions)
    print("Last 20 action breakdown:", counts)

    # If the same action appears 10+ times, it's a loop
    if counts.most_common(1)[0][1] >= 10:
        print("⚠️ Loop detected. Clearing context and restarting...")
        orchestrator.clear_context(preserve_initial_instructions=True)
        orchestrator.restart_with_new_context(
            additional_context="Previous task assignments have been reset. "
                               "Check which tasks are already completed and "
                               "resume from the first incomplete task."
        )
Repairing context contamination
def fix_context_contamination(session_id: str):
    session = ag.Session.load(session_id)

    # Pause all agents
    for agent in session.agents.values():
        agent.pause()

    # Review write history
    history = session.shared_context.get_write_history()
    print(f"Write history ({len(history)} entries):")
    for entry in history[-10:]:
        print(f"  {entry.timestamp} | {entry.agent} → {entry.key}: {str(entry.value)[:50]}")

    # Restore to last known-good state
    last_good = history[-5]  # Adjust based on when contamination started
    session.shared_context.restore(last_good.id)
    print(f"✓ Context restored to {last_good.timestamp}")

    # Enable write locking to prevent recurrence
    session.shared_context.enable_write_locking()
    for agent in session.agents.values():
        agent.resume()
Fault-Tolerant Design: Preventing Recurrence
Recovery is reactive. Good design prevents the failure in the first place.
Deadlock-safe task assignment
from typing import Optional

class DeadlockSafeOrchestrator:
    def __init__(self):
        self.task_dependencies = {}
        self.task_timeout = 180  # seconds

    def assign_task(self, agent_name: str, task: dict, depends_on: Optional[list] = None):
        if depends_on:
            if self._creates_cycle(task["id"], depends_on):
                raise ValueError(
                    f"Circular dependency detected: {task['id']} → {depends_on}"
                )
            self.task_dependencies[task["id"]] = depends_on
        return ag.assign_task(
            agent=agent_name,
            task=task,
            timeout_seconds=self.task_timeout,
            on_timeout="fail_gracefully"
        )

    def _creates_cycle(self, task_id: str, depends_on: list) -> bool:
        # A cycle exists only if task_id is reachable from one of its own
        # dependencies. Revisiting a shared dependency (a diamond) is not a cycle.
        visited = set()
        stack = list(depends_on)
        while stack:
            node = stack.pop()
            if node == task_id:
                return True
            if node in visited:
                continue
            visited.add(node)
            stack.extend(self.task_dependencies.get(node, []))
        return False
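The timeout in the class above is what turns an indefinite wait into a recoverable event. As a rough illustration of what failing gracefully on timeout can mean in practice, here is a sketch using the standard library's `asyncio.wait_for`; `run_with_timeout` and the fallback value are hypothetical, and this is not a description of how AgentKit handles `on_timeout` internally.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_with_timeout(
    task: Callable[[], Awaitable[Any]],
    timeout_seconds: float,
    fallback: Any,
) -> Any:
    """Run a task; return `fallback` instead of hanging if it takes too long."""
    try:
        return await asyncio.wait_for(task(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # The task is cancelled by wait_for; report a usable fallback result.
        return fallback


async def slow_agent() -> str:
    await asyncio.sleep(1)  # stand-in for an agent that never responds in time
    return "done"


result = asyncio.run(
    run_with_timeout(slow_agent, timeout_seconds=0.05, fallback={"status": "timed_out"})
)
print(result)
```

Returning an explicit fallback lets the orchestrator decide whether to retry, reassign, or skip the task, rather than blocking every dependent agent.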
Incident Response Checklist
Detection (0–5 min): Confirm which agents are affected. Identify the failure pattern from logs. Assess whether the issue is expanding.
Containment (5–15 min): Stop or isolate the failing agents. Lock shared context writes if contamination is suspected.
Recovery (15–60 min): Apply the pattern-specific recovery code above. Validate in a staging environment first if possible. Resume incomplete tasks from the last stable checkpoint.
Post-incident: Identify root cause and add the fix to your agent configuration or agents.md. Add monitoring to detect the same pattern earlier next time. Document the incident and what prevented faster recovery.
Key Takeaways
Multi-agent coordination failures are harder to diagnose than single-agent errors because the problem is relational—it exists between agents, not inside any one of them. The seven patterns in this guide cover the most common forms these failures take, and the recovery code gives you a starting point rather than a blank screen.
More importantly, each incident is an opportunity to make the system more resilient. Add timeouts after a deadlock. Enable write locking after context contamination. Add circular dependency checks so deadlocks are caught before they start. The system gets more reliable with each improvement, which is worth keeping in mind when you're in the middle of an incident and the path forward isn't obvious.
Thank You for Reading
Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.