
Agents & Manager · 2026-04-09 · Intermediate

AgentKit 2.0 Multi-Agent Collaboration Failures: Complete Recovery Guide

Diagnose and recover from AgentKit 2.0 multi-agent failures—deadlocks, orchestrator loops, silent sub-agent failures, and context contamination. Includes production-ready code for fault-tolerant agent system design.



Single-agent problems are straightforward. Multi-agent problems aren't. When you move to AgentKit 2.0's orchestrated systems, a new class of failures appears: one agent waiting indefinitely for another that's waiting for it back, an orchestrator issuing the same task assignment in a loop, a sub-agent completing silently without doing anything useful, or two agents writing to shared context simultaneously and corrupting each other's data.

These failures are distinct from the runtime errors covered in other guides. They emerge from coordination—from the relationships between agents rather than from any single agent's behavior.

This guide documents seven failure patterns, how to diagnose each one from logs, how to recover in production, and how to design systems that are more resilient to these failures from the start.

The Seven Coordination Failure Patterns

Naming the pattern accurately is the fastest path to fixing it.

Pattern 1: Agent Deadlock

Agent A is waiting for Agent B to complete. Agent B is waiting for Agent A's output. Both wait indefinitely. The system appears frozen.

How to spot it: Two or more agents show waiting_for_agent status at the same timestamp in the activity log.
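
Once you know which agent each waiting agent is blocked on, confirming a deadlock comes down to finding a cycle in that "waits on" relationship. The sketch below is SDK-independent and assumes you have already extracted a plain dict from the log that maps each waiting agent to the agent it is waiting for. The function name and the dict shape are illustrative, not part of AgentKit.

```python
from typing import Dict, List, Optional


def find_wait_cycle(waits_on: Dict[str, str]) -> Optional[List[str]]:
    """Return one cycle of agents waiting on each other, or None.

    `waits_on` maps a waiting agent to the agent it is blocked on
    (hypothetical shape, built by hand from the activity log).
    """
    for start in waits_on:
        path = []
        seen = set()
        node = start
        # Follow the chain of waits until it ends or revisits an agent
        while node in waits_on and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_on[node]
        if node in seen:
            # The walk re-entered itself at `node`; the cycle starts there
            return path[path.index(node):]
    return None


print(find_wait_cycle({"agent_a": "agent_b", "agent_b": "agent_a"}))
# ['agent_a', 'agent_b']
```

A chain that ends at an agent which is not waiting on anyone returns None: that agent is slow or stuck for another reason, which is a timeout problem rather than a deadlock.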

Pattern 2: Orchestrator Decision Loop

The Manager agent keeps generating task assignments to sub-agents, but no tasks actually progress. The same assign_task action appears repeatedly in the orchestrator's log.
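
A loop shows up as a long run of identical consecutive actions, parameters included. As a minimal sketch, assuming each log action has been converted to a JSON-serializable dict (the field names here are hypothetical), the longest such run can be measured like this:

```python
import json
from itertools import groupby


def longest_repeat_run(actions):
    """Return (run_length, action) for the longest run of identical consecutive actions."""
    best_len, best_action = 0, None
    # Serialize with sorted keys so parameter order does not affect equality
    for _, group in groupby(actions, key=lambda a: json.dumps(a, sort_keys=True)):
        run = list(group)
        if len(run) > best_len:
            best_len, best_action = len(run), run[0]
    return best_len, best_action


log = [{"type": "assign_task", "task": "t1", "agent": "writer"}] * 4 + [{"type": "read_status"}]
print(longest_repeat_run(log)[0])
# 4
```

Comparing full parameters rather than only the action type avoids flagging an orchestrator that is legitimately assigning many different tasks in a row.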

Pattern 3: Silent Sub-Agent Failure

A sub-agent returns completed status but its output is empty or nonsensical. The orchestrator trusts the result and continues, compounding the error downstream.
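
A cheap guard is to validate every "completed" result before the orchestrator acts on it. The check below is a sketch under stated assumptions: the result is a dict with `status` and `output` keys, and 20 characters is an arbitrary minimum you would tune per task. Neither is taken from the AgentKit API.

```python
def is_silent_failure(result, min_chars=20):
    """Flag results that claim success but carry little or no output.

    `result` is assumed to look like {"status": ..., "output": ...}.
    """
    if result.get("status") != "completed":
        return False  # an explicit failure is not a silent one
    output = result.get("output")
    if output is None:
        return True
    text = output if isinstance(output, str) else str(output)
    return len(text.strip()) < min_chars


print(is_silent_failure({"status": "completed", "output": "   "}))
# True
```

A length check only catches empty or near-empty output. Nonsensical but long output needs a task-specific validator, such as schema validation for structured results.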

Pattern 4: Context Contamination

Multiple agents write to shared context simultaneously. One agent's data overwrites another's, producing inconsistent or corrupted state that subsequent agents reason from incorrectly.
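
The underlying problem is a lost update: an agent writes a value based on a read that is already stale. One common remedy is a versioned write that is rejected when the key has changed since the writer last read it. The class below is a generic illustration of that idea, not AgentKit's shared context implementation.

```python
import threading


class VersionedContext:
    """Shared key-value store that rejects writes based on stale reads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); unseen keys are (0, None)."""
        with self._lock:
            return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Write only if the key is still at the version the caller read."""
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False  # someone else wrote first; caller must re-read
            self._data[key] = (current_version + 1, value)
            return True


ctx = VersionedContext()
version, _ = ctx.read("summary")
print(ctx.write("summary", "draft from agent A", version))  # True
print(ctx.write("summary", "draft from agent B", version))  # False: stale version
```

The second writer is forced to re-read and merge instead of silently overwriting the first writer's data.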

Pattern 5: Cascade Failure

One agent's failure propagates to dependent agents, which fail because their expected inputs never arrived, which causes their dependents to fail, and so on.
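
To gauge how far a failure will spread, walk the dependency graph forward from the failed task. This sketch assumes a hand-built dict mapping each task to the tasks it depends on; the task names are made up for illustration.

```python
from collections import deque


def affected_by_failure(depends_on, failed):
    """Return every task that directly or transitively depends on `failed`."""
    # Invert the graph: dependency -> tasks that need it
    dependents = {}
    for task, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(task)

    affected, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


graph = {"analyze": ["collect"], "report": ["analyze"], "audit": []}
print(sorted(affected_by_failure(graph, "collect")))
# ['analyze', 'report']
```

Tasks outside the returned set can keep running, which helps when deciding what to pause during containment.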

Pattern 6: Resource Contention

Multiple agents hit the same external resource—an API, file, or database—simultaneously. Rate limits are triggered, file locks conflict, or writes collide.
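
A simple mitigation is to route every call to a contended resource through a shared concurrency limit. The wrapper below uses only the Python standard library and is a sketch, not an AgentKit feature. It assumes the agents run as threads in one process, so a multi-process deployment would need an external limiter instead.

```python
import threading
import time


class LimitedResource:
    """Caps how many callers may use a shared resource at the same time."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.active = 0  # callers currently inside the resource
        self.peak = 0    # highest concurrency observed

    def call(self, fn, *args):
        with self._slots:  # blocks while all slots are taken
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self.active -= 1


api = LimitedResource(max_concurrent=2)
threads = [threading.Thread(target=api.call, args=(time.sleep, 0.02)) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(api.peak <= 2)
# True
```

The `peak` counter is there only to make the limit observable. In production you would more likely export it as a metric.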

Pattern 7: Context Window Exhaustion

In long-running sessions, an agent's context window fills up and early instructions get truncated. Agent behavior degrades progressively as critical context is lost.

Diagnosis: Identifying the Pattern from Logs

Pull the activity log

import antigravity_sdk as ag
 
# Enable verbose logging for development
ag.set_log_level("DEBUG")
 
# Load and inspect session
session = ag.Session.load("YOUR_SESSION_ID")
print(session.get_activity_log(verbose=True))
 
# Inspect tool call history per agent
for agent_name, agent in session.agents.items():
    print(f"\n=== {agent_name} ===")
    for call in agent.tool_calls:
        print(f"  {call.timestamp}: {call.tool} -> {call.status}")

Deadlock signature: Multiple agents in waiting_for_agent state at the same timestamp

Loop signature: The same tool call (especially assign_task) repeated many times with the same parameters

Silent failure signature: Agent shows completed but output length is zero or near-zero

Cascade signature: Errors appear at consecutive timestamps across different agents
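
The deadlock and cascade signatures can also be checked mechanically once log entries are in a uniform shape. The helpers below assume each entry is a dict with `timestamp`, `agent`, and `status` keys, and that failures are logged with an `error` status. That shape is an assumption for this sketch, so adapt the field names to whatever your log export actually contains.

```python
from collections import defaultdict


def simultaneous_waiters(entries, min_agents=2):
    """Map timestamp -> agents that were all in waiting_for_agent at that moment."""
    waiting = defaultdict(set)
    for entry in entries:
        if entry["status"] == "waiting_for_agent":
            waiting[entry["timestamp"]].add(entry["agent"])
    return {ts: sorted(agents) for ts, agents in waiting.items() if len(agents) >= min_agents}


def error_chain(entries):
    """List agents in the order they first reported an error."""
    errors = sorted((e for e in entries if e["status"] == "error"), key=lambda e: e["timestamp"])
    chain = []
    for entry in errors:
        if entry["agent"] not in chain:
            chain.append(entry["agent"])
    return chain
```

The first agent in the error chain is the most likely origin of a cascade; the rest are usually victims.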

Check for stuck waiting states

for agent_name, agent in session.agents.items():
    state = agent.get_state()
    print(f"{agent_name}: {state.status} (elapsed: {state.elapsed_seconds}s)")
    # Match both "waiting" and the "waiting_for_agent" status shown in the activity log
    if state.status.startswith("waiting") and state.elapsed_seconds > 300:
        print("  ⚠️ Possible deadlock or timeout: waiting 5+ minutes")


Recovery: Pattern-by-Pattern Procedures

Resolving a deadlock

def resolve_deadlock(session_id: str, deadlocked_agents: list[str]):
    session = ag.Session.load(session_id)
 
    # Step 1: Force-stop deadlocked agents
    for agent_name in deadlocked_agents:
        agent = session.agents[agent_name]
        agent.force_stop(reason="deadlock_detected")
        print(f"✓ Stopped {agent_name}")
 
    # Step 2: Release all resource locks
    session.release_all_locks()
    print("✓ Released all locks")
 
    # Step 3: Restart in dependency order (non-circular)
    task_graph = session.get_task_dependency_graph()
    restart_order = task_graph.topological_sort()
 
    for task in restart_order:
        if task.name in deadlocked_agents:
            print(f"→ Restarting {task.name}...")
            task.restart(clear_waiting_state=True)

Fixing an orchestrator loop

def fix_orchestrator_loop(session_id: str, orchestrator_name: str):
    session = ag.Session.load(session_id)
    orchestrator = session.agents[orchestrator_name]
 
    # Diagnose the loop
    recent_actions = orchestrator.get_recent_actions(n=20)
    action_types = [a.type for a in recent_actions]
 
    from collections import Counter
    counts = Counter(action_types)
    print("Last 20 action breakdown:", counts)
 
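    # If the same action type appears 10+ times, treat it as a loop.
    # The `counts and` guard avoids an IndexError when there are no recent actions.
    if counts and counts.most_common(1)[0][1] >= 10: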
        print("⚠️ Loop detected. Clearing context and restarting...")
        orchestrator.clear_context(preserve_initial_instructions=True)
        orchestrator.restart_with_new_context(
            additional_context="Previous task assignments have been reset. "
                               "Check which tasks are already completed and "
                               "resume from the first incomplete task."
        )

Repairing context contamination

def fix_context_contamination(session_id: str):
    session = ag.Session.load(session_id)
 
    # Pause all agents
    for agent in session.agents.values():
        agent.pause()
 
    # Review write history
    history = session.shared_context.get_write_history()
    print(f"Write history ({len(history)} entries):")
    for entry in history[-10:]:
        print(f"  {entry.timestamp} | {entry.agent} → {entry.key}: {str(entry.value)[:50]}")
 
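    # Restore to last known-good state.
    # The offset of 5 is a placeholder: pick the entry just before contamination started.
    if len(history) < 5:
        raise RuntimeError("Not enough write history to roll back 5 entries")
    last_good = history[-5]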
    session.shared_context.restore(last_good.id)
    print(f"✓ Context restored to {last_good.timestamp}")
 
    # Enable write locking to prevent recurrence
    session.shared_context.enable_write_locking()
    for agent in session.agents.values():
        agent.resume()

Fault-Tolerant Design: Preventing Recurrence

Recovery is reactive. Good design prevents the failure in the first place.

Deadlock-safe task assignment

class DeadlockSafeOrchestrator:
    def __init__(self):
        self.task_dependencies = {}
        self.task_timeout = 180  # seconds
 
    def assign_task(self, agent_name: str, task: dict, depends_on: list = None):
        if depends_on:
            if self._creates_cycle(task["id"], depends_on):
                raise ValueError(
                    f"Circular dependency detected: {task['id']} → {depends_on}"
                )
            self.task_dependencies[task["id"]] = depends_on
 
        return ag.assign_task(
            agent=agent_name,
            task=task,
            timeout_seconds=self.task_timeout,
            on_timeout="fail_gracefully"
        )
 
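    def _creates_cycle(self, task_id: str, depends_on: list) -> bool:
        # Adding task_id -> depends_on creates a cycle only if task_id is
        # already reachable from one of its new dependencies.
        visited = set()
        def dfs(node):
            if node == task_id:
                return True
            if node in visited:
                return False  # already explored; shared dependencies are not cycles
            visited.add(node)
            return any(dfs(dep) for dep in self.task_dependencies.get(node, []))
        return any(dfs(dep) for dep in depends_on)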

Timeout and fallback configuration

AGENT_CONFIG = {
    "data_collector": {
        "timeout_seconds": 120,
        "retry_count": 2,
        "on_failure": "skip_and_continue",
        "fallback_value": {"status": "skipped", "data": []}
    },
    "analyzer": {
        "timeout_seconds": 300,
        "retry_count": 1,
        "on_failure": "escalate_to_human",
        "fallback_value": None
    },
    "reporter": {
        "timeout_seconds": 60,
        "retry_count": 3,
        "on_failure": "use_template",
        "fallback_value": "Analysis unavailable due to upstream error."
    }
}
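
How these settings are enforced depends on your runtime. As a rough sketch of the retry and fallback half only, the function below applies one policy entry to a plain Python callable. The interpretation of `on_failure` is an assumption made for illustration, and `timeout_seconds` is deliberately not enforced here because that requires cooperation from the agent runtime.

```python
def run_with_policy(task, policy):
    """Run `task` under a retry/fallback policy shaped like an AGENT_CONFIG entry.

    Assumed semantics: retry_count is the number of extra attempts after the
    first; "escalate_to_human" raises; any other on_failure value returns
    the configured fallback_value.
    """
    attempts = policy["retry_count"] + 1
    last_error = None
    for _ in range(attempts):
        try:
            return task()
        except Exception as exc:  # broad on purpose: any failure counts as an attempt
            last_error = exc
    if policy["on_failure"] == "escalate_to_human":
        raise RuntimeError("Task failed after retries; escalation required") from last_error
    return policy["fallback_value"]


def always_fails():
    raise ConnectionError("upstream unavailable")


collector_policy = {
    "retry_count": 2,
    "on_failure": "skip_and_continue",
    "fallback_value": {"status": "skipped", "data": []},
}
print(run_with_policy(always_fails, collector_policy))
# {'status': 'skipped', 'data': []}
```

Returning an explicit fallback keeps downstream agents running on a known, labeled value instead of waiting for input that will never arrive.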

Context length management

class ContextManager:
    MAX_CONTEXT_TOKENS = 100_000
 
    def __init__(self, agent):
        self.agent = agent
        self.message_count = 0
 
    def add_message(self, message: str):
        self.agent.context.append(message)
        self.message_count += 1
        if self.message_count % 20 == 0:
            self._compress_if_needed()
 
    def _compress_if_needed(self):
        current_tokens = self.agent.estimate_context_tokens()
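        # Only compress when there is something older than the last 10 messages
        if current_tokens > self.MAX_CONTEXT_TOKENS * 0.8 and len(self.agent.context) > 10: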
            old_messages = self.agent.context[:-10]
            summary = self.agent.summarize(old_messages)
            self.agent.context = [
                f"[Summary of previous work] {summary}"
            ] + self.agent.context[-10:]
            print(f"Context compressed: {len(old_messages)} messages → 1 summary")

Production Incident Response Checklist

Detection (0–5 min): Confirm which agents are affected. Identify the failure pattern from logs. Assess whether the issue is expanding.

Containment (5–15 min): Stop or isolate the failing agents. Lock shared context writes if contamination is suspected.

Recovery (15–60 min): Apply the pattern-specific recovery code above. Validate in a staging environment first if possible. Resume incomplete tasks from the last stable checkpoint.

Post-incident: Identify root cause and add the fix to your agent configuration or agents.md. Add monitoring to detect the same pattern earlier next time. Document the incident and what prevented faster recovery.


Key Takeaways

Multi-agent coordination failures are harder to diagnose than single-agent errors because the problem is relational—it exists between agents, not inside any one of them. The seven patterns in this guide cover the most common forms these failures take, and the recovery code gives you a starting point rather than a blank screen.

More importantly, each incident is an opportunity to make the system more resilient. Add timeouts after a deadlock. Enable write locking after context contamination. Add circular dependency checks so deadlocks are rejected before they start. The system gets more reliable with each improvement, which is worth keeping in mind when you're in the middle of an incident and the path forward isn't obvious.
