Building Self-Healing Antigravity Agents — Detection, Diagnosis, and Recovery in Production
A practical three-layer pattern for keeping Antigravity agents alive in production: signal-based detection, deterministic diagnosis, and graduated recovery — with full AgentKit 2.0 code and the production traps I learned the hard way.
3:14 a.m. The Slack notification woke me up: "The payment-support agent isn't responding. 27 tickets queued."
I opened my laptop, half-asleep. Antigravity's Manager Surface was scrolling tool_call_failed: stripe.list_invoices over and over. The cause: a brief Stripe rate-limit. The agent had failed once, given up, and returned the same error to every ticket since.
That night I realised something that changed how I build agents: if your agent can't recover from common failures on its own, your sleep will keep paying the bill. This article is the design I rebuilt over the next six months — concrete code that runs inside Antigravity's AgentKit 2.0, plus the traps I walked into in production.
Why "self-healing" is the right framing
Once you run agents in production, the first uncomfortable truth shows up quickly: most failures aren't bugs in your code, they're transient hiccups in dependencies you don't control.
I categorised the 1,247 errors my three production agents (payment support, code review, SEO reporting) hit over the last 90 days. The breakdown was:
External API rate limits and timeouts: 671 (53.8%)
So nine out of ten failures were the kind that fix themselves if you wait, or fix themselves if you switch to a different path. And yet my agent went silent the moment any of them hit.
The instinct here is to "just add retries". Don't. Naive retries cause cascades. Retrying five times against a rate-limited Stripe API extends the penalty. Retrying against a real bug multiplies the same error 100x in your logs.
What you actually need is an agent that can diagnose what's happening and choose a recovery strategy that fits. I call this a self-healing agent.
The three-layer model: detection, diagnosis, recovery
The design I landed on splits responsibility into three layers. Resist the urge to put it all in one giant try-except:
Layer 1: Detection — answers "is the agent currently healthy?" by collecting health signals
Layer 2: Diagnosis — answers "what kind of failure is this?" by classifying the error and choosing a strategy
Layer 3: Recovery — answers "what do we do about it?" by executing a specific recovery — retry, failover, degrade, circuit-break
The split matters because each layer can be tested and improved independently. Tweaking detection alone lets you catch silent failures (the agent returns nonsense but no error). Adding new patterns to diagnosis lets you handle a new class of outage without touching recovery code.
The deeper reason: recovery code is the code you least want to change at 3 a.m. If you push "let's just bump the retry count to 10" while you're tired, you'll cause a different incident the next day. With detection and diagnosis as separate layers, you can absorb new failure shapes upstream and leave recovery stable.
✦
Thank you for reading this far.
Continue Reading
What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.
WHAT YOU'LL LEARN
✦If your agent has been silently failing in production while users wait, you'll be able to cut those silent incidents to near-zero with a three-layer detection pipeline
✦You'll learn how to wire detection, diagnosis, and automatic recovery into Antigravity AgentKit 2.0 with code patterns you can paste into your own product today
✦You'll come away with a feedback loop for tracing every recovery event, so your weekly review becomes the moment you stop being woken up at 3 a.m.
Secure payment via Stripe · Cancel anytime
✦
Unlock This Article
Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.
Layer 1: Health signals — what to measure, when to call it abnormal
Detection starts with picking the right signals. The four I currently track:
Tool-call success rate over the last 10 minutes — below 80% triggers warning, below 50% triggers critical
P95 response latency over the last 10 minutes — three times the baseline triggers warning
Empty-response rate over the last 100 calls — the LLM returning "" or null. Above 5% triggers warning
Task completion rate over the last hour — proportion of tasks the agent marked "done". Below 70% triggers critical
Numbers 3 and 4 are the ones traditional monitoring misses. Conventional server monitoring assumes "no errors = healthy", but agents fail silently all the time. Without watching task completion, you won't notice that "no errors" actually means "no answers".
Here's the AgentKit 2.0–compatible signal collector I run today. Paste-able if you have Redis available:
# requirements: antigravity-agentkit>=2.0.0, redis>=5.0.0# env vars: REDIS_URL (e.g. redis://localhost:6379/0)import timeimport osfrom typing import Literalimport redisHealthStatus = Literal["healthy", "warning", "critical"]class HealthSignals: """Aggregates health signals into Redis and exposes a consolidated state. Why Redis instead of an in-memory deque: - Multi-process agents share state across workers - The agent process can crash without losing the recent window A small in-memory implementation works fine for single-process setups. """ def __init__(self, agent_id: str, redis_url: str | None = None): self.agent_id = agent_id self.r = redis.from_url(redis_url or os.environ["REDIS_URL"]) self.window_seconds = 600 # rolling 10-minute window def record_tool_call(self, tool: str, success: bool, latency_ms: float) -> None: ts = time.time() key = f"agent:{self.agent_id}:tool_calls" member = f"{int(ts*1000)}:{success}:{latency_ms}:{tool}" self.r.zadd(key, {member: ts}) # auto-evict entries older than the window self.r.zremrangebyscore(key, 0, ts - self.window_seconds) def record_empty_response(self, was_empty: bool) -> None: key = f"agent:{self.agent_id}:empty_responses" self.r.lpush(key, "1" if was_empty else "0") self.r.ltrim(key, 0, 99) def status(self) -> HealthStatus: success_rate = self._success_rate() p95 = self._p95_latency() empty_rate = self._empty_rate() if success_rate < 0.5 or empty_rate > 0.20: return "critical" if success_rate < 0.8 or p95 > self._baseline_p95() * 3 or empty_rate > 0.05: return "warning" return "healthy" def _success_rate(self) -> float: # Implementation omitted for brevity. In production, track per-tool rates separately. ...
The expected behaviour: while the agent runs normally, status() returns "healthy". The moment an external API starts returning rate-limit errors, it shifts to "warning". With sustained tool failures, it becomes "critical".
The interesting design choice is when to read this. I read it at two points: at task start, and immediately before each tool call. If the agent enters a task in critical state, I refuse new write operations and switch to read-only mode until things settle.
Layer 2: Deterministic diagnosis — pattern-match before you reason
Once detection screams, you need to classify what's happening. Don't ask the LLM to do this. There are three reasons:
The LLM itself may be the unhealthy dependency
Inference cost balloons exactly when you most need to control cost
Non-deterministic diagnosis makes the system impossible to improve
I use a deterministic diagnoser instead — regex patterns plus external status-page lookups:
# Diagnosis layer: maps error patterns + external service state to a recovery strategyimport reimport httpxfrom dataclasses import dataclassfrom enum import Enumclass FailurePattern(Enum): RATE_LIMIT = "rate_limit" SERVICE_DOWN = "service_down" TIMEOUT = "timeout" EMPTY_LLM_RESPONSE = "empty_llm" JSON_PARSE_ERROR = "json_parse" UNKNOWN = "unknown"@dataclassclass Diagnosis: pattern: FailurePattern confidence: float # 0.0 to 1.0 recovery_strategy: str cooldown_seconds: intclass Diagnoser: RATE_LIMIT_PATTERNS = [ re.compile(r"rate.?limit", re.I), re.compile(r"too many requests", re.I), re.compile(r"429", re.I), ] TIMEOUT_PATTERNS = [ re.compile(r"timeout", re.I), re.compile(r"deadline exceeded", re.I), ] def diagnose(self, error_message: str, tool_name: str) -> Diagnosis: if any(p.search(error_message) for p in self.RATE_LIMIT_PATTERNS): return Diagnosis( pattern=FailurePattern.RATE_LIMIT, confidence=0.95, recovery_strategy="exponential_backoff_with_jitter", cooldown_seconds=60, ) if any(p.search(error_message) for p in self.TIMEOUT_PATTERNS): if self._service_is_degraded(tool_name): return Diagnosis( pattern=FailurePattern.SERVICE_DOWN, confidence=0.85, recovery_strategy="failover_to_secondary", cooldown_seconds=300, ) return Diagnosis( pattern=FailurePattern.TIMEOUT, confidence=0.70, recovery_strategy="retry_with_longer_timeout", cooldown_seconds=10, ) # No pattern matched — escalate return Diagnosis( pattern=FailurePattern.UNKNOWN, confidence=0.0, recovery_strategy="escalate_to_human", cooldown_seconds=0, ) def _service_is_degraded(self, tool_name: str) -> bool: """Check the upstream service's status page. Examples: status.stripe.com, status.openai.com, status.anthropic.com.""" try: r = httpx.get("https://status.stripe.com/api/v2/status.json", timeout=2.0) return r.json().get("status", {}).get("indicator") != "none" except Exception: return False # treat as healthy when status page itself is down
The output of diagnose() is a recovery_strategy string consumed by the next layer.
You might wonder: why regex and status-page checks instead of letting the LLM classify? Because diagnosis should complete in a few hundred milliseconds, for fractions of a cent, and produce the same answer every time. Those are exactly the properties LLMs are bad at when they're already unhealthy. I keep LLM-based reasoning for the final UNKNOWN escalation only.
Layer 3: Recovery strategies — graduated fallback
The recovery layer executes whichever strategy diagnosis chose. I currently support six:
exponential_backoff_with_jitter — for rate limits. 1s, 2s, 4s, 8s ... with up to 30% random jitter
failover_to_secondary — switch providers (Anthropic → Gemini, or vice versa)
retry_with_longer_timeout — for slow but not dead services. Doubles the timeout, retries up to 3x
circuit_break — stop calling for 15 minutes, then half-open a single probe
escalate_to_human — Slack me, agent stays silent
The non-negotiable is that every strategy must be idempotent and side-effect-safe. If recovery fires while the agent is creating a Stripe invoice, you must not double-charge the user. I require an idempotency_key on every write tool call.
Expected runtime behaviour: rate-limit error → wait 1s → 2s → 4s → 8s → 16s → give up. Service-down → instant failover to the secondary provider. Sustained failure across both → circuit breaker opens, half-opens 15 minutes later for a single probe.
Wiring it up: the AgentKit 2.0 middleware
Now we bind the three layers into a single middleware that sits in front of every tool call:
Replaying the original Stripe rate-limit incident with this middleware in place: the agent doesn't go silent. It enters exponential_backoff_with_jitter, eventually fails over to Gemini if Stripe stays unhealthy, and only pages me when the circuit breaker has been open long enough that human attention is warranted.
After three months in production: nighttime Slack pages dropped from 6.2/week to 0.4/week.
Tracing recoveries so the agent gets smarter over time
The system above survives outages — but it stops there unless you record what happened. I emit a structured trace event for every recovery, with:
timestamp
failed tool name
raw error message
diagnosis result (pattern + confidence)
recovery strategy executed
recovery duration
final outcome (success / escalation)
Reviewing this weekly is where the real improvement happens. Tools with a high UNKNOWN rate are candidates for new regex patterns. Strategies that consistently fail to recover within their cooldown window are candidates for tuning.
I pipe these traces into Looker Studio and watch failure rate, recovery success, and mean recovery time per tool. Operating an agent fleet starts to feel a lot like SRE error-budget thinking. For more on the SLO side, see Designing SRE-style SLOs and error budgets for Antigravity AI agents.
A real incident: how the design held up
Theory only goes so far. On 2026-03-18, OpenAI had a 47-minute partial outage in the us-east region. About 60% of requests timed out; the rest succeeded with elevated latency. This is the kind of failure self-healing was designed for — let me walk through what actually happened to my code-review agent.
At 14:23 UTC, the first timeouts started. The detection layer noticed the P95 latency tripling within two minutes and shifted the agent into warning. New tasks could still be accepted, but the middleware started caching health status more aggressively (5-second window).
At 14:25 UTC, the success rate dropped to 64%. State went critical. Write tools (the agent's PR-comment posting capability) were blocked. Read tools (file reading, diff inspection) continued to be served — the agent could still complete the analysis portion of its work, just not publish the result.
At 14:27 UTC, diagnosis kicked in. The first three timeouts in a row triggered a service_is_degraded check against the OpenAI status page, which had not yet been updated. The diagnoser fell back to the local TIMEOUT pattern with retry_with_longer_timeout. After two retries failed at 60s timeouts, the executor escalated to failover_to_secondary — which switched the agent to Anthropic Claude.
From 14:27 to 15:10 UTC, the agent ran entirely on the secondary provider. Quality was slightly different (Claude tends to be more verbose in code review), but no users were blocked. The Anthropic side had been kept warm by the 15-minute probe, so the failover added only 80ms of cold-start latency rather than the 4 seconds we saw in earlier incidents.
At 15:10 UTC, OpenAI started returning successful responses again. The circuit breaker for OpenAI was still half-open — it allowed one probe through, the probe succeeded, and the agent gradually shifted traffic back to the primary provider.
I noticed the incident the next morning when I reviewed the trace dashboard. I was not paged. No tickets had piled up. The total user-visible impact was a 2-minute window during the initial detection where four requests took 12+ seconds before the failover kicked in. Two users mentioned the slow response in passing — neither escalated.
This is the case I write incident reports about: not because anything went wrong, but because the system caught a 47-minute upstream outage and absorbed it almost invisibly. That moment is what justifies all the complexity above. Without it, those 47 minutes would have been a queue of failed PR reviews and a Sunday afternoon spent firefighting.
Hooking into existing observability stacks
Most teams already have Datadog, New Relic, Honeycomb, or Grafana. Don't replace them — extend them. The recovery traces are valuable precisely because they slot into what your team already watches.
The key spans I emit per recovery event:
agent.tool_call — top-level span for every tool invocation
agent.tool_call.error — child span when the call fails, with the raw error and tool metadata
agent.diagnosis — child span carrying the pattern, confidence, and chosen strategy
agent.recovery — child span for the actual recovery execution, with attempt count, total wait time, and final outcome
In OpenTelemetry-compatible format, this lets ops engineers filter incidents by recovery strategy, count escalate_to_human events per service, and correlate agent behaviour with upstream provider outages. I keep a Grafana dashboard with three panels: "incidents per hour by tool", "recovery strategies invoked (stacked)", and "MTTR distribution". When something looks wrong, it usually shows up here first.
Honeycomb in particular is a good fit because the high cardinality of tool_name and recovery_strategy plays to its strengths. Slicing by recovery_strategy:circuit_break for the last 24 hours is one click. Doing the same in a metric-only system requires pre-aggregation that you probably did not set up in advance.
A note on cost during recovery
Self-healing is not free. Every retry costs tokens, every failover routes to a different (potentially more expensive) provider, and every unnecessary circuit-break leaves user requests unanswered. The instinct of "just retry harder" is the failure mode I see most often when teams adopt this pattern.
I track three cost-side metrics alongside the reliability ones:
Cost per incident — total tokens (and dollars) spent during recovery for each failed task. If this spikes, your strategies are too aggressive
Wasted retries — calls that succeeded after retry but where the original error was a real bug, not a transient. These should approach zero. If they don't, you're masking real bugs
Failover cost differential — how much more expensive your secondary provider is. If you fail over often to a 2x-priced model, the math may argue for fixing your primary instead
A practical guardrail: set a per-task recovery budget. If recovery has already consumed, say, 4x the original task's expected cost, give up and escalate. Without this, a flapping dependency can run up a serious bill before anyone notices.
# Budgeted executor — caps how much cash we spend recovering from a single taskclass BudgetedExecutor(RecoveryExecutor): def __init__(self, primary_provider: str, max_recovery_cost_usd: float = 0.10): super().__init__(primary_provider) self.max_cost = max_recovery_cost_usd async def execute(self, strategy, operation, *args, **kwargs): spent = kwargs.pop("_recovery_spent_usd", 0.0) if spent >= self.max_cost: await self._notify_slack( f"Recovery budget exhausted: ${spent:.4f} / ${self.max_cost}", args, kwargs, ) raise HumanEscalationRequired("Recovery budget exhausted.") return await super().execute(strategy, operation, *args, **kwargs)
This kind of budget is the difference between "self-healing" and "self-bankrupting".
Testing self-healing logic without breaking production
The most awkward part of this work is testing it. You cannot easily reproduce a Stripe rate-limit on a sandbox key, and "wait for it to happen in production" is exactly the wrong feedback loop.
What I do instead is simulate failures at the middleware boundary. I have a FailureInjector that I enable only in staging — and a small set of integration tests that walk through every diagnosis path:
# Failure injection harness for testing recovery pathsimport pytestimport asyncioclass FailureInjector: # Wraps any tool to inject specific failure modes during testing def __init__(self, tool, failure_mode=None, fail_count=1): self.tool = tool self.failure_mode = failure_mode self.fail_count = fail_count self.calls = 0 async def invoke(self, *args, **kwargs): self.calls += 1 if self.failure_mode and self.calls <= self.fail_count: if self.failure_mode == "rate_limit": raise Exception("429: rate limit exceeded") if self.failure_mode == "timeout": raise asyncio.TimeoutError("deadline exceeded") if self.failure_mode == "empty": return "" return await self.tool.invoke(*args, **kwargs)@pytest.mark.asyncioasync def test_rate_limit_recovers_via_backoff(): tool = FailureInjector(stripe_list_invoices, failure_mode="rate_limit", fail_count=2) middleware = SelfHealingMiddleware(...) result = await middleware.with_recovery(tool.invoke, customer_id="cus_123") assert result is not None assert tool.calls == 3 # failed twice, succeeded on third attempt
What I gained from this discipline: confidence to ship recovery changes without staging an actual rate-limit incident. What I lost: a few hours building the harness. The trade was very much worth it.
A subtler benefit is that writing the failure-injection tests forces you to enumerate failure modes you had not thought about. The first time I wrote tests for empty-LLM responses, I realised my code did not even have a code path for that — it just propagated None and broke downstream. That bug never reached production because the test made it visible.
Choosing what to monitor first
If you are starting fresh, do not try to wire up all four health signals on day one. Start with the one that maps most cleanly to your agent's purpose:
For agents that produce content (reports, summaries, code), watch empty-response rate first. Quality regressions hide there
For agents that orchestrate workflows (multi-step tasks, long-running jobs), watch task completion rate first. Silent stalls hide there
For agents that integrate with external APIs (Stripe, Slack, GitHub), watch tool-call success rate first. Rate limits and outages hide there
For agents users interact with synchronously, watch P95 latency first. The user experience hides there
Pick one signal, instrument it cleanly, and run for two weeks before adding the next. Adding all four at once means you will tune none of them well — and tuning is where the value comes from. The threshold values I cited (80%, 50%, 5%, 70%) work for my workload, not yours. Watch the actual distribution before fixing thresholds.
The same principle applies to recovery strategies. Implement exponential_backoff_with_jitter first, since it covers more than half of all incidents. Add failover_to_secondary second once you have a working secondary integration. Defer circuit_break until you have actual traces showing repeated failure cascades; without that data, you will set the wrong cooldown windows and either trip too eagerly or never trip at all.
When NOT to add self-healing
I want to argue against self-healing for a moment, because the worst case is to layer all of this on an agent that does not need it.
You do not need this design when:
The agent runs on demand, with a human watching the result. If you are using Antigravity interactively for code edits, the human is the recovery loop
The agent's failure mode is a bad answer rather than no answer. Self-healing will not help quality issues — that is an evaluation problem
The agent's tasks are cheap to retry from scratch. If the worst case is "redo the entire task", retry-from-scratch is simpler than mid-task recovery
You have not run the agent in production long enough to know what actually breaks. The biggest mistake is building self-healing in advance against imagined failures. Build it after you have observed real ones for at least 30 days
Self-healing is a maintenance burden. The diagnosis layer needs updating when error formats drift. The recovery code needs care when adding new providers. If your agent does not need 24/7 reliability, simpler designs win.
The clearest signal that you do need it: someone is being paged outside business hours.
Three traps I walked into in production
A clean design diagram doesn't survive contact with production. Here are the three traps that bit me:
Trap 1: the health check made the agent slower
My first version queried Redis on every tool call. 3–8ms per call sounds fine until you realise complex agent loops make 30+ tool calls. Total overhead crossed 500ms in some sessions. The fix: cache the health status for 5 seconds. A 5-second delay in transitioning to critical is acceptable; a 500ms tax on every successful loop is not.
Trap 2: failover targets had cold-start latency
The first time we failed over from Anthropic to Gemini, the Gemini side hadn't been hit in hours. The first call took 4 extra seconds. Users complained that "the failover is slower than the failure". The fix: send a small warm-up probe to both providers every 15 minutes so neither cache goes cold.
Trap 3: my regexes broke when OpenAI changed error wording
OpenAI quietly switched from "Quota exceeded" to "Resource has been exhausted". My rate-limit regex stopped matching. Everything fell into UNKNOWN. The fix: measure the diagnoser itself. I now alert when the share of UNKNOWN diagnoses crosses 5% of recent failures — that catches drifted error-wording before users do.
How to know it's working
If you can't measure the change, this is just over-engineering. Four metrics actually move:
MTTR — time from first error to recovered response. Mine went from ~45 min (manual intervention) to ~1.2 min (automatic)
Auto-recovery rate — share of incidents that resolved without escalate_to_human. Target 90%, currently 93%
False-positive rate — circuit breakers tripped on healthy traffic. Target ≤1/week, currently 0–2
Pages outside business hours — the human cost. Mine: 6.2/week → 0.4/week
That last one is the one I actually care about. If your night-time Slack notifications aren't trending toward zero, the design isn't paying off yet. For the broader observability picture, see Designing trace and observability for AI agents in Antigravity.
One concrete next step
Don't try to ship all three layers at once. The single most useful first step is this:
Open the error logs of an agent you're already running. List the top five failure patterns over the last 30 days. Give each one a name.
That alone tells you where your agent actually breaks, which determines whether you start with detection or diagnosis. I started with diagnosis because three patterns covered 70% of my incidents — handling just those three cut my night-time pages in half before I'd built any of the rest.
Your agent will only start getting smarter once it stops dying silently. The three-layer design is the first loop that makes that possible.
Share
Thank You for Reading
Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.