Agents & Manager · 2026-04-28 · Advanced

Designing Production Incident Runbooks for Antigravity Agents: A Practical Framework from Detection to Recovery

A complete guide to designing incident runbooks for production Antigravity Agents — detection, triage, mitigation, and postmortem, with working code you can drop into your stack today.

"Slack went off at 2am, and my Antigravity Agent had hammered the same tool 200 times in a row before getting stuck." That actually happened to me. Half-asleep, I opened Manager Surface and burned 30 minutes just figuring out which trace to look at first. With a runbook in place, I could have reached the root cause in 5 minutes.

Operating AI Agents in production produces a different shape of failure than typical web services. CPU is healthy, the API returns 200, and yet "the thinking is broken." This article hands you the complete Antigravity-specific runbook framework I've sharpened across multiple products, code and all.

The design scales from solo indie projects to dozens of agents in production. By the end, you'll have concrete steps to triage within 5 minutes of the first alert and minimize user impact.

Why Antigravity Agents Need a Dedicated Runbook

Traditional web service runbooks are built around external signals: traffic spikes, broken database connections. Antigravity Agent failures look different.

  • "Working but wrong" failures are common: tool calls succeed, APIs return 200, but the output fails the business requirement
  • Failures don't propagate cleanly: in multi-agent setups, a Worker Agent can be broken while the Manager Agent reports "no issues"
  • Cost itself is an incident signal: by the time token consumption spikes, hundreds of dollars in damage may already be done
  • Reproducibility is low: identical inputs often refuse to reproduce the bug on a second run

In other words, runbooks built around HTTP status and CPU don't catch what matters. You need a flow centered on Antigravity-specific identifiers like runId, traceId, and agentSpanId.
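One way to make those identifiers usable at 2am is to stamp them on every log line. The following is a minimal sketch, assuming a JSON-lines log format; the `AgentLogContext` shape and the `formatAgentLog` helper are hypothetical names, since the article names the identifiers but not a schema.

```typescript
// Hypothetical log context: only the identifier names come from the article.
interface AgentLogContext {
  runId: string;
  traceId: string;
  agentSpanId: string;
  agentId: string;
}

// Emits one JSON object per line so logs can be filtered by any identifier.
function formatAgentLog(
  ctx: AgentLogContext,
  level: "info" | "warn" | "error",
  message: string,
  now: Date = new Date(),
): string {
  return JSON.stringify({ ts: now.toISOString(), level, message, ...ctx });
}

const line = formatAgentLog(
  { runId: "run_123", traceId: "trace_abc", agentSpanId: "span_7", agentId: "code-reviewer" },
  "error",
  "tool call failed",
);
console.log(line);
```

With every line carrying the same four keys, a single search on `traceId` pulls the whole run together regardless of which component wrote the log.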

When I started, I tried reusing generic SRE runbook templates and ended up burning brain cycles every incident on "which logs do I look at?" — fatal at 2am. Switching to an Agent-specific framework brought my median triage time from 18 minutes down to 4.

The Four-Tier Runbook Model — Lightweight Enough for Indie Devs

What I landed on is a four-tier structure. Heavy processes don't survive, so I optimized for what a solo developer can sustain.

  • L0: Detection — the layer that first notices something is wrong. Centralizes alert definitions and triggers
  • L1: Triage — within 5 minutes, decide "user-facing yes/no" and "self-healing yes/no"
  • L2: Mitigation — immediate actions to stop user impact: kill switch, fallback, traffic shedding
  • L3: Recovery & Postmortem — root-cause fix, runbook update, prevention work

Each tier has its own checklist and code snippets. Nobody thinks clearly at 2am, so the runbook needs to do the thinking for you.

L0: Detection — Four Distinct Alert Categories

Start by separating Antigravity Agent metrics into four buckets. Mixing them on one dashboard guarantees you'll miss things.

// monitoring/agent-alerts.ts
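// Antigravity Agent alert definitions (prom-client metrics, queried with PromQL)
// The { name, help, labelNames } constructor options are prom-client's API,
// so the metric classes must come from prom-client, not @opentelemetry/api.
import { Counter, Gauge, Histogram } from "prom-client";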
 
// 1. Liveness: is the Agent running at all?
export const agentRunCount = new Counter({
  name: "antigravity_agent_run_total",
  help: "Total number of Agent runs by status",
  labelNames: ["agent_id", "status"], // status: success | failure | timeout
});
 
// 2. Quality: is the output meeting business requirements?
export const agentEvalScore = new Histogram({
  name: "antigravity_agent_eval_score",
  help: "Eval score (0-1) for Agent outputs",
  labelNames: ["agent_id", "eval_type"],
  buckets: [0.5, 0.7, 0.8, 0.9, 0.95],
});
 
// 3. Cost: are tokens within budget?
export const agentTokenSpend = new Counter({
  name: "antigravity_agent_token_spend_usd",
  help: "Cumulative USD spend per Agent",
  labelNames: ["agent_id", "model"],
});
 
// 4. Loop: is the Agent hammering the same tool?
export const agentToolCallStreak = new Gauge({
  name: "antigravity_agent_tool_call_streak",
  help: "Consecutive identical tool calls (potential loop)",
  labelNames: ["agent_id", "tool_name"],
});
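
The streak gauge only helps if something computes the streak. Here is a minimal sketch of that bookkeeping, assuming "identical" means the same tool name called back to back within one run; the `ToolCallStreakTracker` class is a hypothetical helper, not part of any SDK.

```typescript
// Tracks consecutive calls to the same tool per run. Names are illustrative.
class ToolCallStreakTracker {
  private last = new Map<string, { toolName: string; streak: number }>();

  // Records one tool call and returns the current streak length for that run.
  record(runId: string, toolName: string): number {
    const prev = this.last.get(runId);
    const streak = prev !== undefined && prev.toolName === toolName ? prev.streak + 1 : 1;
    this.last.set(runId, { toolName, streak });
    return streak;
  }

  // Call when a run finishes so the map does not grow without bound.
  end(runId: string): void {
    this.last.delete(runId);
  }
}

const tracker = new ToolCallStreakTracker();
tracker.record("run_1", "search_docs");
tracker.record("run_1", "search_docs");
console.log(tracker.record("run_1", "search_docs")); // 3
```

The returned value is what you would write into the `antigravity_agent_tool_call_streak` gauge after each tool call. A stricter variant could also compare a hash of the tool input, so that legitimate repeated calls with different arguments do not count as a loop.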

The hard-won lesson from my own work: if you don't design cost alerts first, you can't undo the damage. I have personally torched $300 in a single night. Build a cumulative agentTokenSpend counter and a "page if more than $X per hour" rule into day one of your deployment.
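An alert rule pages a human; an in-process guard can react before the page lands. The sketch below shows one possible shape for a rolling "more than $X per hour" check. `HourlyBudget` is a hypothetical class, and the $20 limit simply reuses the indie-scale figure from this article.

```typescript
// Rolling one-hour spend check. The class and method names are hypothetical.
interface SpendEvent {
  atMs: number; // when the spend was recorded
  usd: number; // cost of one model call in USD
}

class HourlyBudget {
  private events: SpendEvent[] = [];
  private limitUsd: number;
  private windowMs: number;

  constructor(limitUsd: number, windowMs: number = 60 * 60 * 1000) {
    this.limitUsd = limitUsd;
    this.windowMs = windowMs;
  }

  // Records a spend and returns true when the rolling window exceeds the limit.
  add(usd: number, nowMs: number = Date.now()): boolean {
    this.events.push({ atMs: nowMs, usd });
    // Keep only events that are still inside the window.
    this.events = this.events.filter((e) => nowMs - e.atMs < this.windowMs);
    const total = this.events.reduce((sum, e) => sum + e.usd, 0);
    return total > this.limitUsd;
  }
}

const budget = new HourlyBudget(20);
if (budget.add(0.42)) {
  console.warn("Hourly budget exceeded");
}
```

Passing `nowMs` explicitly keeps the check deterministic in tests; in production the default `Date.now()` is enough.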

Here's a PromQL alert ruleset for the four metrics defined above, written in the Prometheus rule-file format that Prometheus-compatible backends such as Grafana Cloud can load.

# monitoring/alerts.yml
# Production alert rules for Antigravity Agents
groups:
  - name: antigravity_agent_alerts
    interval: 30s
    rules:
      # 1. Liveness: failure rate > 20% over 5 minutes
      - alert: AgentFailureRateHigh
        expr: |
          (
            sum(rate(antigravity_agent_run_total{status="failure"}[5m])) by (agent_id)
            / sum(rate(antigravity_agent_run_total[5m])) by (agent_id)
          ) > 0.2
        for: 5m
        labels:
          severity: page
          runbook: agent-failure-rate
        annotations:
          summary: "Agent {{ $labels.agent_id }} failure rate > 20%"
 
      # 2. Cost: more than $20 in one hour (indie scale)
      - alert: AgentTokenSpendBurst
        expr: |
          sum by (agent_id) (increase(antigravity_agent_token_spend_usd[1h])) > 20
        for: 5m
        labels:
          severity: page
          runbook: agent-cost-burst
        annotations:
          summary: "Agent {{ $labels.agent_id }} burned ${{ $value }} in 1h"
 
      # 3. Loop: same tool called 30 times in a row
      - alert: AgentToolLoopDetected
        expr: antigravity_agent_tool_call_streak > 30
        for: 1m
        labels:
          severity: page
          runbook: agent-tool-loop
        annotations:
          summary: "Agent {{ $labels.agent_id }} stuck on {{ $labels.tool_name }}"
 
      # 4. Quality: median eval score below 0.7 over 1 hour
      - alert: AgentQualityDegraded
        expr: |
          histogram_quantile(0.5, sum(rate(antigravity_agent_eval_score_bucket[1h])) by (le, agent_id)) < 0.7
        for: 15m
        labels:
          severity: ticket
          runbook: agent-quality-drop
        annotations:
          summary: "Agent {{ $labels.agent_id }} median eval score < 0.7"

A severity: page alert interrupts the on-call engineer immediately; a severity: ticket alert can wait until the next business day. Marking everything as page is how you guarantee you'll miss the one that matters. I learned the hard way to draw that line carefully.
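If alerts reach your own webhook handler, the page/ticket split can be enforced in a few lines. This is a sketch under assumptions: the `AlertLike` interface mirrors the per-alert `status`, `labels`, and `annotations` fields that Alertmanager sends in webhook payloads, while `routeAlert` and its fallback behavior are illustrative choices.

```typescript
// Subset of the per-alert fields found in an Alertmanager webhook payload.
interface AlertLike {
  status: "firing" | "resolved";
  labels: Record<string, string>;
  annotations: Record<string, string>;
}

type Route = "page" | "ticket" | "ignore";

// Decides how loudly to notify, based on the severity label from the rules.
function routeAlert(alert: AlertLike): Route {
  if (alert.status === "resolved") return "ignore";
  // A missing or unknown severity becomes a ticket so nothing pages by accident.
  return alert.labels["severity"] === "page" ? "page" : "ticket";
}

const example: AlertLike = {
  status: "firing",
  labels: { alertname: "AgentToolLoopDetected", severity: "page", runbook: "agent-tool-loop" },
  annotations: { summary: "Agent code-reviewer stuck on search_docs" },
};
console.log(routeAlert(example), example.labels["runbook"]); // page agent-tool-loop
```

Defaulting unknown severities to a ticket is a deliberate bias toward fewer pages; flip it if a silent miss would cost you more than a false alarm.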

L1: The 5-Minute Triage Protocol

When the alert fires, you want to answer three questions within 5 minutes:

  1. Is there user impact? (If yes, jump to L2 immediately.)
  2. Is automatic recovery possible? (If yes, retry first.)
  3. Is there blast-radius risk? (Other agents, downstream services.)

The Markdown template below distills this flow. Drop it into Slack Workflow Builder or Notion.

## Triage Checklist (complete within 5 minutes)
 
- [ ] Capture the **alert ID** (e.g., `AgentFailureRateHigh-2026-04-28-02-15`)
- [ ] In Manager Surface, open the top 3 `runId`s for the affected `agent_id`
- [ ] Identify the error pattern:
  - [ ] `ToolTimeout` -> L2-A: timeout-relaxation flow
  - [ ] `RateLimitExceeded` -> L2-B: backoff-extension flow
  - [ ] `MaxIterationsReached` -> L2-C: loop-detection flow
  - [ ] `EvalScoreDrop` -> L2-D: model-rollback flow
  - [ ] Other -> L2-E: consider firing the kill switch
- [ ] Search the `is_user_facing` field with `cmd+K`:
  - "yes": fire L2 within 60 seconds
  - "no": move to L3 within 15 minutes
- [ ] Check blast radius:
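  - [ ] Failure rate of other agents that share the same tool (PromQL: `sum by (agent_id) (rate(antigravity_agent_run_total{status="failure"}[5m]))`; the run counter from L0 has no `tool_name` label, so compare the agents you know call that tool)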
  - [ ] Response time of downstream services (DB, third-party APIs)
- [ ] Post `[L1 done] agent_id=xxx pattern=ToolTimeout user_impact=yes` to **#incidents-agent**

I refined this checklist by literally running 5-minute drills against past incidents. The trick is to write specific GUI actions like "open the top 3 runIds in Manager Surface." Vague advice like "check the logs" is useless when you're half-asleep.
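The branching in the checklist is mechanical enough to encode, which removes one more decision at 2am. Below is a minimal sketch, assuming the error pattern arrives as a plain string; the flow labels L2-A to L2-E and the 60-second and 15-minute deadlines are copied from the checklist, while `triage` and `TriageDecision` are hypothetical names.

```typescript
// Flow labels are copied from the checklist; everything else is illustrative.
const FLOW_BY_PATTERN = new Map<string, string>([
  ["ToolTimeout", "L2-A"],
  ["RateLimitExceeded", "L2-B"],
  ["MaxIterationsReached", "L2-C"],
  ["EvalScoreDrop", "L2-D"],
]);

interface TriageDecision {
  nextStep: string; // an L2 flow, or "L3" when users are not affected
  deadlineSeconds: number; // 60 s when user-facing, 15 min otherwise
  slackMessage: string;
}

function triage(agentId: string, pattern: string, userFacing: boolean): TriageDecision {
  // Unrecognized patterns fall through to L2-E (consider the kill switch).
  const flow = FLOW_BY_PATTERN.get(pattern) ?? "L2-E";
  const impact = userFacing ? "yes" : "no";
  return {
    nextStep: userFacing ? flow : "L3",
    deadlineSeconds: userFacing ? 60 : 15 * 60,
    slackMessage: `[L1 done] agent_id=${agentId} pattern=${pattern} user_impact=${impact}`,
  };
}

console.log(triage("code-reviewer", "ToolTimeout", true).slackMessage);
```

The last line prints `[L1 done] agent_id=code-reviewer pattern=ToolTimeout user_impact=yes`, which is the message format the checklist asks you to post.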

L2: A Mitigation Library — Kill Switch in One Line

If user impact is happening, stop the bleeding before investigating. Have these five mitigation patterns ready for Antigravity Agents.

// runbook/mitigations.ts
// Production-incident mitigation library
import { ConfigStore } from "./config-store";
 
interface MitigationContext {
  agentId: string;
  reason: string;
  operator: string; // who is executing this
  durationMinutes?: number;
}
 
export class AgentMitigator {
  constructor(private config: ConfigStore) {}
 
  // Pattern A: Kill switch — fully stop the Agent
  async killSwitch(ctx: MitigationContext): Promise<void> {
    await this.config.set(`agent:${ctx.agentId}:enabled`, false, {
      ttlMinutes: ctx.durationMinutes ?? 60,
      audit: { reason: ctx.reason, operator: ctx.operator },
    });
    console.log(`[KILL] ${ctx.agentId} stopped for ${ctx.durationMinutes ?? 60}min`);
    await this.notifySlack(`🚨 ${ctx.agentId} stopped for ${ctx.durationMinutes ?? 60}min (${ctx.operator}: ${ctx.reason})`);
  }
 
  // Pattern B: Fallback — switch to a deterministic path
  async fallbackToDeterministic(ctx: MitigationContext): Promise<void> {
    await this.config.set(`agent:${ctx.agentId}:mode`, "fallback", {
      ttlMinutes: ctx.durationMinutes ?? 30,
      audit: { reason: ctx.reason, operator: ctx.operator },
    });
    await this.notifySlack(`🔄 ${ctx.agentId} switched to fallback mode`);
  }
 
  // Pattern C: Model downgrade — roll back to a stable version
  async downgradeModel(ctx: MitigationContext, fallbackModel: string): Promise<void> {
    await this.config.set(`agent:${ctx.agentId}:model`, fallbackModel, {
      ttlMinutes: ctx.durationMinutes ?? 1440,
      audit: { reason: ctx.reason, operator: ctx.operator },
    });
    await this.notifySlack(`⬇️ ${ctx.agentId} downgraded to ${fallbackModel}`);
  }
 
  // Pattern D: Throttle — cap concurrent runs
  async throttle(ctx: MitigationContext, maxConcurrent: number): Promise<void> {
    await this.config.set(`agent:${ctx.agentId}:max_concurrent`, maxConcurrent, {
      ttlMinutes: ctx.durationMinutes ?? 60,
      audit: { reason: ctx.reason, operator: ctx.operator },
    });
    await this.notifySlack(`🐌 ${ctx.agentId} throttled to ${maxConcurrent} concurrent runs`);
  }
 
  // Pattern E: Force-break a loop — cancel an in-progress run
  async breakLoop(ctx: MitigationContext, runId: string): Promise<void> {
    await this.config.set(`run:${runId}:cancel`, true, {
      ttlMinutes: 5,
      audit: { reason: ctx.reason, operator: ctx.operator },
    });
    await this.notifySlack(`✂️ Force-cancelled runId=${runId}`);
  }
 
  private async notifySlack(message: string): Promise<void> {
    const webhookUrl = process.env.SLACK_INCIDENT_WEBHOOK_URL;
    if (!webhookUrl) {
      console.warn("SLACK_INCIDENT_WEBHOOK_URL not set, skipping notification");
      return;
    }
    try {
      await fetch(webhookUrl, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text: message, channel: "#incidents-agent" }),
      });
    } catch (e) {
      console.error("Slack notify failed:", e);
      // Crucial: never let notification failure abort the mitigation
    }
  }
}
 
// Usage: kill switch in one line
// const mitigator = new AgentMitigator(configStore);
// await mitigator.killSwitch({ agentId: "code-reviewer", reason: "loop detected", operator: "masaki" });

The deliberate design choice here: don't let Slack failure abort the mitigation. Slack itself going down during an incident is more common than you'd think. Wrap the notification in try/catch and let the actual mitigation always complete.

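Putting a TTL on every mitigation is equally important. It prevents the classic mistake of manually setting enabled = false and forgetting about it until next week. The defaults above all expire on their own within 24 hours (1,440 minutes for a model downgrade, 60 minutes or less for the rest); if you need longer, you have to make a deliberate choice to extend by passing durationMinutes.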
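The mitigation library imports a `ConfigStore` without showing it. For local drills, an in-memory stand-in with the same `set(key, value, { ttlMinutes, audit })` call shape is enough. The sketch below is an assumption about that interface, not the author's implementation; `InMemoryConfigStore`, the `get` method, and the 24-hour clamp are all illustrative.

```typescript
interface SetOptions {
  ttlMinutes: number;
  audit: { reason: string; operator: string };
}

interface Entry {
  value: unknown;
  expiresAtMs: number;
  audit: SetOptions["audit"];
}

const MAX_TTL_MINUTES = 24 * 60; // hard cap so no override outlives a day

// Hypothetical in-memory stand-in for the ConfigStore used by AgentMitigator.
class InMemoryConfigStore {
  private entries = new Map<string, Entry>();
  private now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  async set(key: string, value: unknown, opts: SetOptions): Promise<void> {
    const ttl = Math.min(opts.ttlMinutes, MAX_TTL_MINUTES);
    this.entries.set(key, {
      value,
      expiresAtMs: this.now() + ttl * 60_000,
      audit: opts.audit,
    });
  }

  // Returns undefined once the override has expired, i.e. normal behavior resumes.
  get(key: string): unknown {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAtMs) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}
```

Callers treat `undefined` as "no override", so an expired kill switch re-enables the Agent without anyone having to remember it. A production store would persist the audit record rather than hold it in memory.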

L3: Recovery — Tracing Root Cause from the Trace ID

Once the bleeding stops, you investigate root cause and ship a permanent fix. With Antigravity Agents, the traceId is your strongest weapon: you can replay the "trail of thought."

// runbook/postmortem-data.ts
// Postmortem data collector
// Usage: npx tsx postmortem-data.ts <traceId>
 
import { AntigravityClient } from "@antigravity/sdk";
import { writeFileSync } from "node:fs";
 
interface PostmortemBundle {
  traceId: string;
  agentId: string;
  startedAt: string;
  endedAt: string;
  totalTokens: number;
  totalUsd: number;
  toolCalls: Array<{
    spanId: string;
    toolName: string;
    durationMs: number;
    success: boolean;
    inputHash: string;
    outputHash: string;
  }>;
  modelMessages: Array<{
    role: string;
    contentSummary: string; // first 200 chars
    tokens: number;
  }>;
  evalScores: Array<{ evalType: string; score: number }>;
}
 
async function collectBundle(traceId: string): Promise<PostmortemBundle> {
  const client = new AntigravityClient({
    apiKey: process.env.ANTIGRAVITY_API_KEY!,
  });
 
  try {
    const trace = await client.traces.get(traceId);
    const spans = await client.traces.spans(traceId);
    const evals = await client.traces.evals(traceId);
 
    return {
      traceId,
      agentId: trace.agentId,
      startedAt: trace.startedAt,
      endedAt: trace.endedAt ?? new Date().toISOString(),
      totalTokens: trace.totalTokens,
      totalUsd: trace.totalUsd,
      toolCalls: spans
        .filter((s) => s.type === "tool_call")
        .map((s) => ({
          spanId: s.id,
          toolName: s.attributes.tool_name,
          durationMs: s.durationMs,
          success: s.status === "ok",
          inputHash: s.attributes.input_hash,
          outputHash: s.attributes.output_hash,
        })),
      modelMessages: spans
        .filter((s) => s.type === "model_message")
        .map((s) => ({
          role: s.attributes.role,
          contentSummary: (s.attributes.content ?? "").slice(0, 200),
          tokens: s.attributes.tokens ?? 0,
        })),
      evalScores: evals.map((e) => ({ evalType: e.type, score: e.score })),
    };
  } catch (e) {
    console.error(`Failed to collect bundle for ${traceId}:`, e);
    throw new Error(`Trace ${traceId} not found or API error`);
  }
}
 
// CLI entry point
const traceId = process.argv[2];
if (!traceId) {
  console.error("Usage: npx tsx postmortem-data.ts <traceId>");
  process.exit(1);
}
 
collectBundle(traceId)
  .then((bundle) => {
    const filename = `postmortem-${traceId.slice(0, 8)}-${Date.now()}.json`;
    writeFileSync(filename, JSON.stringify(bundle, null, 2));
    console.log(`✅ Saved bundle: ${filename}`);
    console.log(`   Tool calls: ${bundle.toolCalls.length}`);
    console.log(`   Token cost: $${bundle.totalUsd.toFixed(2)}`);
    console.log(`   Failures: ${bundle.toolCalls.filter((c) => !c.success).length}`);
  })
  .catch((e) => {
    console.error("❌ Bundle collection failed:", e.message);
    process.exit(1);
  });

Run it with npx tsx postmortem-data.ts <traceId> and you have a single JSON file you can attach to your postmortem in Notion.

A typical run looks like this:

✅ Saved bundle: postmortem-abc12345-1714356123456.json
   Tool calls: 247
   Token cost: $18.42
   Failures: 12

Seeing Tool calls: 247 immediately flags an anomaly. Healthy runs usually sit between 5 and 30 calls.
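The same check can run against the saved bundle instead of relying on eyeballing the count. Here is a minimal sketch, assuming the `toolCalls` entries carry the `toolName` and `inputHash` fields from `PostmortemBundle`; the `findLoop` function and its default threshold of 30 (the upper end of the healthy range quoted above) are illustrative.

```typescript
// Only the fields this check needs from a PostmortemBundle tool call.
interface ToolCallSummary {
  toolName: string;
  inputHash: string;
}

interface LoopReport {
  totalCalls: number;
  suspicious: boolean; // true when the run exceeds the healthy call count
  longestStreak: { toolName: string; length: number } | null;
}

// Finds the longest run of back-to-back calls with the same tool and input.
function findLoop(calls: ToolCallSummary[], healthyMax: number = 30): LoopReport {
  let longest: { toolName: string; length: number } | null = null;
  let current = 0;
  let prev: ToolCallSummary | undefined;
  for (const call of calls) {
    const same =
      prev !== undefined && prev.toolName === call.toolName && prev.inputHash === call.inputHash;
    current = same ? current + 1 : 1;
    if (longest === null || current > longest.length) {
      longest = { toolName: call.toolName, length: current };
    }
    prev = call;
  }
  return { totalCalls: calls.length, suspicious: calls.length > healthyMax, longestStreak: longest };
}

const report = findLoop([
  { toolName: "search_docs", inputHash: "h1" },
  { toolName: "search_docs", inputHash: "h1" },
  { toolName: "read_file", inputHash: "h2" },
]);
console.log(report.longestStreak); // { toolName: 'search_docs', length: 2 }
```

A long streak on one `toolName` with an unchanged `inputHash` points at a loop, whereas a high total with short streaks suggests the Agent is wandering rather than stuck.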
