Agents & Manager / 2026-04-28 / Advanced

Designing Production Incident Runbooks for Antigravity Agents: A Practical Framework from Detection to Recovery

A complete guide to designing incident runbooks for production Antigravity Agents — detection, triage, mitigation, and postmortem, with working code you can drop into your stack today.


"Slack went off at 2am, and my Antigravity Agent had hammered the same tool 200 times in a row before getting stuck." That actually happened to me. Half-asleep, I opened Manager Surface and burned 30 minutes just figuring out which trace to look at first. With a runbook in place, I could have reached the root cause in 5 minutes.

Operating AI Agents in production produces a different shape of failure than typical web services. CPU is healthy, the API returns 200, and yet "the thinking is broken." This article hands you the complete Antigravity-specific runbook framework I've sharpened across multiple products, code and all.

The design scales from solo indie projects to dozens of agents in production. By the end, you'll have concrete steps to triage within 5 minutes of the first alert and minimize user impact.

Why Antigravity Agents Need a Dedicated Runbook

Traditional web service runbooks are built around external signals: traffic spikes, broken database connections. Antigravity Agent failures look different.

"Working but wrong" failures are common: tool calls succeed, APIs return 200, but the output fails the business requirement
Failures don't propagate cleanly: in multi-agent setups, a Worker Agent can be broken while the Manager Agent reports "no issues"
Cost itself is an incident signal: by the time token consumption spikes, hundreds of dollars in damage may already be done
Reproducibility is low: identical inputs often refuse to reproduce the bug on a second run

In other words, runbooks built around HTTP status and CPU don't catch what matters. You need a flow centered on Antigravity-specific identifiers like runId, traceId, and agentSpanId.

When I started, I tried reusing generic SRE runbook templates and ended up burning brain cycles every incident on "which logs do I look at?" — fatal at 2am. Switching to an Agent-specific framework brought my median triage time from 18 minutes down to 4.

The Four-Tier Runbook Model — Lightweight Enough for Indie Devs

What I landed on is a four-tier structure. Heavy processes don't survive, so I optimized for what a solo developer can sustain.

L0: Detection — the layer that first notices something is wrong. Centralizes alert definitions and triggers
L1: Triage — within 5 minutes, decide "user-facing yes/no" and "self-healing yes/no"
L2: Mitigation — immediate actions to stop user impact: kill switch, fallback, traffic shedding
L3: Recovery & Postmortem — root-cause fix, runbook update, prevention work

Each tier has its own checklist and code snippets. Nobody thinks clearly at 2am, so the runbook needs to do the thinking for you.

L0: Detection — Four Distinct Alert Categories

Start by separating Antigravity Agent metrics into four buckets. Mixing them on one dashboard guarantees you'll miss things.

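// monitoring/agent-alerts.ts
// Antigravity Agent alert definitions (prom-client metrics, queried with PromQL)
// Note: the { name, help, labelNames } constructor shape is prom-client's API,
// not @opentelemetry/api's, so the import comes from prom-client.
import { Counter, Gauge, Histogram } from "prom-client";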
 
// 1. Liveness: is the Agent running at all?
export const agentRunCount = new Counter({
  name: "antigravity_agent_run_total",
  help: "Total number of Agent runs by status",
  labelNames: ["agent_id", "status"], // status: success | failure | timeout
});
 
// 2. Quality: is the output meeting business requirements?
export const agentEvalScore = new Histogram({
  name: "antigravity_agent_eval_score",
  help: "Eval score (0-1) for Agent outputs",
  labelNames: ["agent_id", "eval_type"],
  buckets: [0.5, 0.7, 0.8, 0.9, 0.95],
});
 
// 3. Cost: are tokens within budget?
export const agentTokenSpend = new Counter({
  name: "antigravity_agent_token_spend_usd",
  help: "Cumulative USD spend per Agent",
  labelNames: ["agent_id", "model"],
});
 
// 4. Loop: is the Agent hammering the same tool?
export const agentToolCallStreak = new Gauge({
  name: "antigravity_agent_tool_call_streak",
  help: "Consecutive identical tool calls (potential loop)",
  labelNames: ["agent_id", "tool_name"],
});

The hard-won lesson from my own work: if you don't design cost alerts first, you can't undo the damage. I have personally torched $300 in a single night. Build a cumulative agentTokenSpend counter and a "page if more than $X per hour" rule into day one of your deployment.
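A Prometheus rule only fires after a scrape plus its `for:` delay, so an in-process guard can complement the alert. The sketch below is one possible shape rather than part of the original stack: `HourlySpendTracker` and its method names are hypothetical, and the caller is assumed to compute the USD cost of each model call itself.

```typescript
// Sketch: sliding one-hour spend window per agent (hypothetical helper).
interface SpendEvent {
  atMs: number; // epoch milliseconds when the model call finished
  usd: number; // cost of that call in USD, computed by the caller
}

const HOUR_MS = 60 * 60 * 1000;

class HourlySpendTracker {
  private events = new Map<string, SpendEvent[]>();
  private budgetUsdPerHour: number;

  constructor(budgetUsdPerHour: number) {
    this.budgetUsdPerHour = budgetUsdPerHour;
  }

  // Record one call; returns true when the trailing hour exceeds the budget.
  record(agentId: string, usd: number, nowMs: number): boolean {
    const recent = (this.events.get(agentId) ?? []).filter(
      (e) => nowMs - e.atMs < HOUR_MS,
    );
    recent.push({ atMs: nowMs, usd });
    this.events.set(agentId, recent);
    return this.spentLastHour(agentId, nowMs) > this.budgetUsdPerHour;
  }

  // Sum of all recorded spend for this agent within the trailing hour.
  spentLastHour(agentId: string, nowMs: number): number {
    return (this.events.get(agentId) ?? [])
      .filter((e) => nowMs - e.atMs < HOUR_MS)
      .reduce((sum, e) => sum + e.usd, 0);
  }
}
```

When `record` returns true, the caller can trip a mitigation such as a throttle or kill switch straight away instead of waiting for the page.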

Here's a corresponding PromQL alert ruleset, written in the Prometheus rule-file format for Prometheus-compatible backends such as Grafana Cloud.

# monitoring/alerts.yml
# Production alert rules for Antigravity Agents
groups:
  - name: antigravity_agent_alerts
    interval: 30s
    rules:
      # 1. Liveness: failure rate > 20% over 5 minutes
      - alert: AgentFailureRateHigh
        expr: |
          (
            sum(rate(antigravity_agent_run_total{status="failure"}[5m])) by (agent_id)
            / sum(rate(antigravity_agent_run_total[5m])) by (agent_id)
          ) > 0.2
        for: 5m
        labels:
          severity: page
          runbook: agent-failure-rate
        annotations:
          summary: "Agent {{ $labels.agent_id }} failure rate > 20%"
 
      # 2. Cost: more than $20 in one hour per Agent, summed across models (indie scale)
      - alert: AgentTokenSpendBurst
        expr: |
          sum by (agent_id) (increase(antigravity_agent_token_spend_usd[1h])) > 20
        for: 5m
        labels:
          severity: page
          runbook: agent-cost-burst
        annotations:
          summary: "Agent {{ $labels.agent_id }} burned ${{ $value }} in 1h"
 
      # 3. Loop: same tool called 30 times in a row
      - alert: AgentToolLoopDetected
        expr: antigravity_agent_tool_call_streak > 30
        for: 1m
        labels:
          severity: page
          runbook: agent-tool-loop
        annotations:
          summary: "Agent {{ $labels.agent_id }} stuck on {{ $labels.tool_name }}"
 
      # 4. Quality: median eval score below 0.7 over 1 hour
      - alert: AgentQualityDegraded
        expr: |
          histogram_quantile(0.5, sum(rate(antigravity_agent_eval_score_bucket[1h])) by (le, agent_id)) < 0.7
        for: 15m
        labels:
          severity: ticket
          runbook: agent-quality-drop
        annotations:
          summary: "Agent {{ $labels.agent_id }} median eval score < 0.7"

severity: page interrupts the on-call immediately; severity: ticket is fine for next business day. Marking everything as page is how you guarantee you'll miss the one that matters. I learned the hard way to draw that line carefully.
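The loop rule above depends on something keeping the `antigravity_agent_tool_call_streak` gauge up to date. One way to compute that value is sketched here. It assumes "identical" means the same tool name with the same input hash, and `ToolCallStreakTracker` is a hypothetical helper, not an Antigravity API.

```typescript
// Sketch: count consecutive identical tool calls per agent (hypothetical helper).
class ToolCallStreakTracker {
  private last = new Map<string, { key: string; streak: number }>();

  // Call once per tool invocation; returns the current streak for this agent.
  observe(agentId: string, toolName: string, inputHash: string): number {
    const key = `${toolName}:${inputHash}`;
    const prev = this.last.get(agentId);
    // Same tool and same input as the previous call extends the streak;
    // anything else resets it to 1.
    const streak = prev !== undefined && prev.key === key ? prev.streak + 1 : 1;
    this.last.set(agentId, { key, streak });
    return streak;
  }
}
```

Feed the returned number into the gauge after every tool call. Dropping `inputHash` from the key makes the detector stricter, which may be preferable if your agent retries with slightly different arguments.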

L1: The 5-Minute Triage Protocol

When the alert fires, you want to answer three questions within 5 minutes:

Is there user impact? (If yes, jump to L2 immediately.)
Is automatic recovery possible? (If yes, retry first.)
Is there blast-radius risk? (Other agents, downstream services.)

The Markdown template below distills this flow. Drop it into Slack Workflow Builder or Notion.

## Triage Checklist (complete within 5 minutes)
 
- [ ] Capture the **alert ID** (e.g., `AgentFailureRateHigh-2026-04-28-02-15`)
- [ ] In Manager Surface, open the top 3 `runId`s for the affected `agent_id`
- [ ] Identify the error pattern:
  - [ ] `ToolTimeout` -> L2-A: timeout-relaxation flow
  - [ ] `RateLimitExceeded` -> L2-B: backoff-extension flow
  - [ ] `MaxIterationsReached` -> L2-C: loop-detection flow
  - [ ] `EvalScoreDrop` -> L2-D: model-rollback flow
  - [ ] Other -> L2-E: consider firing the kill switch
- [ ] Search the `is_user_facing` field with `cmd+K`:
  - "yes": fire L2 within 60 seconds
  - "no": move to L3 within 15 minutes
- [ ] Check blast radius:
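  - [ ] Failure rate of other agents using the same tool (PromQL: `rate(antigravity_agent_run_total{status="failure", tool_name="<name>"}[5m])`; this requires adding a `tool_name` label to `antigravity_agent_run_total`, which the L0 definition does not carry)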
  - [ ] Response time of downstream services (DB, third-party APIs)
- [ ] Post `[L1 done] agent_id=xxx pattern=ToolTimeout user_impact=yes` to **#incidents-agent**

I refined this checklist by literally running 5-minute drills against past incidents. The trick is to write specific GUI actions like "open the top 3 runIds in Manager Surface." Vague advice like "check the logs" is useless when you're half-asleep.
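The pattern-to-flow mapping in the checklist is mechanical enough to encode, so the decision does not depend on memory at 2am. This is a minimal sketch: the function names are hypothetical, and the flows and deadlines are copied from the checklist above.

```typescript
// Sketch: encode the L1 checklist's routing rules (hypothetical helper names).
type ErrorPattern =
  | "ToolTimeout"
  | "RateLimitExceeded"
  | "MaxIterationsReached"
  | "EvalScoreDrop";

interface TriageDecision {
  flow: string;
  nextTier: "L2" | "L3";
  deadlineSeconds: number;
}

const FLOWS: Record<ErrorPattern, string> = {
  ToolTimeout: "L2-A: timeout-relaxation flow",
  RateLimitExceeded: "L2-B: backoff-extension flow",
  MaxIterationsReached: "L2-C: loop-detection flow",
  EvalScoreDrop: "L2-D: model-rollback flow",
};

function isKnownPattern(p: string): p is ErrorPattern {
  return Object.prototype.hasOwnProperty.call(FLOWS, p);
}

function triage(pattern: string, isUserFacing: boolean): TriageDecision {
  // Unknown patterns fall through to the kill-switch consideration.
  const flow = isKnownPattern(pattern)
    ? FLOWS[pattern]
    : "L2-E: consider firing the kill switch";
  return isUserFacing
    ? { flow, nextTier: "L2", deadlineSeconds: 60 }
    : { flow, nextTier: "L3", deadlineSeconds: 15 * 60 };
}

// Builds the one-line summary posted to the incident channel.
function triageSummary(agentId: string, pattern: string, isUserFacing: boolean): string {
  const impact = isUserFacing ? "yes" : "no";
  return `[L1 done] agent_id=${agentId} pattern=${pattern} user_impact=${impact}`;
}
```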

L2: A Mitigation Library — Kill Switch in One Line

If user impact is happening, stop the bleeding before investigating. Have these five mitigation patterns ready for Antigravity Agents.

// runbook/mitigations.ts
// Production-incident mitigation library
import { ConfigStore } from "./config-store";
 
interface MitigationContext {
  agentId: string;
  reason: string;
  operator: string; // who is executing this
  durationMinutes?: number;
}
 
export class AgentMitigator {
  constructor(private config: ConfigStore) {}
 
  // Pattern A: Kill switch — fully stop the Agent
  async killSwitch(ctx: MitigationContext): Promise<void> {
    await this.config.set(`agent:${ctx.agentId}:enabled`, false, {
      ttlMinutes: ctx.durationMinutes ?? 60,
      audit: { reason: ctx.reason, operator: ctx.operator },
    });
    console.log(`[KILL] ${ctx.agentId} stopped for ${ctx.durationMinutes ?? 60}min`);
    await this.notifySlack(`🚨 ${ctx.agentId} stopped for ${ctx.durationMinutes ?? 60}min (${ctx.operator}: ${ctx.reason})`);
  }
 
  // Pattern B: Fallback — switch to a deterministic path
  async fallbackToDeterministic(ctx: MitigationContext): Promise<void> {
    await this.config.set(`agent:${ctx.agentId}:mode`, "fallback", {
      ttlMinutes: ctx.durationMinutes ?? 30,
      audit: { reason: ctx.reason, operator: ctx.operator },
    });
    await this.notifySlack(`🔄 ${ctx.agentId} switched to fallback mode`);
  }
 
  // Pattern C: Model downgrade — roll back to a stable version
  async downgradeModel(ctx: MitigationContext, fallbackModel: string): Promise<void> {
    await this.config.set(`agent:${ctx.agentId}:model`, fallbackModel, {
      ttlMinutes: ctx.durationMinutes ?? 1440,
      audit: { reason: ctx.reason, operator: ctx.operator },
    });
    await this.notifySlack(`⬇️ ${ctx.agentId} downgraded to ${fallbackModel}`);
  }
 
  // Pattern D: Throttle — cap concurrent runs
  async throttle(ctx: MitigationContext, maxConcurrent: number): Promise<void> {
    await this.config.set(`agent:${ctx.agentId}:max_concurrent`, maxConcurrent, {
      ttlMinutes: ctx.durationMinutes ?? 60,
      audit: { reason: ctx.reason, operator: ctx.operator },
    });
    await this.notifySlack(`🐌 ${ctx.agentId} throttled to ${maxConcurrent} concurrent runs`);
  }
 
  // Pattern E: Force-break a loop — cancel an in-progress run
  async breakLoop(ctx: MitigationContext, runId: string): Promise<void> {
    await this.config.set(`run:${runId}:cancel`, true, {
      ttlMinutes: 5,
      audit: { reason: ctx.reason, operator: ctx.operator },
    });
    await this.notifySlack(`✂️ Force-cancelled runId=${runId}`);
  }
 
  private async notifySlack(message: string): Promise<void> {
    const webhookUrl = process.env.SLACK_INCIDENT_WEBHOOK_URL;
    if (!webhookUrl) {
      console.warn("SLACK_INCIDENT_WEBHOOK_URL not set, skipping notification");
      return;
    }
    try {
      await fetch(webhookUrl, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text: message, channel: "#incidents-agent" }),
      });
    } catch (e) {
      console.error("Slack notify failed:", e);
      // Crucial: never let notification failure abort the mitigation
    }
  }
}
 
// Usage: kill switch in one line
// const mitigator = new AgentMitigator(configStore);
// await mitigator.killSwitch({ agentId: "code-reviewer", reason: "loop detected", operator: "masaki" });

The deliberate design choice here: don't let Slack failure abort the mitigation. Slack itself going down during an incident is more common than you'd think. Wrap the notification in try/catch and let the actual mitigation always complete.

The mandatory TTL is equally important. It prevents the classic mistake of manually setting enabled = false and forgetting about it until next week. Force automatic recovery within 24 hours; if you need longer, you have to make a deliberate choice to extend.
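The `ConfigStore` imported by the mitigation library is not shown, so here is a minimal in-memory sketch of how the mandatory TTL could be enforced. Everything in it is an assumption for illustration: the class name, the synchronous methods (a real store would likely be async and persistent), and the 24-hour cap taken from the text above.

```typescript
// Sketch: an in-memory config store where every override must expire.
interface AuditInfo {
  reason: string;
  operator: string;
}

interface SetOptions {
  ttlMinutes: number;
  audit: AuditInfo;
}

interface StoredEntry {
  value: unknown;
  expiresAtMs: number;
  audit: AuditInfo;
}

const MAX_TTL_MINUTES = 24 * 60;

class InMemoryConfigStore {
  private entries = new Map<string, StoredEntry>();
  private now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  set(key: string, value: unknown, opts: SetOptions): void {
    // Reject missing, non-positive, or over-long TTLs: no permanent overrides.
    if (!(opts.ttlMinutes > 0) || opts.ttlMinutes > MAX_TTL_MINUTES) {
      throw new Error(`ttlMinutes must be in (0, ${MAX_TTL_MINUTES}]`);
    }
    this.entries.set(key, {
      value,
      expiresAtMs: this.now() + opts.ttlMinutes * 60000,
      audit: opts.audit,
    });
  }

  // Returns the override while it is live, otherwise the caller's default.
  get<T>(key: string, fallback: T): T {
    const entry = this.entries.get(key);
    if (entry === undefined) return fallback;
    if (this.now() >= entry.expiresAtMs) {
      this.entries.delete(key);
      return fallback;
    }
    return entry.value as T;
  }
}
```

Because reads fall back to the default once the entry expires, a forgotten kill switch re-enables the Agent on its own, and extending it requires calling `set` again with a fresh reason.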

L3: Recovery — Tracing Root Cause from the Trace ID

Once the bleeding stops, you investigate root cause and ship a permanent fix. With Antigravity Agents, the traceId is your strongest weapon: you can replay the "trail of thought."

// runbook/postmortem-data.ts
// Postmortem data collector
// Usage: npx tsx postmortem-data.ts <traceId>
 
import { AntigravityClient } from "@antigravity/sdk";
import { writeFileSync } from "node:fs";
 
interface PostmortemBundle {
  traceId: string;
  agentId: string;
  startedAt: string;
  endedAt: string;
  totalTokens: number;
  totalUsd: number;
  toolCalls: Array<{
    spanId: string;
    toolName: string;
    durationMs: number;
    success: boolean;
    inputHash: string;
    outputHash: string;
  }>;
  modelMessages: Array<{
    role: string;
    contentSummary: string; // first 200 chars
    tokens: number;
  }>;
  evalScores: Array<{ evalType: string; score: number }>;
}
 
async function collectBundle(traceId: string): Promise<PostmortemBundle> {
  const client = new AntigravityClient({
    apiKey: process.env.ANTIGRAVITY_API_KEY!,
  });
 
  try {
    const trace = await client.traces.get(traceId);
    const spans = await client.traces.spans(traceId);
    const evals = await client.traces.evals(traceId);
 
    return {
      traceId,
      agentId: trace.agentId,
      startedAt: trace.startedAt,
      endedAt: trace.endedAt ?? new Date().toISOString(),
      totalTokens: trace.totalTokens,
      totalUsd: trace.totalUsd,
      toolCalls: spans
        .filter((s) => s.type === "tool_call")
        .map((s) => ({
          spanId: s.id,
          toolName: s.attributes.tool_name,
          durationMs: s.durationMs,
          success: s.status === "ok",
          inputHash: s.attributes.input_hash,
          outputHash: s.attributes.output_hash,
        })),
      modelMessages: spans
        .filter((s) => s.type === "model_message")
        .map((s) => ({
          role: s.attributes.role,
          contentSummary: (s.attributes.content ?? "").slice(0, 200),
          tokens: s.attributes.tokens ?? 0,
        })),
      evalScores: evals.map((e) => ({ evalType: e.type, score: e.score })),
    };
  } catch (e) {
    console.error(`Failed to collect bundle for ${traceId}:`, e);
    throw new Error(`Trace ${traceId} not found or API error`);
  }
}
 
// CLI entry point
const traceId = process.argv[2];
if (!traceId) {
  console.error("Usage: npx tsx postmortem-data.ts <traceId>");
  process.exit(1);
}
 
collectBundle(traceId)
  .then((bundle) => {
    const filename = `postmortem-${traceId.slice(0, 8)}-${Date.now()}.json`;
    writeFileSync(filename, JSON.stringify(bundle, null, 2));
    console.log(`✅ Saved bundle: ${filename}`);
    console.log(`   Tool calls: ${bundle.toolCalls.length}`);
    console.log(`   Token cost: $${bundle.totalUsd.toFixed(2)}`);
    console.log(`   Failures: ${bundle.toolCalls.filter((c) => !c.success).length}`);
  })
  .catch((e) => {
    console.error("❌ Bundle collection failed:", e.message);
    process.exit(1);
  });

Run it with npx tsx postmortem-data.ts <traceId> and you have a single JSON file you can attach to your postmortem in Notion.

A typical run looks like this:

✅ Saved bundle: postmortem-abc12345-1714356123456.json
   Tool calls: 247
   Token cost: $18.42
   Failures: 12

Seeing Tool calls: 247 immediately flags an anomaly. Healthy runs usually sit between 5 and 30 calls.
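The same check can run against the saved bundle so the anomaly and the offending tool are called out explicitly. This is a rough sketch: the function name is hypothetical, and the default ceiling of 30 comes from the healthy range quoted above, so tune it to your own agents.

```typescript
// Sketch: flag an abnormal tool-call count and name the busiest tool.
interface ToolCallRecord {
  toolName: string;
  success: boolean;
}

interface ToolCallSummary {
  totalCalls: number;
  failures: number;
  anomalous: boolean; // true when totalCalls exceeds the healthy ceiling
  topTool: string | null; // most frequently called tool, null if no calls
  topCount: number;
}

function summarizeToolCalls(calls: ToolCallRecord[], healthyMax = 30): ToolCallSummary {
  // Tally calls per tool name.
  const byTool = new Map<string, number>();
  for (const call of calls) {
    byTool.set(call.toolName, (byTool.get(call.toolName) ?? 0) + 1);
  }

  // Pick the tool with the highest count.
  let topTool: string | null = null;
  let topCount = 0;
  for (const [tool, count] of Array.from(byTool)) {
    if (count > topCount) {
      topTool = tool;
      topCount = count;
    }
  }

  return {
    totalCalls: calls.length,
    failures: calls.filter((c) => !c.success).length,
    anomalous: calls.length > healthyMax,
    topTool,
    topCount,
  };
}
```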


Postmortem — A 30-Minute Lightweight Format That Actually Survives

Most developers know postmortem culture matters, but a full "5 Whys" plus a detailed timeline takes 3 hours to write, so the habit doesn't survive at indie scale. Here's the stripped-down template I keep coming back to.

# Postmortem: <Incident Title> (YYYY-MM-DD)
 
## Impact
- Duration: HH:MM - HH:MM (N minutes total)
- Users affected: ~N people / N% of total
- Money lost: $N (token consumption from Agent runaway)
- Trust lost: subjective 1-5
 
## Timeline (concise)
- HH:MM detected (alert: AgentFailureRateHigh)
- HH:MM triage complete (pattern: ToolTimeout)
- HH:MM mitigation applied (kill switch)
- HH:MM recovery confirmed
 
## Root Cause
3-5 lines. Capture both the technical cause and the structural reason it was allowed to happen.
 
## Direct Fixes
- [ ] Fix PR: #1234
- [ ] Deployed
 
## Prevention (track status: Done / WIP / TBD)
- [Done] Tightened alert threshold from 20% to 10%
- [WIP] Reduce tool_call_streak limit from 30 to 15
- [TBD] Add periodic Eval for the same pattern
 
## Runbook Updates
- [ ] Added "ToolTimeout cascade detection" to L2 patterns
- [ ] Checklist now includes "blast-radius check on the same tool"
 
## Lessons (3 lines)
- What we got wrong
- What we'll change next
- What I want my future self / teammates to know

The heart of this format is the "Lessons (3 lines)" section. Trying to write everything makes you write nothing. Constrain yourself to three lines so that, six months later, you can still recall what you were thinking that night.

Common Mistakes — Landmines I've Stepped On

A round-up of mistakes I personally made, so you don't have to take the same calls at 2am.

Pitfall 1: Sending Alerts to a Personal DM

Sending notifications to a personal DM means you sleep through them. Sending them to a shared channel means nobody owns them. Force a rotation from day one. Even if you're a one-person team, use the free tiers of PagerDuty or Opsgenie and enable phone + push notifications. Slack-only alerts will lose you the most painful incidents.

Pitfall 2: "Just Disable It" as a Kill Switch

Setting enabled = false and walking away tends to leave it that way until next week. Make TTL mandatory, with auto-recovery at 24 hours. If user impact reappears, that's a signal "the fix isn't ready" — and forcing yourself to extend the kill switch creates the right pressure to actually finish the fix.

Pitfall 3: Postponing Cost Alerts

"Let's get it running first and add monitoring later" is how you torch hundreds of dollars in one night. I've personally lost $312 to a runaway loop. The cumulative agentTokenSpend counter and the "page if more than $X in 1h" rule are non-negotiable on day one — even in the most minimal setup.

Pitfall 4: "Let Me Just Read All the Logs" During Triage

This is the mistake I regret most. Scrolling through 500 spans tells you nothing. Force yourself to look at only the first 3 spans, the last 3, and the failed ones. Trying to be exhaustive freezes your thinking.
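That rule is easy to turn into a filter over the span list. In this sketch `SpanLike` is a hypothetical minimal shape, and a span counts as failed when its status is anything other than "ok".

```typescript
// Sketch: keep only the spans worth reading during triage.
interface SpanLike {
  id: string;
  status: string; // "ok" for success; anything else is treated as a failure
}

function selectTriageSpans<T extends SpanLike>(spans: T[], edge = 3): T[] {
  // Keep the first `edge` spans, the last `edge` spans, and every failed span,
  // preserving the original order and never duplicating a span.
  return spans.filter(
    (span, i) => i < edge || i >= spans.length - edge || span.status !== "ok",
  );
}
```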

Pitfall 5: Trying to Write the "Perfect" Postmortem

The moment you decide "I'll write it properly," you stop writing it. Use the lightweight template above with a 30-minute timer; "imperfect but saved" is the only goal that matters. Edit it later if you want. Unwritten postmortems are equivalent to no postmortems, and the cost of repeating the same incident in 6 months is far higher.

Pitfall 6: Calling "Prevention" Done When the PR Merges

Merging a fix doesn't update the runbook. Always pair postmortem work with editing the relevant runbook section. I keep my runbook in Notion and the postmortem template has a mandatory "Runbook Updates" field exactly for this.

Wiring the Runbook into Manager Surface

A runbook lives or dies by how easily you can reach it during the actual incident. Antigravity's Manager Surface gives you a few hooks worth setting up before the first 2am call.

The two most useful hooks I've added are per-agent runbook links and trace-aware action buttons. The first puts a clickable runbook URL on every Agent's detail page. The second exposes mitigation actions (kill switch, throttle, model downgrade) as buttons that already know the agentId and current runId you're looking at.

// manager-surface/runbook-links.ts
// Register per-agent runbook URLs surfaced in Manager Surface.
import { ManagerSurface } from "@antigravity/manager-surface";
 
const surface = new ManagerSurface({ apiKey: process.env.ANTIGRAVITY_API_KEY! });
 
await surface.registerAgentMetadata({
  agentId: "code-reviewer",
  runbookUrl: "https://notion.so/team/runbook-code-reviewer",
  primaryOnCall: "@masaki",
  costBudgetUsdPerHour: 5,
  killSwitchEnabledBy: ["@masaki", "@oncall-rotation"],
});
 
await surface.registerAgentMetadata({
  agentId: "data-extractor",
  runbookUrl: "https://notion.so/team/runbook-data-extractor",
  primaryOnCall: "@masaki",
  costBudgetUsdPerHour: 2,
  killSwitchEnabledBy: ["@masaki"],
});
 
console.log("✅ Runbook links registered for all agents");

Once registered, your future self at 2am sees an "Open Runbook" button right next to the trace. That removes one more piece of cognitive load when you're least equipped to handle it.

Trace-aware action buttons matter even more. The pattern that has worked for me is to expose mitigation actions as Manager Surface custom commands, each one already wired to call the AgentMitigator library shown earlier. The result: kill switch becomes a single click on the trace you're already looking at, instead of a context switch into a terminal where you have to remember the right command flags.

Testing the Runbook with Game Days

Even a beautifully written runbook decays. APIs change, alerts get tweaked, the on-call rotation shifts. The cheapest insurance is a quarterly game day — a deliberate exercise where you simulate an incident and run through the full L0 -> L1 -> L2 -> L3 cycle.

I run game days as a 90-minute solo exercise. The format is simple:

Pick a past incident or invent a plausible scenario
Set a kitchen timer for 5 minutes and start at L1 with only the runbook
Note where the runbook fails to give you the next step within 30 seconds
After completing the simulation, edit the runbook to fix every gap you found

The result is that the runbook gradually becomes a near-perfect script. The first time I did this, my runbook had 11 gaps. By the third game day, it had 1.

If you have a teammate, pair game days are even better: one person plays "the system" by feeding fake alerts and trace data, the other plays "on-call." Switching roles keeps both people sharp on the operational pieces.

A surprising side effect: game days also surface gaps in your observability. If the runbook says "check the failure rate by tool" and the dashboard makes that hard to do, you've found a dashboard improvement, not just a runbook one.

Sizing Your Runbook to Your Stage

A common mistake is copying a Big Tech runbook template into a 1-person side project. The result is a 50-page document nobody reads. Right-size the runbook to where you are.

For solo indie projects with one or two production Agents, a single Notion page with the L0-L3 sections collapsed is enough. Keep it under 800 words total. The four alerts I described above plus the five mitigation patterns plus the lightweight postmortem template fit comfortably in that budget.

For a small team with five to ten Agents, split the runbook into a "global runbook" (incident process, severity definitions, postmortem template) and per-agent runbooks (the specific failure modes, mitigations, and on-call rotations for each Agent). The per-agent pages should still be short — under 500 words — and the global runbook should be the single source of truth for process.

For larger setups (dozens of Agents, multiple teams), invest in proper structured incident management: integrate PagerDuty or Opsgenie with Slack, use a tool like FireHydrant or incident.io to drive the process, and generate runbooks from code rather than maintaining free-form documents. At this scale, a static runbook becomes a liability; you want incident automation that already knows which Agent failed and pre-fills the relevant context.

Severity Definitions That Don't Lie

Most teams adopt P1/P2/P3 severity tiers but never write down what makes an incident a P1. The result is that everything becomes a P2 because nobody wants to be the person to call something P1, and then the real P1 gets lost in the noise.

Be precise. Below is the severity table I currently use for Antigravity Agents — adapt the numbers, but keep the structure.

## Severity Definitions
 
### P1 (page on-call immediately, 24/7)
- User-facing failure rate above 20% for any Agent for more than 5 minutes
- Cost burst above $50/hour for any single Agent
- Tool loop pegged at 100+ identical calls
- Data corruption suspected (Agent wrote bad data to a system of record)
 
### P2 (page during business hours, ticket overnight)
- User-facing failure rate 5-20% sustained for more than 30 minutes
- Cost burst $20-50/hour
- Eval median below 0.7 sustained for more than 2 hours
- Latency P95 above 30s sustained for more than 30 minutes
 
### P3 (ticket, fix in current sprint)
- Eval median 0.7-0.85 sustained for more than 24 hours
- Single failed Agent run with no propagation
- Cost slowly trending upward without spike
 
### P4 (backlog, fix when convenient)
- Cosmetic issues, log spam, deprecation warnings
- Performance degradation under 10% with no user impact

Notice that the rate-based thresholds (failure rate, eval score, latency) pair a magnitude with a duration. "Failure rate above 20%" alone would page you for every momentary blip. "Above 20% for more than 5 minutes" filters out noise while catching real fires. Apply the same pattern to every rate-based alert: if you don't have a duration, you don't have an alert, you have a paging machine.

Also notice that the P1 list explicitly includes cost and data corruption, not just availability. AI Agents fail in ways traditional services don't, and your severity tiers should reflect that.
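Writing the table down as code forces the thresholds to be unambiguous. The sketch below covers only the failure-rate and cost rows and uses the numbers from the table. The function names are hypothetical, and "none" stands for "does not page on this signal alone".

```typescript
// Sketch: the failure-rate and cost rows of the severity table as functions.
type Severity = "P1" | "P2" | "none";

// Failure rate needs both a magnitude and a duration before it pages.
function classifyFailureRate(ratePct: number, sustainedMinutes: number): Severity {
  if (ratePct > 20 && sustainedMinutes > 5) return "P1";
  if (ratePct >= 5 && sustainedMinutes > 30) return "P2";
  return "none";
}

// Cost is already expressed per hour, so the window is built into the unit.
function classifyCostBurst(usdPerHour: number): Severity {
  if (usdPerHour > 50) return "P1";
  if (usdPerHour >= 20) return "P2";
  return "none";
}

// When several signals fire at once, the incident takes the worst severity.
function worstSeverity(a: Severity, b: Severity): Severity {
  const rank: Record<Severity, number> = { P1: 0, P2: 1, none: 2 };
  return rank[a] <= rank[b] ? a : b;
}
```

A 38% failure rate that lasts one minute classifies as "none", while the same rate sustained for six minutes is a P1, which is exactly the blip filtering the durations are there for.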

Communication During the Incident

If you have any users at all, communication is half of incident response. The runbook should make it impossible to forget. I keep three communication checklists, one per phase.

When You First Confirm User Impact

Within 5 minutes of confirming user impact, post a status update somewhere users can see it: a status page, a pinned tweet, a banner inside the app. Don't wait until you understand the cause. The message can be as simple as:

We're investigating an issue affecting <feature>. We'll post the next update by HH:MM.

The "next update by HH:MM" is the most important part. It commits you to a follow-up and stops the "is anyone there?" anxiety from users.

During Mitigation

If mitigation takes more than 15 minutes, post an interim update:

We've identified the issue and are applying a fix. <feature> is currently <degraded/unavailable>.
We'll post the next update by HH:MM.

Resist the temptation to share root-cause guesses publicly until you're confident. "We think it might be the database" turns into "they said it was the database" which turns into a Twitter thread you can't take back.

After Recovery

Once recovery is confirmed, post a final status update:

The issue affecting <feature> has been resolved as of HH:MM. We'll publish a postmortem within <timeframe>.

Then actually write the postmortem within the promised timeframe. Public postmortems build trust precisely because most companies don't follow through on the promise. For an indie product, even a 200-word "what happened, what we changed" note builds disproportionate credibility.

Building a "Day 0 Incident Kit"

If you're starting from zero, the temptation is to ignore incident response entirely until you have your first incident. Don't. Here's the smallest possible kit that gives you a fighting chance, sized for a solo indie launch.

The four hours to spend before your first production deployment are: one hour wiring up the four alert categories from earlier (with the cost alert being the most important), one hour writing the L1 triage checklist tailored to your specific Agent, one hour creating the lightweight postmortem template in Notion or wherever you write, and one hour setting up phone-capable on-call notifications (PagerDuty's free tier or Opsgenie's free tier both work).

This single afternoon will save you from the most common solo-developer disaster: a runaway Agent that costs you a thousand dollars overnight while you sleep. I've watched several friends learn this lesson the expensive way. The Day 0 kit is the cheapest insurance policy in indie AI Agent development.

Connecting the Runbook to Your Eval Harness

A subtle but high-leverage move: hook your runbook into your eval harness so that fixes prove themselves before they ship. Every postmortem should include a "regression eval" — a small set of inputs that would have triggered the original failure — added to the harness as part of the fix.

The mechanics are simple. After a postmortem identifies the failure mode, you write 5-20 representative inputs that would have caused the original incident. Those inputs go into a regression-evals/ folder, named after the postmortem (e.g., regression-evals/2026-04-28-tool-timeout-loop.json). Your CI runs the regression evals on every PR; a regression in pass rate blocks merges.

The result is that your runbook indirectly grows your test suite. Each incident strengthens the safety net that catches the next one. Over a year, this produces a defense-in-depth that no upfront test plan could match. I find this is one of the most underrated benefits of taking incident response seriously: it's not just about the current fire, it's about systematically reducing the cone of possible future fires.

A specific anti-pattern to avoid: don't add the failing input as a single test case and call it done. Real failure modes are clusters, not points. Spend 15 minutes generating 10-20 nearby inputs (different prompts, edge cases, length variations) and add the whole cluster. Otherwise you've trained your Agent to pass exactly one test, while the underlying weakness remains.
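The CI gate itself can be very small. This sketch assumes your harness already produces one pass/fail result per regression case. `EvalResult`, `passRate`, and `shouldBlockMerge` are hypothetical names, and treating an empty suite as passing is a choice you may want to reverse.

```typescript
// Sketch: block a merge when the regression-eval pass rate drops.
interface EvalResult {
  caseId: string; // e.g. a case from regression-evals/<postmortem-name>.json
  passed: boolean;
}

function passRate(results: EvalResult[]): number {
  // An empty suite is treated as passing here; a stricter gate could throw instead.
  if (results.length === 0) return 1;
  return results.filter((r) => r.passed).length / results.length;
}

// baseline: pass rate recorded on the main branch.
// tolerance: allowed drop before the gate trips (0 means any regression blocks).
function shouldBlockMerge(baseline: number, results: EvalResult[], tolerance = 0): boolean {
  return passRate(results) < baseline - tolerance;
}
```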

Worked Example: "Slack Goes Off at 2am" Walkthrough

Let me run through everything above with a real scenario from my own logs. Use it as a stress test for your own runbook.

02:14 - PagerDuty notification on Slack
  [P1] Agent code-reviewer failure rate 38% in 5min

02:14 - Acknowledged on phone
02:15 - Opened Manager Surface, fetched latest runIds for agent_id=code-reviewer
02:16 - Reviewed top 3 traces -> all show same error: "ToolTimeout: github.create_review"
02:18 - Triage complete: pattern=ToolTimeout, user_impact=yes (CI is blocked)

02:19 - L2-A: applied timeout-relaxation flow
   $ npx tsx ops/mitigate.ts \
       --agent code-reviewer \
       --pattern tool-timeout \
       --duration 60
   -> Bumped timeout from 30s to 90s temporarily

02:21 - Failure rate dropped from 38% to 4%, recovery confirmed
02:22 - Posted "[L2 done] mitigated, monitoring" to Slack
02:23 - Created next-day ticket for permanent fix (GitHub API rate limiting)
02:25 - Back to bed

Without a runbook, this same incident takes me 45 minutes on average. With one, 11 minutes. Those 34 minutes are both the duration of user impact and the duration of lost sleep.

Anti-Patterns From Real Postmortems

Beyond the pitfalls section, here are anti-patterns that show up repeatedly in postmortems I've read across teams running Antigravity Agents in production. Each one cost someone a real outage.

The first anti-pattern is building one universal Agent for everything. The instinct to consolidate is reasonable: fewer agents seem easier to operate. The reality is the opposite. A single Agent that does code review, customer support, and data extraction has a runbook that's three runbooks crammed into one, and every incident requires figuring out which "mode" was failing. Split your Agents along clear functional boundaries even if it means more deployment surface. The blast radius of a failure stays contained, and the runbook stays focused.

The second is trusting the model name as a proxy for behavior. "We use Gemini 3.1 Pro" tells you almost nothing about how your Agent will behave under load. The same model with a slightly different system prompt, temperature, or tool set produces dramatically different failure modes. Pin your runbook to the actual Agent configuration, not to the underlying model. When you upgrade models, treat it as a configuration change that requires its own observation period before you trust the existing alerts.

The third is using human-readable Agent IDs that change. I've seen teams rename Agents during refactors and lose every alert mapping in the process. Pick an opaque, never-changing internal ID for each Agent (UUID or short hash) and let the human-readable name be a metadata field. This way, history-based queries like "how often did this Agent fail in the last quarter?" still work after the team renames it.
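A minimal sketch of that separation, assuming a hypothetical in-memory `AgentRegistry` and an `alertRules` table keyed by opaque ID (the ID `agt_7f3a9c` and the rule names are made up for illustration):

```typescript
interface AgentRecord {
  id: string; // opaque and permanent, e.g. a UUID or short hash
  displayName: string; // human-readable metadata, free to change
}

class AgentRegistry {
  private agents = new Map<string, AgentRecord>();

  register(id: string, displayName: string): void {
    if (this.agents.has(id)) throw new Error(`Agent id already registered: ${id}`);
    this.agents.set(id, { id, displayName });
  }

  // Renaming touches only the metadata field; the id never changes.
  rename(id: string, displayName: string): void {
    const agent = this.agents.get(id);
    if (agent === undefined) throw new Error(`Unknown agent id: ${id}`);
    agent.displayName = displayName;
  }

  displayNameOf(id: string): string {
    const agent = this.agents.get(id);
    if (agent === undefined) throw new Error(`Unknown agent id: ${id}`);
    return agent.displayName;
  }
}

// Alert mappings and history are keyed by the opaque id, never by the name.
const alertRules: Record<string, string[]> = {
  agt_7f3a9c: ["failure-rate", "cost"],
};

const registry = new AgentRegistry();
registry.register("agt_7f3a9c", "code-reviewer");
registry.rename("agt_7f3a9c", "pr-reviewer"); // alertRules is unaffected
```

Because every lookup goes through the ID, a rename during a refactor changes what humans see in dashboards without invalidating alert mappings or quarter-long failure history.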

The fourth is alerting on individual model API errors. Models will return errors. Networks will hiccup. If you alert on every single error, you'll deafen yourself to actual incidents. Always alert on aggregates (failure rates over a window, cost over a window, latency percentiles over a window), never on a single event. The apparent exception is the loop detector: it fires on a counter pattern, but that counter is itself evaluated over a window.
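To make the aggregate rule concrete, here is a small sketch of a windowed failure-rate check. The `RunEvent` and `WindowRule` shapes, the `shouldPage` name, and the numbers in the usage note are illustrative assumptions, not values from a real alerting system.

```typescript
interface RunEvent {
  timestampMs: number;
  ok: boolean;
}

interface WindowRule {
  windowMs: number; // how far back to look
  threshold: number; // failure rate (0..1) at which to page
  minSamples: number; // below this many runs, never page
}

// Failure rate over the window ending at nowMs, or null if no runs fell in it.
function failureRate(events: RunEvent[], nowMs: number, windowMs: number): number | null {
  const recent = events.filter(
    (e) => e.timestampMs > nowMs - windowMs && e.timestampMs <= nowMs,
  );
  if (recent.length === 0) return null;
  const failures = recent.filter((e) => !e.ok).length;
  return failures / recent.length;
}

// Page only on an aggregate: enough samples AND a rate at or above threshold.
function shouldPage(events: RunEvent[], nowMs: number, rule: WindowRule): boolean {
  const recent = events.filter(
    (e) => e.timestampMs > nowMs - rule.windowMs && e.timestampMs <= nowMs,
  );
  if (recent.length < rule.minSamples) return false;
  const rate = failureRate(events, nowMs, rule.windowMs);
  return rate !== null && rate >= rule.threshold;
}
```

With a rule such as `{ windowMs: 5 * 60_000, threshold: 0.3, minSamples: 5 }`, a single failed run can never page on its own. The `minSamples` floor is what prevents one error out of one run from reading as a 100% failure rate.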

The fifth is postmortems that blame people. The whole point of a blameless postmortem is to surface the systemic conditions that allowed the failure. "Masaki forgot to set the rate limit" is useless. "The deployment template had no required field for rate limits, so it was easy to forget" is actionable. Train yourself, and your team if you have one, to convert blame statements into systemic statements. The runbook updates that come out of this reframing are usually the highest-leverage changes you'll make.

Iteration: How the Runbook Should Evolve

The first version of your runbook will be wrong. That's expected. What matters is the iteration discipline.

Every postmortem should produce at least one runbook delta. If a postmortem doesn't change the runbook, you missed something. Either the runbook already covered this case (in which case why did it happen?), or you didn't dig deep enough into the systemic causes. Make "runbook delta" a required field in the postmortem template and treat its absence as a smell.
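One way to enforce the required field is a small validation step that runs before a postmortem can be marked complete. The `PostmortemDraft` fields and the `validatePostmortem` name below are assumptions for illustration; adapt them to whatever your template actually contains.

```typescript
// Hypothetical postmortem template fields.
interface PostmortemDraft {
  id: string;
  summary?: string;
  systemicCauses?: string[];
  runbookDelta?: string; // what changed in the runbook as a result
}

// Returns a list of problems; an empty list means the draft can be closed.
function validatePostmortem(draft: PostmortemDraft): string[] {
  const problems: string[] = [];
  if (!draft.summary || draft.summary.trim() === "") {
    problems.push("summary is empty");
  }
  if (!draft.systemicCauses || draft.systemicCauses.length === 0) {
    problems.push("no systemic cause listed");
  }
  // Treat placeholder answers the same as a missing delta.
  const delta = (draft.runbookDelta ?? "").trim().toLowerCase();
  if (delta === "" || delta === "none" || delta === "n/a") {
    problems.push("runbook delta is missing: explain what the runbook should have covered");
  }
  return problems;
}
```

Rejecting "none" and "n/a" is deliberate: it forces the author to either name the runbook change or explain why the existing runbook did not prevent the incident.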

Quarterly, do a runbook health check: walk through every section and ask "would I actually find this useful at 2am?" Any section you wouldn't trust gets rewritten or deleted. A 200-word runbook you trust beats a 2000-word runbook you don't.

Annually, archive postmortems older than 12 months into a separate index and reread them. Many of the issues from a year ago will have been solved by infrastructure improvements you've made since. Some will have quietly resurfaced. The annual reread catches both, and gives you data for your next architectural priorities.

What to Do This Week

If you've read this far, here's a concrete sequence to put the runbook into practice over the next seven days.

On day one, set up the cost alert. This is the highest-impact 30 minutes you'll spend all week. Use the PromQL example from earlier and connect it to whatever paging service you have, or to your phone via PagerDuty's free tier if you're starting from nothing. The goal is to be woken up before a runaway Agent costs more than rebuilding the entire project would.

On days two and three, write the L1 triage checklist for your single most critical Agent. Tailor it to the specific tools that Agent uses and the specific Manager Surface views you'd consult. Time-box it to two hours per day; perfection is the enemy of having anything at all.

On day four, add the four other alert categories from earlier (failure rate, loop detection, eval quality, latency). Verify each one fires by deliberately triggering it in a test environment.

On day five, write the lightweight postmortem template into your team wiki or Notion. Run a 30-minute retrospective on the most recent issue you've experienced (even a minor one) using the template, just to test it.

On days six and seven, run a 90-minute solo game day using the template. Pick a plausible scenario, set the timer, and walk through L0 -> L1 -> L2 -> L3 with only the runbook in front of you. Record every gap. Spend the rest of the time fixing those gaps.

After this single week, you'll have more incident response infrastructure than most indie AI Agent products. Your future self at 2am will thank you for the investment.

Wrap-Up — One Thing You Can Do Today

If you're going to do exactly one thing before tomorrow morning, set up cost alerts. Add the agentTokenSpend cumulative counter and a "page if more than $20 in 1 hour" rule to your production Agents. The full runbook is a multi-week project, but cost alerts take 30 minutes and prevent the largest financial damage.
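If you are not running Prometheus yet, the same rule can be approximated in application code. The sketch below assumes you periodically sample the agentTokenSpend counter and that the samples are already converted to US dollars; the `SpendSample` shape and both function names are hypothetical. It mirrors how a counter increase is usually computed, including tolerance for counter resets after a process restart.

```typescript
interface SpendSample {
  timestampMs: number;
  cumulativeUsd: number; // assumption: counter value already converted to USD
}

// Spend accumulated inside the window ending at nowMs.
function spendInWindow(samples: SpendSample[], nowMs: number, windowMs: number): number {
  const inWindow = samples
    .filter((s) => s.timestampMs >= nowMs - windowMs && s.timestampMs <= nowMs)
    .sort((a, b) => a.timestampMs - b.timestampMs);
  let total = 0;
  for (let i = 1; i < inWindow.length; i++) {
    const delta = inWindow[i].cumulativeUsd - inWindow[i - 1].cumulativeUsd;
    // A negative delta means the counter reset; count the new value from zero.
    total += delta >= 0 ? delta : inWindow[i].cumulativeUsd;
  }
  return total;
}

// "Page if more than $20 in 1 hour", with both values overridable.
function shouldPageOnCost(
  samples: SpendSample[],
  nowMs: number,
  limitUsd = 20,
  windowMs = 60 * 60 * 1000,
): boolean {
  return spendInWindow(samples, nowMs, windowMs) > limitUsd;
}
```

Note that at least two samples must fall inside the window before any spend is measured, so sample the counter at a short interval (every minute or so) to keep the alert responsive.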

The runbook framework I shared here took me over a year to refine in production. Don't aim for perfect: work through L0 -> L1 -> L2 -> L3 one tier at a time, and your nights will get quieter.

If you want to deepen the design, Antigravity Agent SRE: SLO and Error Budget Design and AI Agent Error Recovery and Resilient Pipeline Design pick up exactly where this article ends. To put a real tracing backbone behind the code samples, pair them with Antigravity OpenTelemetry: AI Observability Pipeline.
