Designing Production Incident Runbooks for Antigravity Agents: A Practical Framework from Detection to Recovery
A complete guide to designing incident runbooks for production Antigravity Agents — detection, triage, mitigation, and postmortem, with working code you can drop into your stack today.
"Slack went off at 2am, and my Antigravity Agent had hammered the same tool 200 times in a row before getting stuck." That actually happened to me. Half-asleep, I opened Manager Surface and burned 30 minutes just figuring out which trace to look at first. With a runbook in place, I could have reached the root cause in 5 minutes.
Operating AI Agents in production produces a different shape of failure than typical web services. CPU is healthy, the API returns 200, and yet "the thinking is broken." This article hands you the complete Antigravity-specific runbook framework I've sharpened across multiple products, code and all.
The design scales from solo indie projects to dozens of agents in production. By the end, you'll have concrete steps to triage within 5 minutes of the first alert and minimize user impact.
Why Antigravity Agents Need a Dedicated Runbook
Traditional web service runbooks are built around external signals: traffic spikes, broken database connections. Antigravity Agent failures look different.
"Working but wrong" failures are common: tool calls succeed, APIs return 200, but the output fails the business requirement
Failures don't propagate cleanly: in multi-agent setups, a Worker Agent can be broken while the Manager Agent reports "no issues"
Cost itself is an incident signal: by the time token consumption spikes, hundreds of dollars in damage may already be done
Reproducibility is low: identical inputs often refuse to reproduce the bug on a second run
In other words, runbooks built around HTTP status and CPU don't catch what matters. You need a flow centered on Antigravity-specific identifiers like runId, traceId, and agentSpanId.
When I started, I tried reusing generic SRE runbook templates and ended up burning brain cycles every incident on "which logs do I look at?" — fatal at 2am. Switching to an Agent-specific framework brought my median triage time from 18 minutes down to 4.
The Four-Tier Runbook Model — Lightweight Enough for Indie Devs
What I landed on is a four-tier structure. Heavy processes don't survive, so I optimized for what a solo developer can sustain.
L0: Detection. The layer that first notices something is wrong. It centralizes alert definitions and triggers.
L1: Triage. Within 5 minutes, decide "user-facing yes/no" and "self-healing yes/no".
L2: Mitigation. Immediate actions to stop user impact: kill switch, fallback, traffic shedding.
L3: Recovery. Once the bleeding stops, trace the root cause from the traceId and ship a permanent fix.
Each tier has its own checklist and code snippets. Nobody thinks clearly at 2am, so the runbook needs to do the thinking for you.
L0: Detection — Four Distinct Alert Categories
Start by separating Antigravity Agent metrics into four buckets. Mixing them on one dashboard guarantees you'll miss things.
```typescript
// monitoring/agent-alerts.ts
// Antigravity Agent metric definitions (prom-client, queried with PromQL)
import { Counter, Gauge, Histogram } from "prom-client";

// 1. Liveness: is the Agent running at all?
export const agentRunCount = new Counter({
  name: "antigravity_agent_run_total",
  help: "Total number of Agent runs by status",
  labelNames: ["agent_id", "status"], // status: success | failure | timeout
});

// 2. Quality: is the output meeting business requirements?
export const agentEvalScore = new Histogram({
  name: "antigravity_agent_eval_score",
  help: "Eval score (0-1) for Agent outputs",
  labelNames: ["agent_id", "eval_type"],
  buckets: [0.5, 0.7, 0.8, 0.9, 0.95],
});

// 3. Cost: are tokens within budget?
export const agentTokenSpend = new Counter({
  name: "antigravity_agent_token_spend_usd",
  help: "Cumulative USD spend per Agent",
  labelNames: ["agent_id", "model"],
});

// 4. Loop: is the Agent hammering the same tool?
export const agentToolCallStreak = new Gauge({
  name: "antigravity_agent_tool_call_streak",
  help: "Consecutive identical tool calls (potential loop)",
  labelNames: ["agent_id", "tool_name"],
});
```
The hard-won lesson from my own work: if you don't design cost alerts first, you can't undo the damage. I have personally torched $300 in a single night. Build a cumulative agentTokenSpend counter and a "page if more than $X per hour" rule into day one of your deployment.
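As a belt-and-braces check next to the paging rule, the "more than $X per hour" idea can also be enforced inside the application. The following is a minimal sketch under stated assumptions: `HourlySpendGuard`, `tokenCostUsd`, and the per-million-token prices are hypothetical names and inputs chosen for illustration, not part of any Antigravity API.

```typescript
// Hypothetical sliding-window spend guard; names and prices are illustrative assumptions.
type SpendEvent = { atMs: number; usd: number };

class HourlySpendGuard {
  private events: SpendEvent[] = [];
  private readonly limitUsd: number;
  private readonly windowMs: number;

  constructor(limitUsd: number, windowMs: number = 60 * 60 * 1000) {
    this.limitUsd = limitUsd;
    this.windowMs = windowMs;
  }

  // Record the cost of one model call; returns true when the window limit is exceeded.
  record(usd: number, nowMs: number): boolean {
    this.events.push({ atMs: nowMs, usd });
    return this.spentInWindow(nowMs) > this.limitUsd;
  }

  // Sum spend over the trailing window, dropping events that have aged out.
  spentInWindow(nowMs: number): number {
    const cutoff = nowMs - this.windowMs;
    this.events = this.events.filter((e) => e.atMs > cutoff);
    return this.events.reduce((sum, e) => sum + e.usd, 0);
  }
}

// Convert token counts to USD using assumed per-million-token prices.
function tokenCostUsd(
  inputTokens: number,
  outputTokens: number,
  usdPerMillionIn: number,
  usdPerMillionOut: number,
): number {
  return (inputTokens / 1e6) * usdPerMillionIn + (outputTokens / 1e6) * usdPerMillionOut;
}
```

A caller would feed `tokenCostUsd(...)` into `guard.record(cost, Date.now())` after each model call and stop the run when it returns `true`. The clock is passed in explicitly so the guard can be exercised in tests without waiting an hour.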
Here's a corresponding PromQL alert ruleset in the standard Prometheus rule-file format, which Prometheus-compatible backends such as Grafana Cloud accept.
```yaml
# monitoring/alerts.yml
# Production alert rules for Antigravity Agents
groups:
  - name: antigravity_agent_alerts
    interval: 30s
    rules:
      # 1. Liveness: failure rate > 20% over 5 minutes
      - alert: AgentFailureRateHigh
        expr: |
          (
            sum(rate(antigravity_agent_run_total{status="failure"}[5m])) by (agent_id)
            /
            sum(rate(antigravity_agent_run_total[5m])) by (agent_id)
          ) > 0.2
        for: 5m
        labels:
          severity: page
          runbook: agent-failure-rate
        annotations:
          summary: "Agent {{ $labels.agent_id }} failure rate > 20%"

      # 2. Cost: more than $20 in one hour (indie scale)
      - alert: AgentTokenSpendBurst
        expr: |
          increase(antigravity_agent_token_spend_usd[1h]) > 20
        for: 5m
        labels:
          severity: page
          runbook: agent-cost-burst
        annotations:
          summary: "Agent {{ $labels.agent_id }} burned ${{ $value }} in 1h"

      # 3. Loop: same tool called 30 times in a row
      - alert: AgentToolLoopDetected
        expr: antigravity_agent_tool_call_streak > 30
        for: 1m
        labels:
          severity: page
          runbook: agent-tool-loop
        annotations:
          summary: "Agent {{ $labels.agent_id }} stuck on {{ $labels.tool_name }}"

      # 4. Quality: median eval score below 0.7 over 1 hour
      - alert: AgentQualityDegraded
        expr: |
          histogram_quantile(0.5,
            sum(rate(antigravity_agent_eval_score_bucket[1h])) by (le, agent_id)) < 0.7
        for: 15m
        labels:
          severity: ticket
          runbook: agent-quality-drop
        annotations:
          summary: "Agent {{ $labels.agent_id }} median eval score < 0.7"
```
severity: page interrupts the on-call immediately; severity: ticket is fine for next business day. Marking everything as page is how you guarantee you'll miss the one that matters. I learned the hard way to draw that line carefully.
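The loop gauge needs something to compute the streak it reports. One possible way is sketched below, assuming "identical" means the same tool name with the same JSON-serialized arguments; `ToolStreakTracker` is a hypothetical helper, and you would create one instance per Agent run.

```typescript
// Hypothetical helper: tracks consecutive identical tool calls within one Agent run.
class ToolStreakTracker {
  private lastKey = "";
  private streak = 0;

  // Call once per tool invocation; returns the current streak length.
  // Assumption: two calls are "identical" when tool name and serialized args match.
  observe(toolName: string, args: unknown): number {
    const key = `${toolName}:${JSON.stringify(args)}`;
    this.streak = key === this.lastKey ? this.streak + 1 : 1;
    this.lastKey = key;
    return this.streak;
  }
}
```

The returned value is what you would write to the `antigravity_agent_tool_call_streak` gauge after every tool call. Note that `JSON.stringify` is sensitive to key order, so arguments built in a different order count as different calls in this sketch.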
L1: The 5-Minute Triage Protocol
When the alert fires, you want to answer three questions within 5 minutes:
Is there user impact? (If yes, jump to L2 immediately.)
Is automatic recovery possible? (If yes, retry first.)
Is there blast-radius risk? (Other agents, downstream services.)
The Markdown template below distills this flow. Drop it into Slack Workflow Builder or Notion.
```markdown
## Triage Checklist (complete within 5 minutes)

- [ ] Capture the **alert ID** (e.g., `AgentFailureRateHigh-2026-04-28-02-15`)
- [ ] In Manager Surface, open the top 3 `runId`s for the affected `agent_id`
- [ ] Identify the error pattern:
  - [ ] `ToolTimeout` -> L2-A: timeout-relaxation flow
  - [ ] `RateLimitExceeded` -> L2-B: backoff-extension flow
  - [ ] `MaxIterationsReached` -> L2-C: loop-detection flow
  - [ ] `EvalScoreDrop` -> L2-D: model-rollback flow
  - [ ] Other -> L2-E: consider firing the kill switch
- [ ] Search the `is_user_facing` field with `cmd+K`:
  - "yes": fire L2 within 60 seconds
  - "no": move to L3 within 15 minutes
- [ ] Check blast radius:
  - [ ] Failure rate of other agents using the same tool (PromQL: `sum by (agent_id) (rate(antigravity_agent_run_total{status="failure"}[5m]))`, compared across the agents that share the tool)
  - [ ] Response time of downstream services (DB, third-party APIs)
- [ ] Post `[L1 done] agent_id=xxx pattern=ToolTimeout user_impact=yes` to **#incidents-agent**
```
I refined this checklist by literally running 5-minute drills against past incidents. The trick is to write specific GUI actions like "open the top 3 runIds in Manager Surface." Vague advice like "check the logs" is useless when you're half-asleep.
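The routing part of the checklist is mechanical enough to encode. Here is a minimal sketch that maps an error pattern and the user-impact answer to the next step and its deadline; the function and type names are hypothetical, while the pattern-to-flow table and the 60-second and 15-minute deadlines come straight from the checklist.

```typescript
// Hypothetical encoding of the L1 checklist's routing rules.
type Flow = "L2-A" | "L2-B" | "L2-C" | "L2-D" | "L2-E";

type TriageDecision = {
  flow: Flow;
  nextStep: "fire-L2" | "move-to-L3";
  deadlineSeconds: number;
};

const FLOW_BY_PATTERN = new Map<string, Flow>([
  ["ToolTimeout", "L2-A"], // timeout-relaxation flow
  ["RateLimitExceeded", "L2-B"], // backoff-extension flow
  ["MaxIterationsReached", "L2-C"], // loop-detection flow
  ["EvalScoreDrop", "L2-D"], // model-rollback flow
]);

// Unknown patterns fall through to L2-E (consider the kill switch).
function triage(pattern: string, isUserFacing: boolean): TriageDecision {
  const flow = FLOW_BY_PATTERN.get(pattern) ?? "L2-E";
  return isUserFacing
    ? { flow, nextStep: "fire-L2", deadlineSeconds: 60 }
    : { flow, nextStep: "move-to-L3", deadlineSeconds: 15 * 60 };
}

// Build the status line the checklist asks you to post when L1 is done.
function l1DoneMessage(agentId: string, pattern: string, isUserFacing: boolean): string {
  return `[L1 done] agent_id=${agentId} pattern=${pattern} user_impact=${isUserFacing ? "yes" : "no"}`;
}
```

Wiring this into a chat command means the half-asleep version of you only has to supply the pattern and a yes/no answer.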
L2: A Mitigation Library — Kill Switch in One Line
If user impact is happening, stop the bleeding before investigating. Have these five mitigation patterns ready for Antigravity Agents.
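To make the kill-switch pattern concrete, here is a minimal sketch of one mitigation with a mandatory TTL and a best-effort notification. The class name `AgentMitigator` is borrowed from later in this article, but its real interface is not shown here, so everything below (the `FlagStore` map, the `Notifier` callback, the method names) is an assumption for illustration.

```typescript
// Hypothetical sketch: AgentMitigator's real interface is not shown in this article.
// FlagStore and Notifier stand in for your flag backend and your Slack client.
type KillSwitchEntry = { agentId: string; reason: string; expiresAtMs: number };
type FlagStore = Map<string, KillSwitchEntry>;
type Notifier = (message: string) => Promise<void>;

const MAX_TTL_MS = 24 * 60 * 60 * 1000; // auto-recovery is forced within 24 hours

// Reject missing, non-positive, or over-long TTLs so a switch cannot stay on forever.
function validateTtl(ttlMs: number): void {
  if (!Number.isFinite(ttlMs) || ttlMs <= 0 || ttlMs > MAX_TTL_MS) {
    throw new Error("ttlMs is mandatory and must be between 1 ms and 24 hours");
  }
}

class AgentMitigator {
  private readonly store: FlagStore;
  private readonly notify: Notifier;
  private readonly now: () => number;

  constructor(store: FlagStore, notify: Notifier, now: () => number = () => Date.now()) {
    this.store = store;
    this.notify = notify;
    this.now = now;
  }

  // Disable an agent until the TTL expires. The flag is written before any
  // notification is attempted, so a chat outage cannot block the mitigation.
  async killSwitch(agentId: string, reason: string, ttlMs: number): Promise<KillSwitchEntry> {
    validateTtl(ttlMs);
    const entry: KillSwitchEntry = { agentId, reason, expiresAtMs: this.now() + ttlMs };
    this.store.set(agentId, entry);
    try {
      await this.notify(`[L2] kill switch on ${agentId} (${reason})`);
    } catch {
      // Best-effort only: swallow notification errors and keep the mitigation in place.
    }
    return entry;
  }

  // An agent counts as disabled only while an unexpired entry exists.
  isDisabled(agentId: string): boolean {
    const entry = this.store.get(agentId);
    return entry !== undefined && entry.expiresAtMs > this.now();
  }
}
```

In this sketch the Agent's entry point would call `isDisabled(agentId)` before each run, and extending the outage means calling `killSwitch` again with a fresh TTL.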
One deliberate design choice in the mitigation code: don't let a Slack failure abort the mitigation. Slack itself going down during an incident is more common than you'd think. Wrap the notification in try/catch and let the actual mitigation always complete.
The mandatory TTL is equally important. It prevents the classic mistake of manually setting enabled = false and forgetting about it until next week. Force automatic recovery within 24 hours; if you need longer, you have to make a deliberate choice to extend.
L3: Recovery — Tracing Root Cause from the Trace ID
Once the bleeding stops, you investigate root cause and ship a permanent fix. With Antigravity Agents, the traceId is your strongest weapon: you can replay the "trail of thought."
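A small summarizer keeps you from scrolling through every span. The sketch below assumes you have already exported the spans for a traceId into a simple array; the `Span` shape and function names are hypothetical, and the threshold of 30 reflects the healthy range for tool calls per run.

```typescript
// Hypothetical span shape for illustration; a real trace export may use different fields.
type Span = {
  spanId: string;
  kind: "tool" | "model" | "other";
  name: string;
  status: "ok" | "error";
};

type TraceSummary = {
  toolCalls: number;
  anomalous: boolean;
  firstThree: Span[];
  lastThree: Span[];
  failed: Span[];
};

const HEALTHY_MAX_TOOL_CALLS = 30; // healthy runs usually sit between 5 and 30 calls

// Reduce a trace to the parts worth reading: the first 3, the last 3, and failed spans.
function summarizeTrace(spans: Span[]): TraceSummary {
  const toolCalls = spans.filter((s) => s.kind === "tool").length;
  return {
    toolCalls,
    anomalous: toolCalls > HEALTHY_MAX_TOOL_CALLS,
    firstThree: spans.slice(0, 3),
    lastThree: spans.slice(-3),
    failed: spans.filter((s) => s.status === "error"),
  };
}

// Render the headline numbers as plain text for a terminal or a chat message.
function formatSummary(traceId: string, summary: TraceSummary): string {
  return [
    `Trace: ${traceId}`,
    `Tool calls: ${summary.toolCalls}${summary.anomalous ? " (above healthy range)" : ""}`,
    `Failed spans: ${summary.failed.length}`,
  ].join("\n");
}
```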
Seeing Tool calls: 247 immediately flags an anomaly. Healthy runs usually sit between 5 and 30 calls.
Postmortem — A 30-Minute Lightweight Format That Actually Survives
Most developers know postmortem culture matters — but writing the full "5 Whys" plus a detailed timeline takes 3 hours, so it doesn't survive the indie scale. Here's the stripped-down template I keep coming back to.
```markdown
# Postmortem: <Incident Title> (YYYY-MM-DD)

## Impact
- Duration: HH:MM - HH:MM (N minutes total)
- Users affected: ~N people / N% of total
- Money lost: $N (token consumption from Agent runaway)
- Trust lost: subjective 1-5

## Timeline (concise)
- HH:MM detected (alert: AgentFailureRateHigh)
- HH:MM triage complete (pattern: ToolTimeout)
- HH:MM mitigation applied (kill switch)
- HH:MM recovery confirmed

## Root Cause
3-5 lines. Capture both the technical cause and the structural reason it was allowed to happen.

## Direct Fixes
- [ ] Fix PR: #1234
- [ ] Deployed

## Prevention (track status as Done / WIP / TBD)
- [Done] Tightened alert threshold from 20% to 10%
- [WIP] Reduce tool_call_streak limit from 30 to 15
- [TBD] Add periodic Eval for the same pattern

## Runbook Updates
- [ ] Added "ToolTimeout cascade detection" to L2 patterns
- [ ] Checklist now includes "blast-radius check on the same tool"

## Lessons (3 lines)
- What we got wrong
- What we'll change next
- What I want my future self / teammates to know
```
The heart of this format is the "Lessons (3 lines)" section. Trying to write everything makes you write nothing. Constrain yourself to three lines so that, six months later, you can still recall what you were thinking that night.
Common Mistakes — Landmines I've Stepped On
A round-up of mistakes I personally made, so you don't have to take the same calls at 2am.
Pitfall 1: Sending Alerts to a Personal DM
Sending notifications to a personal DM means you sleep through them. Sending them to a shared channel means nobody owns them. Force a rotation from day one. Even if you're a one-person team, use the free tiers of PagerDuty or Opsgenie and enable phone + push notifications. Slack-only alerts will lose you the most painful incidents.
Pitfall 2: "Just Disable It" as a Kill Switch
Setting enabled = false and walking away tends to leave it that way until next week. Make TTL mandatory, with auto-recovery at 24 hours. If user impact reappears, that's a signal "the fix isn't ready" — and forcing yourself to extend the kill switch creates the right pressure to actually finish the fix.
Pitfall 3: Postponing Cost Alerts
"Let's get it running first and add monitoring later" is how you torch hundreds of dollars in one night. I've personally lost $312 to a runaway loop. The cumulative agentTokenSpend counter and the "page if more than $X in 1h" rule are non-negotiable on day one — even in the most minimal setup.
Pitfall 4: "Let Me Just Read All the Logs" During Triage
This was my biggest lesson. Scrolling through 500 spans tells you nothing. Force yourself to look at the first 3, the last 3, and only the failed spans. Trying to be exhaustive freezes your thinking.
Pitfall 5: Trying to Write the "Perfect" Postmortem
The moment you decide "I'll write it properly," you stop writing it. Use the lightweight template above with a 30-minute timer; "imperfect but saved" is the only goal that matters. Edit it later if you want. Unwritten postmortems are equivalent to no postmortems, and the cost of repeating the same incident in 6 months is far higher.
Pitfall 6: Calling "Prevention" Done When the PR Merges
Merging a fix doesn't update the runbook. Always pair postmortem work with editing the relevant runbook section. I keep my runbook in Notion and the postmortem template has a mandatory "Runbook Updates" field exactly for this.
Wiring the Runbook into Manager Surface
A runbook lives or dies by how easily you can reach it during the actual incident. Antigravity's Manager Surface gives you a few hooks worth setting up before the first 2am call.
The two most useful hooks I've added are per-agent runbook links and trace-aware action buttons. The first puts a clickable runbook URL on every Agent's detail page. The second exposes mitigation actions (kill switch, throttle, model downgrade) as buttons that already know the agentId and current runId you're looking at.
Once a runbook link is registered, your future self at 2am sees an "Open Runbook" button right next to the trace. That removes one more piece of cognitive load when you're least equipped to handle it.
Trace-aware action buttons matter even more. The pattern that has worked for me is to expose mitigation actions as Manager Surface custom commands, each one already wired to call the AgentMitigator library shown earlier. The result: kill switch becomes a single click on the trace you're already looking at, instead of a context switch into a terminal where you have to remember the right command flags.
Testing the Runbook with Game Days
Even a beautifully written runbook decays. APIs change, alerts get tweaked, the on-call rotation shifts. The cheapest insurance is a quarterly game day — a deliberate exercise where you simulate an incident and run through the full L0 -> L1 -> L2 -> L3 cycle.
I run game days as a 90-minute solo exercise. The format is simple:
Pick a past incident or invent a plausible scenario
Set a kitchen timer for 5 minutes and start at L1 with only the runbook
Note where the runbook fails to give you the next step within 30 seconds
After completing the simulation, edit the runbook to fix every gap you found
The result is that the runbook gradually becomes a near-perfect script. The first time I did this, my runbook had 11 gaps. By the third game day, it had 1.
If you have a teammate, pair game days are even better: one person plays "the system" by feeding fake alerts and trace data, the other plays "on-call." Switching roles keeps both people sharp on the operational pieces.
A surprising side effect: game days also surface gaps in your observability. If the runbook says "check the failure rate by tool" and the dashboard makes that hard to do, you've found a dashboard improvement, not just a runbook one.
Sizing Your Runbook to Your Stage
A common mistake is copying a Big Tech runbook template into a 1-person side project. The result is a 50-page document nobody reads. Right-size the runbook to where you are.
For solo indie projects with one or two production Agents, a single Notion page with the L0-L3 sections collapsed is enough. Keep it under 800 words total. The four alerts I described above plus the five mitigation patterns plus the lightweight postmortem template fit comfortably in that budget.
For a small team with five to ten Agents, split the runbook into a "global runbook" (incident process, severity definitions, postmortem template) and per-agent runbooks (the specific failure modes, mitigations, and on-call rotations for each Agent). The per-agent pages should still be short — under 500 words — and the global runbook should be the single source of truth for process.
For larger setups (dozens of Agents, multiple teams), invest in proper structured incident management: integrating PagerDuty or Opsgenie with Slack, using a tool like FireHydrant or incident.io to drive the process, and keeping the runbook generators in code rather than free-form documents. At this scale, a static runbook becomes a liability — you want incident automation that already knows which Agent failed and pre-fills the relevant context.
Severity Definitions That Don't Lie
Most teams adopt P1/P2/P3 severity tiers but never write down what makes an incident a P1. The result is that everything becomes a P2 because nobody wants to be the person to call something P1, and then the real P1 gets lost in the noise.
Be precise. Below is the severity table I currently use for Antigravity Agents — adapt the numbers, but keep the structure.
```markdown
## Severity Definitions

### P1 (page on-call immediately, 24/7)
- User-facing failure rate above 20% for any Agent for more than 5 minutes
- Cost burst above $50/hour for any single Agent
- Tool loop pegged at 100+ identical calls
- Data corruption suspected (Agent wrote bad data to a system of record)

### P2 (page during business hours, ticket overnight)
- User-facing failure rate 5-20% sustained for more than 30 minutes
- Cost burst $20-50/hour
- Eval median below 0.7 sustained for more than 2 hours
- Latency P95 above 30s sustained for more than 30 minutes

### P3 (ticket, fix in current sprint)
- Eval median 0.7-0.85 sustained for more than 24 hours
- Single failed Agent run with no propagation
- Cost slowly trending upward without spike

### P4 (backlog, fix when convenient)
- Cosmetic issues, log spam, deprecation warnings
- Performance degradation under 10% with no user impact
```
Notice that the rate-based thresholds (failure rate, eval median, latency) pair a magnitude with a duration. "Failure rate above 20%" alone would page you for every momentary blip. "Above 20% for more than 5 minutes" filters out noise while catching real fires. This pattern matters for any alert on a fluctuating signal: if you don't have a duration, you don't have an alert, you have a paging machine.
Also notice that the P1 list explicitly includes cost and data corruption, not just availability. AI Agents fail in ways traditional services don't, and your severity tiers should reflect that.
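Part of the severity table can be written as a pure function, which makes the thresholds testable and keeps them from drifting between the document and the alerting code. This is a minimal sketch covering only the failure-rate, cost, and loop rows; the `Signal` shape and `classifySeverity` name are hypothetical, and the eval, latency, and data-corruption rows are left out for brevity.

```typescript
// Hypothetical encoding of the failure-rate, cost, and loop rows of the severity table.
type Severity = "P1" | "P2" | "none";

type Signal = {
  userFacingFailureRate: number; // fraction between 0 and 1
  failureSustainedMinutes: number; // how long the rate has stayed at this level
  costUsdPerHour: number;
  toolCallStreak: number; // consecutive identical tool calls
};

function classifySeverity(s: Signal): Severity {
  const p1 =
    (s.userFacingFailureRate > 0.2 && s.failureSustainedMinutes > 5) ||
    s.costUsdPerHour > 50 ||
    s.toolCallStreak >= 100;
  if (p1) return "P1";

  const p2 =
    (s.userFacingFailureRate >= 0.05 && s.failureSustainedMinutes > 30) ||
    s.costUsdPerHour >= 20;
  if (p2) return "P2";

  return "none";
}
```

The duration check is what keeps a two-minute spike at 38% from paging anyone: the same rate only becomes a P1 once it has held for more than five minutes.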
Communication During the Incident
If you have any users at all, communication is half of incident response. The runbook should make it impossible to forget. I keep three communication checklists, one per phase.
When You First Confirm User Impact
Within 5 minutes of confirming user impact, post a status update somewhere users can see it: a status page, a pinned tweet, a banner inside the app. Don't wait until you understand the cause. The message can be as simple as:
We're investigating an issue affecting <feature>. We'll post the next update by HH:MM.
The "next update by HH:MM" is the most important part. It commits you to a follow-up and stops the "is anyone there?" anxiety from users.
During Mitigation
If mitigation takes more than 15 minutes, post an interim update:
We've identified the issue and are applying a fix. <feature> is currently <degraded/unavailable>. We'll post the next update by HH:MM.
Resist the temptation to share root-cause guesses publicly until you're confident. "We think it might be the database" turns into "they said it was the database" which turns into a Twitter thread you can't take back.
After Recovery
Once recovery is confirmed, post a final status update:
The issue affecting <feature> has been resolved as of HH:MM. We'll publish a postmortem within <timeframe>.
Then actually write the postmortem within the promised timeframe. Public postmortems build trust precisely because most companies don't follow through on the promise. For an indie product, even a 200-word "what happened, what we changed" note builds disproportionate credibility.
Building a "Day 0 Incident Kit"
If you're starting from zero, the temptation is to ignore incident response entirely until you have your first incident. Don't. Here's the smallest possible kit that gives you a fighting chance, sized for a solo indie launch.
Spend four hours before your first production deployment: one hour wiring up the four alert categories from earlier (the cost alert matters most), one hour writing the L1 triage checklist tailored to your specific Agent, one hour creating the lightweight postmortem template in Notion or wherever you write, and one hour setting up phone-capable on-call notifications (the free tiers of PagerDuty or Opsgenie both work).
This single afternoon will save you from the most common solo-developer disaster: a runaway Agent that costs you a thousand dollars overnight while you sleep. I've watched several friends learn this lesson the expensive way. The Day 0 kit is the cheapest insurance policy in indie AI Agent development.
Connecting the Runbook to Your Eval Harness
A subtle but high-leverage move: hook your runbook into your eval harness so that fixes prove themselves before they ship. Every postmortem should include a "regression eval" — a small set of inputs that would have triggered the original failure — added to the harness as part of the fix.
The mechanics are simple. After a postmortem identifies the failure mode, you write 5-20 representative inputs that would have caused the original incident. Those inputs go into a regression-evals/ folder, named after the postmortem (e.g., regression-evals/2026-04-28-tool-timeout-loop.json). Your CI runs the regression evals on every PR; a regression in pass rate blocks merges.
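The merge-blocking step can be as small as a pass-rate comparison. Below is a minimal sketch, assuming your eval harness already produces a pass/fail result per regression case; `EvalResult`, `regressionGate`, and the idea of storing a baseline pass rate from the main branch are assumptions for illustration, not a specific harness API.

```typescript
// Hypothetical CI gate: compares the regression-eval pass rate against a stored baseline.
type EvalResult = { caseId: string; passed: boolean };
type GateOutcome = { ok: boolean; passRate: number; failedCaseIds: string[] };

function regressionGate(results: EvalResult[], baselinePassRate: number): GateOutcome {
  const failedCaseIds = results.filter((r) => !r.passed).map((r) => r.caseId);
  const passRate =
    results.length === 0 ? 0 : (results.length - failedCaseIds.length) / results.length;
  // An empty result set is treated as a failure: no evidence means no merge.
  const ok = results.length > 0 && passRate >= baselinePassRate;
  return { ok, passRate, failedCaseIds };
}
```

A CI step would run the cases, call `regressionGate(results, baseline)`, and exit non-zero when `ok` is false, printing `failedCaseIds` so the author knows which incident's cluster regressed.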
The result is that your runbook indirectly grows your test suite. Each incident strengthens the safety net that catches the next one. Over a year, this produces a defense-in-depth that no upfront test plan could match. I find this is one of the most underrated benefits of taking incident response seriously: it's not just about the current fire, it's about systematically reducing the cone of possible future fires.
A specific anti-pattern to avoid: don't add the failing input as a single test case and call it done. Real failure modes are clusters, not points. Spend 15 minutes generating 10-20 nearby inputs (different prompts, edge cases, length variations) and add the whole cluster. Otherwise you've trained your Agent to pass exactly one test, while the underlying weakness remains.
Worked Example: "Slack Goes Off at 2am" Walkthrough
Let me run through everything above with a real scenario from my own logs. Use it as a stress test for your own runbook.
```text
02:14 - PagerDuty notification on Slack
        [P1] Agent code-reviewer failure rate 38% in 5min
02:14 - Acknowledged on phone
02:15 - Opened Manager Surface, fetched latest runIds for agent_id=code-reviewer
02:16 - Reviewed top 3 traces -> all show same error: "ToolTimeout: github.create_review"
02:18 - Triage complete: pattern=ToolTimeout, user_impact=yes (CI is blocked)
02:19 - L2-A: applied timeout-relaxation flow
        $ npx tsx ops/mitigate.ts \
            --agent code-reviewer \
            --pattern tool-timeout \
            --duration 60
        -> Bumped timeout from 30s to 90s temporarily
02:21 - Failure rate dropped from 38% to 4%, recovery confirmed
02:22 - Posted "[L2 done] mitigated, monitoring" to Slack
02:23 - Created next-day ticket for permanent fix (GitHub API rate limiting)
02:25 - Back to bed
```
Without a runbook, this same incident takes me 45 minutes on average. With one, 11 minutes. Those 34 minutes are both the duration of user impact and the duration of lost sleep.
Anti-Patterns From Real Postmortems
Beyond the pitfalls section, here are anti-patterns that show up repeatedly in postmortems I've read across teams running Antigravity Agents in production. Each one cost someone a real outage.
The first anti-pattern is building one universal Agent for everything. The instinct to consolidate is reasonable: fewer agents seem easier to operate. The reality is the opposite. A single Agent that does code review, customer support, and data extraction has a runbook that's three runbooks crammed into one, and every incident requires figuring out which "mode" was failing. Split your Agents along clear functional boundaries even if it means more deployment surface. The blast radius of a failure stays contained, and the runbook stays focused.
The second is trusting the model name as a proxy for behavior. "We use Gemini 3.1 Pro" tells you almost nothing about how your Agent will behave under load. The same model with a slightly different system prompt, temperature, or tool set produces dramatically different failure modes. Pin your runbook to the actual Agent configuration, not to the underlying model. When you upgrade models, treat it as a configuration change that requires its own observation period before you trust the existing alerts.
The third is using human-readable Agent IDs that change. I've seen teams rename Agents during refactors and lose every alert mapping in the process. Pick an opaque, never-changing internal ID for each Agent (UUID or short hash) and let the human-readable name be a metadata field. This way, history-based queries like "how often did this Agent fail in the last quarter?" still work after the team renames it.
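The split between an opaque ID and a display name can be sketched as a tiny registry. Everything here is hypothetical (the `AgentRegistry` class, the example ID), and in practice the records would live in your configuration store rather than in memory.

```typescript
// Hypothetical registry: the opaque id is the alert key, the name is just metadata.
type AgentRecord = { id: string; displayName: string; previousNames: string[] };

class AgentRegistry {
  private readonly byId = new Map<string, AgentRecord>();

  register(id: string, displayName: string): AgentRecord {
    if (this.byId.has(id)) throw new Error(`agent id already registered: ${id}`);
    const record: AgentRecord = { id, displayName, previousNames: [] };
    this.byId.set(id, record);
    return record;
  }

  // Renaming changes only metadata; the id used in metrics and alerts stays fixed.
  rename(id: string, newName: string): AgentRecord {
    const record = this.byId.get(id);
    if (record === undefined) throw new Error(`unknown agent id: ${id}`);
    record.previousNames.push(record.displayName);
    record.displayName = newName;
    return record;
  }

  // Labels to attach to every metric emitted for this agent.
  metricLabels(id: string): { agent_id: string } {
    if (!this.byId.has(id)) throw new Error(`unknown agent id: ${id}`);
    return { agent_id: id };
  }
}
```

Because `metricLabels` never includes the display name, a quarter-long failure-rate query keeps returning one continuous series across renames.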
The fourth is alerting on individual model API errors. Models will return errors. Networks will hiccup. If you alert on every single error, you'll deafen yourself to actual incidents. Always alert on aggregates — failure rates over a window, cost over a window, latency percentiles over a window — never on a single event. The exception is the loop detector, which fires on a counter pattern but is itself a windowed signal.
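The difference between alerting on events and alerting on aggregates is easy to see in code. This minimal sketch computes a failure rate over a trailing window and refuses to alert on too few samples; the function names and the `minRuns` guard are assumptions added for illustration.

```typescript
// Hypothetical windowed aggregate: alert on a rate over a window, never on one event.
type RunOutcome = { atMs: number; ok: boolean };

// Failure rate over the trailing window, or null when there were no runs in it.
function windowedFailureRate(outcomes: RunOutcome[], nowMs: number, windowMs: number): number | null {
  const recent = outcomes.filter((o) => o.atMs > nowMs - windowMs && o.atMs <= nowMs);
  if (recent.length === 0) return null;
  return recent.filter((o) => !o.ok).length / recent.length;
}

// Alert only when enough runs exist in the window and the rate exceeds the threshold.
function shouldAlert(
  outcomes: RunOutcome[],
  nowMs: number,
  windowMs: number,
  threshold: number,
  minRuns: number,
): boolean {
  const recentCount = outcomes.filter((o) => o.atMs > nowMs - windowMs && o.atMs <= nowMs).length;
  const rate = windowedFailureRate(outcomes, nowMs, windowMs);
  return rate !== null && recentCount >= minRuns && rate > threshold;
}
```

With `minRuns` set to 5, a single failed run in a quiet hour yields a 100% failure rate but no alert, which is the behavior you want at 2am.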
The fifth is postmortems that blame people. The whole point of a blameless postmortem is to surface the systemic conditions that allowed the failure. "Masaki forgot to set the rate limit" is useless. "The deployment template had no required field for rate limits, so it was easy to forget" is actionable. Train yourself, and your team if you have one, to convert blame statements into systemic statements. The runbook updates that come out of this reframing are usually the highest-leverage changes you'll make.
Iteration: How the Runbook Should Evolve
The first version of your runbook will be wrong. That's expected. What matters is the iteration discipline.
Every postmortem should produce at least one runbook delta. If a postmortem doesn't change the runbook, you missed something. Either the runbook already covered this case (in which case why did it happen?), or you didn't dig deep enough into the systemic causes. Make "runbook delta" a required field in the postmortem template and treat its absence as a smell.
Quarterly, do a runbook health check: walk through every section and ask "would I actually find this useful at 2am?" Any section you wouldn't trust gets rewritten or deleted. A 200-word runbook you trust beats a 2000-word runbook you don't.
Annually, archive postmortems older than 12 months into a separate index and reread them. Many of the issues from a year ago will have been solved by infrastructure improvements you've made since. Some will have quietly resurfaced. The annual reread catches both, and gives you data for your next architectural priorities.
What to Do This Week
If you've read this far, here's a concrete sequence to put the runbook into practice over the next seven days.
On day one, set up the cost alert. This is the highest-impact 30 minutes you'll spend all week. Use the PromQL example from earlier and connect it to whatever paging service you have, or to your phone via PagerDuty's free tier if you're starting from nothing. The goal is that you'll be woken up before a runaway Agent costs you more than the rebuild cost of the entire project.
On days two and three, write the L1 triage checklist for your single most critical Agent. Tailor it to the specific tools that Agent uses and the specific Manager Surface views you'd consult. Time-box it to two hours per day; perfection is the enemy of having anything at all.
On day four, add the three remaining alert categories from earlier (failure rate, loop detection, eval quality). Verify each one fires by deliberately triggering it in a test environment.
On day five, write the lightweight postmortem template into your team wiki or Notion. Run a 30-minute retrospective on the most recent issue you've experienced (even a minor one) using the template, just to test it.
On days six and seven, run a 90-minute solo game day using the game-day format from earlier. Pick a plausible scenario, set the timer, and walk through L0 -> L1 -> L2 -> L3 with only the runbook in front of you. Record every gap. Spend the rest of the time fixing those gaps.
After this single week, you'll have more incident response infrastructure than 90% of indie AI Agent products. Your future self at 2am will thank you for the investment.
Wrap-Up — One Thing You Can Do Today
If you're going to do exactly one thing before tomorrow morning, set up cost alerts. Add the agentTokenSpend cumulative counter and a "page if more than $20 in 1 hour" rule to your production Agents. The full runbook is a multi-week project, but cost alerts take 30 minutes and prevent the largest financial damage.
Personally, the runbook framework I shared here took me over a year to refine in production. Don't aim for perfect — work through L0 -> L1 -> L2 -> L3 one tier at a time, and your nights will get quieter.