Canary Deployment with Auto-Rollback for AI Agents — Protecting Production with Antigravity and Burn-Rate SLOs
A practical playbook for shipping new AI agent versions through canary deployment on Antigravity, with automatic rollback driven by burn-rate SLOs. Includes a lightweight setup that solo developers can sustain.
In the summer of 2024 I pushed what looked like a small tweak to one of my wallpaper apps — I "just tightened up" an image generation prompt template. The next morning the review section was on fire. The actual prompt change was harmless. What broke production was a single line of conditional logic I had rewritten without noticing it disabled a fallback path. The AdMob session value dropped to roughly 60% of its usual level within half a day, and the few hours of recovery time quietly erased tens of thousands of yen in revenue that would otherwise have closed out the day. Nothing teaches you to respect "full traffic switches" like watching an AI-touching feature fail in front of real users.
Agents drift in ways deterministic code does not. Even when regression tests pass, the long tail of real user inputs in production will find an angle the test suite never imagined. That's exactly why "release gradually, retreat fast when things slip" needs to be a systematized property of your pipeline rather than a vibe. This article walks through how I use Antigravity as the hub for a canary deployment plus burn-rate SLO auto-rollback setup, sized for a workload that a solo developer can realistically maintain. I have been shipping mobile apps as an indie developer since 2014 alongside my visual art practice, and the cumulative install base now sits north of 50 million downloads. The moment I switched from "find out something broke after the fact" to "detect the early signal of degradation and let automation roll it back," the psychological weight of deployments dropped dramatically.
Treat canaries as a distinct concern from A/B tests
A/B tests answer the question of which variant is better, assuming both candidates are already acceptable. Canary deployment answers a different question: is the new version actually safe to expose at all? For AI agents, the evaluation isn't a single success rate — you have to watch response quality, cost, latency, and safety simultaneously.
The line I draw in my own indie operation looks like this.
Canary phase: prove the new version is not broken. Traffic is staged from 1% to 10% to 50% to 100%, and at every step the burn-rate must stay below threshold.
A/B test phase: only after the canary clears do I compare quality. Both variants run in parallel until sample size is sufficient to make a real decision.
When the two get blended, the "I want to see if my optimization landed" impulse takes over and obviously broken changes leak through. Forcing the order — first prove it isn't broken, then evaluate whether it is better — at the SKILL.md and code review level was the single change that cut down incidents the most for me.
Defining SLI, SLO, error budget, and burn-rate for agents
Canary decisions cannot run on intuition. The four definitions below need to be fixed in writing before you ship anything.
SLI (Service Level Indicator): the metric you measure. For agent workloads I track at least four streams: task completion rate, hallucination detection rate, P95 latency, and cost per request.
SLO (Service Level Objective): the target value for each SLI. For example: "completion rate ≥ 99.0%, P95 latency ≤ 8,000ms, cost per request ≤ $0.012."
Error budget: the total volume of failure permitted in a window. With a monthly 99% SLO, the budget is 1.0% of monthly requests.
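Burn-rate: how quickly the error budget is being spent. If 1% of a monthly budget is consumed in one hour, the burn-rate is 7.2 when normalized to the month: a 30-day month has 720 hours, so an even pace would spend 1/720 of the budget per hour, and 0.01 × 720 = 7.2.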
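The burn-rate arithmetic in the last definition can be reproduced in a few lines. This is a minimal sketch assuming a 30-day SLO window; `burnRateFromBudgetSpend` and `hoursToExhaustion` are hypothetical helper names introduced here for illustration, not part of any library.

```typescript
// Hypothetical helper: burn-rate implied by spending a fraction of the
// error budget over some elapsed time, normalized to the SLO window.
// Assumes the SLO window is expressed in days (30 for a monthly SLO).
export function burnRateFromBudgetSpend(
  budgetFractionSpent: number, // 0.01 means 1% of the budget
  elapsedHours: number,
  sloWindowDays: number = 30
): number {
  if (elapsedHours <= 0) {
    throw new RangeError("elapsedHours must be positive");
  }
  const windowHours = sloWindowDays * 24; // 720 for a 30-day window
  // At a burn-rate of 1.0, this share of the budget would be spent so far.
  const evenPaceFraction = elapsedHours / windowHours;
  return budgetFractionSpent / evenPaceFraction;
}

// Hours until the whole budget is gone if the current burn-rate holds.
export function hoursToExhaustion(
  burnRate: number,
  sloWindowDays: number = 30
): number {
  if (burnRate <= 0) return Number.POSITIVE_INFINITY;
  return (sloWindowDays * 24) / burnRate;
}
```

Spending 1% of the budget in one hour gives a burn-rate of about 7.2, and at that pace the monthly budget would last roughly 100 hours, a little over four days.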
Critically, in canary mode you compute the burn-rate against the canary slice only, not the global traffic. If your canary holds 5% of traffic, the relevant failure rate is the failure rate inside that 5%. Mixing those denominators is how teams miss a state where the global numbers look fine but 30% of users actually routed to the new version are seeing failures.
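To make the denominator problem concrete, here is a small sketch with made-up request counts. The `CohortCounts` shape and the numbers are assumptions chosen for illustration; only the 5% canary share and the 30% canary failure rate echo the paragraph above.

```typescript
// Hypothetical illustration: failure rate seen globally versus inside the
// canary cohort when the two streams are aggregated separately.
interface CohortCounts {
  totalRequests: number;
  failedRequests: number;
}

function failureRate(c: CohortCounts): number {
  return c.totalRequests === 0 ? 0 : c.failedRequests / c.totalRequests;
}

function combine(a: CohortCounts, b: CohortCounts): CohortCounts {
  return {
    totalRequests: a.totalRequests + b.totalRequests,
    failedRequests: a.failedRequests + b.failedRequests,
  };
}

// Made-up numbers: 5% of 100,000 requests go to canary.
const canary: CohortCounts = { totalRequests: 5_000, failedRequests: 1_500 };
const stable: CohortCounts = { totalRequests: 95_000, failedRequests: 190 };

const canaryRate = failureRate(canary); // 1,500 / 5,000
const stableRate = failureRate(stable); // 190 / 95,000
const globalRate = failureRate(combine(canary, stable)); // 1,690 / 100,000

console.log({ canaryRate, stableRate, globalRate });
```

With these numbers the blended failure rate stays under 2% while the canary cohort fails 30% of the time. Against a 1% budget that is a burn-rate below 2 globally and 30 inside the slice, so a threshold evaluated on the global stream would stay silent.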
For the image generation agent in my wallpaper app the SLOs sit at:
The numbers come from working backward from how much risk the business can absorb. A larger MAU base means even a 0.1% degradation translates into a sizeable cohort of impacted users, so the SLO has to be set conservatively as the audience grows.
Implementing canary traffic splitting on Antigravity
Splitting traffic lives either at the edge (Cloudflare Workers) or at your entry-point middleware (Next.js / Hono). The non-negotiable requirement is stickiness — the same user must always route to the same version. Random per-request splitting alternates a user between the two builds and produces an incoherent experience.
When I ask Antigravity's Agent mode for this design, it produces a hash-based deterministic router. Here is the version I actually run in production.
```typescript
// src/lib/canary-router.ts
// Decide an agent version per-user with a sticky, deterministic hash.
// The same userId always lands on the same version.
import { createHash } from "node:crypto";

export type AgentVersion = "stable" | "canary";

export interface CanaryConfig {
  canaryPercent: number; // 0-100 share for canary
  forceFlag?: AgentVersion; // emergency override (set during auto-rollback)
}

export function selectAgentVersion(
  userId: string,
  config: CanaryConfig
): AgentVersion {
  // Honour an active rollback first.
  if (config.forceFlag === "stable") return "stable";
  if (config.canaryPercent <= 0) return "stable";
  if (config.canaryPercent >= 100) return "canary";

  // Hash userId to a stable 0-99 bucket. Never use Math.random here.
  const h = createHash("sha256").update(userId).digest();
  const bucket = h.readUInt16BE(0) % 100;
  return bucket < config.canaryPercent ? "canary" : "stable";
}

// Expected behaviour:
// canaryPercent=10 routes roughly 10% of distinct users to canary.
// A given userId remains on the same version across the session.
```
You will not get the sticky property from Math.random(). When the Agent mode emits random-based code (it sometimes does, especially on first pass), reject it and ask for a hash-based rewrite explicitly. If you let randomness slip in, the burn-rate readings later on can no longer be interpreted at the user level, which kneecaps your incident analysis when something does break.
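A quick way to convince yourself that the hash-based rule behaves as claimed is to check it against synthetic user IDs. The sketch below reuses the same SHA-256 bucketing as the router; the `user-N` IDs and the helper names `bucketFor` and `isCanary` are assumptions made for this example.

```typescript
import { createHash } from "node:crypto";

// Same bucketing rule as the router above, isolated so it can be checked.
function bucketFor(userId: string): number {
  const h = createHash("sha256").update(userId).digest();
  return h.readUInt16BE(0) % 100; // 0-99
}

function isCanary(userId: string, canaryPercent: number): boolean {
  return bucketFor(userId) < canaryPercent;
}

// Synthetic user IDs, purely for illustration.
const users = Array.from({ length: 10_000 }, (_, i) => `user-${i}`);

// 1. Stickiness: hashing the same ID twice gives the same bucket.
const sticky = users.every((u) => bucketFor(u) === bucketFor(u));

// 2. Share: the canary cohort at 10% should be close to 10% of users.
const canaryAt10 = users.filter((u) => isCanary(u, 10));
const shareAt10 = canaryAt10.length / users.length;

// 3. Monotonic ramp: everyone in canary at 10% is still in canary at 50%.
const monotonic = canaryAt10.every((u) => isCanary(u, 50));

console.log({ sticky, shareAt10, monotonic });
```

The third check is the property a random split can never give you: because the comparison is `bucket < canaryPercent`, raising the percentage only adds users to the canary cohort and never moves an existing canary user back to stable.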
Store the canary percentage in Cloudflare KV or a Durable Object so you can flip it from an Antigravity Agent on demand. Avoid changing the percentage by editing code and redeploying — operational levers should be config flips, not code pushes.
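```typescript
// src/lib/canary-config.ts
// Load canary config from Cloudflare KV, propagated to the edge in ~60s.
import type { CanaryConfig } from "./canary-router";

export async function getCanaryConfig(
  env: { CANARY_KV: KVNamespace }
): Promise<CanaryConfig> {
  try {
    // The read sits inside the try block so that a KV outage (a rejected
    // get) also lands on the fail-safe instead of throwing.
    const raw = await env.CANARY_KV.get("agent:image-gen:canary");
    if (!raw) {
      // Fail-safe: missing or empty key means stable 100%.
      return { canaryPercent: 0 };
    }
    return JSON.parse(raw) as CanaryConfig;
  } catch {
    // Fail-safe: KV unreachable or value not valid JSON means stable 100%.
    return { canaryPercent: 0 };
  }
}

// Expected behaviour:
// A KV write typically propagates to all edge POPs within ~60 seconds
// (KV is eventually consistent, so treat that as a guideline, not a guarantee).
```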
The "if KV fails, route 100% to stable" branch is non-negotiable. Throwing or random-routing on KV failure causes the canary to spiral out of control during the exact moments your platform is least healthy.
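The same fail-safe idea extends to values that are valid JSON but not a valid config, and to the write path. What follows is a sketch under assumptions: `parseCanaryConfig`, `KVLike`, and this particular `setCanaryConfig` are hypothetical, written for illustration rather than taken from the original code. `KVLike` is a minimal slice of the KV binding, assumed to be satisfied by the real `KVNamespace` type.

```typescript
type AgentVersion = "stable" | "canary";

interface CanaryConfig {
  canaryPercent: number;
  forceFlag?: AgentVersion;
}

// Minimal slice of the KV binding this sketch needs (assumption).
interface KVLike {
  put(key: string, value: string): Promise<void>;
}

const CANARY_KEY = "agent:image-gen:canary";

// Parse and validate a stored config. Anything unexpected falls back to
// 100% stable, matching the fail-safe rule for KV failures.
export function parseCanaryConfig(raw: string | null): CanaryConfig {
  const safeDefault: CanaryConfig = { canaryPercent: 0 };
  if (!raw) return safeDefault;
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return safeDefault;
  }
  if (typeof value !== "object" || value === null) return safeDefault;
  const { canaryPercent, forceFlag } = value as Record<string, unknown>;
  if (typeof canaryPercent !== "number" || !Number.isFinite(canaryPercent)) {
    return safeDefault;
  }
  // Clamp to the 0-100 range the router expects.
  const config: CanaryConfig = {
    canaryPercent: Math.min(100, Math.max(0, canaryPercent)),
  };
  // Unknown flag values are dropped rather than passed through.
  if (forceFlag === "stable" || forceFlag === "canary") {
    config.forceFlag = forceFlag;
  }
  return config;
}

// Hypothetical write path: normalize first, then store.
export async function setCanaryConfig(
  env: { CANARY_KV: KVLike },
  config: CanaryConfig
): Promise<void> {
  const normalized = parseCanaryConfig(JSON.stringify(config));
  await env.CANARY_KV.put(CANARY_KEY, JSON.stringify(normalized));
}
```

Validating on both read and write means a hand-edited KV value such as `{"canaryPercent":"10"}` degrades to stable instead of producing a string comparison inside the router.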
Wiring telemetry as an Antigravity Agent task
Canary decisions need near-real-time metrics. In my setup, agent execution logs fan out into both Cloudflare Workers Analytics Engine and BigQuery. Decision-making runs against Analytics Engine because its 30-second granularity queries are fast enough for rollback timing.
```typescript
// src/lib/agent-telemetry.ts
// Record agent executions into AnalyticsEngine.
// Version goes into the index so per-version aggregations are cheap.
import type { AgentVersion } from "./canary-router";

export interface AgentExecutionResult {
  version: AgentVersion;
  userId: string;
  success: boolean;
  latencyMs: number;
  costUsd: number;
  errorCode?: string;
}

export function recordAgentExecution(
  env: { TELEMETRY: AnalyticsEngineDataset },
  result: AgentExecutionResult
): void {
  env.TELEMETRY.writeDataPoint({
    indexes: [result.version], // pivots queries by canary vs stable
    blobs: [
      result.userId,
      result.success ? "success" : "failure",
      result.errorCode ?? "",
    ],
    doubles: [result.latencyMs, result.costUsd],
  });
}

// Expected query against this dataset:
//   SELECT blob2 AS outcome, SUM(_sample_interval) AS requests,
//          AVG(double1) AS avg_latency_ms, AVG(double2) AS avg_cost_usd
//   FROM telemetry
//   WHERE index1 = 'canary' AND timestamp > NOW() - INTERVAL '5' MINUTE
//   GROUP BY outcome
// returns 5-minute request counts plus average latency and cost for canary,
// split by outcome. Counts use SUM(_sample_interval) because the dataset
// may be sampled.
```
When you ask Antigravity Agent to generate this, the prompt to add is "consolidate the metrics recording into a single function that every agent call path must funnel through." Without that constraint the Agent often creates one-off telemetry helpers per agent file, and consolidating them later wastes far more time than enforcing it up front.
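The single recording function pays off on the read side too, because one query shape can feed every burn-rate calculation. Below is a sketch of the two pure pieces of such a reader: building the SQL and folding the result rows into request counts. The dataset name `telemetry`, the `OutcomeRow` shape, and both function names are assumptions for illustration, and the interval syntax should be checked against the Analytics Engine SQL reference before relying on it.

```typescript
interface MetricsWindow {
  totalRequests: number;
  failedRequests: number;
  windowSeconds: number;
}

// One row per outcome, as produced by the GROUP BY below. Counts may arrive
// as strings in JSON, so both shapes are accepted.
interface OutcomeRow {
  outcome: string; // "success" | "failure" (blob2 in the writer above)
  requests: number | string;
}

// Build the SQL for one version and window. "telemetry" is an assumed
// dataset name; adjust it to your own binding.
export function buildWindowQuery(
  version: "stable" | "canary",
  windowSeconds: number
): string {
  const seconds = Math.max(1, Math.floor(windowSeconds));
  return [
    "SELECT blob2 AS outcome, SUM(_sample_interval) AS requests",
    "FROM telemetry",
    `WHERE index1 = '${version}'`,
    `  AND timestamp > NOW() - INTERVAL '${seconds}' SECOND`,
    "GROUP BY outcome",
  ].join("\n");
}

// Fold the grouped rows into the counts the burn-rate math needs.
export function rowsToMetricsWindow(
  rows: OutcomeRow[],
  windowSeconds: number
): MetricsWindow {
  let total = 0;
  let failed = 0;
  for (const row of rows) {
    const n = Number(row.requests);
    if (!Number.isFinite(n)) continue; // ignore malformed rows
    total += n;
    if (row.outcome === "failure") failed += n;
  }
  return { totalRequests: total, failedRequests: failed, windowSeconds };
}
```

Keeping the HTTP call out of these functions makes them easy to unit test; the network layer then only has to send the string and hand back the rows.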
Burn-rate math and multi-window alerts
The core of the canary decision is computing burn-rate. Below is a simplified multi-window, multi-burn-rate alert (the pattern from Google's Site Reliability Workbook) adapted for AI agent workloads.
```typescript
// src/lib/burn-rate.ts
// Calculate burn-rate for the canary stream over short and long windows.
// Roll back when both windows are above their thresholds simultaneously.
export interface SLOTarget {
  windowDays: number; // e.g. 30 for a monthly SLO
  successRateTarget: number; // e.g. 0.985 for 98.5%
}

export interface MetricsWindow {
  totalRequests: number;
  failedRequests: number;
  windowSeconds: number;
}

/**
 * Burn-rate = observed failure rate divided by allowed failure rate.
 * 1.0 means "spending the budget exactly in one SLO window length."
 * 10.0 means "spending one month of budget in three days" (dangerous).
 */
export function computeBurnRate(
  metrics: MetricsWindow,
  slo: SLOTarget
): number {
  if (metrics.totalRequests === 0) return 0;
  const observedErrorRate = metrics.failedRequests / metrics.totalRequests;
  const allowedErrorRate = 1 - slo.successRateTarget;
  if (allowedErrorRate <= 0) return Number.POSITIVE_INFINITY;
  return observedErrorRate / allowedErrorRate;
}

/**
 * Both windows must exceed threshold for a rollback page.
 * Thresholds borrowed from the SRE Workbook: 5min window > 14.4 AND 1hr window > 6.0
 */
export function shouldRollback(
  short: MetricsWindow, // 5-minute window
  long: MetricsWindow, // 1-hour window
  slo: SLOTarget
): boolean {
  const shortBR = computeBurnRate(short, slo);
  const longBR = computeBurnRate(long, slo);
  return shortBR > 14.4 && longBR > 6.0;
}

// Expected behaviour:
// For SLO 99% (1% allowed failures), observing 15% failure in the last 5min
// and 7% in the last 1hr yields shortBR = 15 > 14.4 AND longBR = 7 > 6.0
// → rollback triggered. Failure rates of 2% and 1.5% would give burn-rates
// of only 2.0 and 1.5, well below both thresholds.
```
The AND across two windows is the part you cannot skip. Short-window-only alarms misfire on transient spikes; long-window-only alarms react too slowly to save anything. The conjunction gives you "currently abnormal AND has been burning budget for long enough that the abnormality is real."
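The 14.4 and 6.0 values come from the SRE Workbook, where they are paired with 1-hour and 6-hour long windows; applying them to a 5-minute and a 1-hour window is a simplification for canaries. AI agents usually run looser SLOs (around 99%), and with a 1% budget a burn-rate of 14.4 already means 14.4% of requests failing, so I dial mine down to 12.0 and 4.5 for the wallpaper image agent. Earlier stops cost less than longer recoveries when the customer is a passive end user who does not retry.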
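It helps to translate burn-rate thresholds back into the failure rates that would actually trip them. The sketch below does that and makes the thresholds configurable instead of hardcoded; `BurnRateThresholds`, `failureRateAtThreshold`, and `exceedsBoth` are hypothetical names, and the 98.5% target is the value used by the watchdog code in this article.

```typescript
interface BurnRateThresholds {
  short: number; // applied to the 5-minute window
  long: number; // applied to the 1-hour window
}

// The looser values quoted above for the wallpaper image agent.
const IMAGE_AGENT_THRESHOLDS: BurnRateThresholds = { short: 12.0, long: 4.5 };

// Failure rate at which a burn-rate threshold is crossed:
// failureRate = burnRate * (1 - successRateTarget).
export function failureRateAtThreshold(
  burnRateThreshold: number,
  successRateTarget: number
): number {
  return burnRateThreshold * (1 - successRateTarget);
}

// True only when both windows are above their respective thresholds.
export function exceedsBoth(
  shortFailureRate: number,
  longFailureRate: number,
  successRateTarget: number,
  t: BurnRateThresholds = IMAGE_AGENT_THRESHOLDS
): boolean {
  const allowed = 1 - successRateTarget;
  // With a 100% target any failure is over budget.
  if (allowed <= 0) return shortFailureRate > 0 && longFailureRate > 0;
  return (
    shortFailureRate / allowed > t.short &&
    longFailureRate / allowed > t.long
  );
}
```

For a 98.5% target, thresholds of 12.0 and 4.5 correspond to roughly 18% failures over five minutes and 6.75% over the hour. Seeing the absolute numbers makes it easier to judge whether a threshold is strict enough for your users.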
The rollback trigger itself
Once the burn-rate verdict is in, you need to physically drive the canary traffic to zero. In my deployments this lives as an Antigravity Background Agent on a 5-minute schedule.
```typescript
// src/agents/canary-watchdog.ts
// Antigravity Background Agent, fired every 5 minutes.
// Monitors canary burn-rate and auto-rolls back when thresholds breach.
// Env and notifyRollback are assumed to be defined elsewhere in the project.
import { queryAnalyticsEngine } from "./telemetry-query";
import { computeBurnRate, shouldRollback } from "../lib/burn-rate";
import { setCanaryConfig } from "../lib/canary-config";

const SLO = { windowDays: 30, successRateTarget: 0.985 };

export async function canaryWatchdogTick(env: Env): Promise<void> {
  const short = await queryAnalyticsEngine(env, "canary", 300); // 5 minutes
  const long = await queryAnalyticsEngine(env, "canary", 3600); // 1 hour

  // Skip when sample size is too small (prevents false positives at low traffic).
  if (short.totalRequests < 50 || long.totalRequests < 500) {
    console.log(`[watchdog] insufficient samples, skipping`);
    return;
  }

  if (shouldRollback(short, long, SLO)) {
    const shortBR = computeBurnRate(short, SLO);
    const longBR = computeBurnRate(long, SLO);
    console.error(
      `[watchdog] ROLLBACK triggered: shortBR=${shortBR.toFixed(1)} longBR=${longBR.toFixed(1)}`
    );

    // Force canary back to zero with the override flag set.
    await setCanaryConfig(env, {
      canaryPercent: 0,
      forceFlag: "stable",
    });

    // Notify (Slack / Discord / email, whichever fits your ops loop).
    await notifyRollback(env, { shortBR, longBR });
  }
}

// Expected behaviour:
// Runs every 5 minutes and rolls canary back to 0% on the first tick where
// both thresholds are breached.
// Skips when canary traffic is below 50 reqs / 5min or 500 reqs / 1hr so
// that decisions rest on a meaningful sample.
```
The "skip when sample size is too small" branch is what keeps overnight low-traffic windows from generating spurious rollbacks. Most mobile-app businesses see overnight traffic at roughly 10% of peak, so without this check the watchdog ends up oscillating and operations falls apart.
To deploy this as an Antigravity Background Agent, set cron: "*/5 * * * *" in your agent.config.json. In Agent mode you can scaffold the Worker that invokes canaryWatchdogTick, wire it to its dataset, and add the notification path in a single natural-language session.
Model the deployment as an explicit state machine
A canary deployment is a state machine. I encode mine as six discriminated-union states so the compiler rejects illegal transitions.
Encoding the state as a discriminated union means you can hand supporting code to Antigravity with confidence: the type checker blocks any new state names the Agent might invent on the fly. With a plain status: string field, Agents will quietly invent values like "warming-up" or "holding" and your state machine erodes from inside.
Operationally I enforce a minimum 30-minute soak at each canary step. Anything shorter and the burn-rate sample size stays too small to be meaningful. For payments-adjacent agents I stretch the soak to two hours per step. Higher blast radius → longer soak.
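As a sketch of what such a state machine could look like: the six state names below are a guess for illustration (the article does not list them), built from the 1% → 10% → 50% → 100% ramp and the 30-minute soak described above. `advance`, `rollBack`, and `canaryPercentFor` are likewise hypothetical names.

```typescript
// Hypothetical six-state model; the state names are assumptions.
type DeploymentState =
  | { kind: "idle" }
  | { kind: "canary_1"; enteredAtMs: number }
  | { kind: "canary_10"; enteredAtMs: number }
  | { kind: "canary_50"; enteredAtMs: number }
  | { kind: "promoted"; enteredAtMs: number }
  | { kind: "rolled_back"; enteredAtMs: number; reason: string };

const MIN_SOAK_MS = 30 * 60 * 1000; // 30-minute minimum soak per step

export function canaryPercentFor(state: DeploymentState): number {
  switch (state.kind) {
    case "idle":
    case "rolled_back":
      return 0;
    case "canary_1":
      return 1;
    case "canary_10":
      return 10;
    case "canary_50":
      return 50;
    case "promoted":
      return 100;
  }
}

// Move one step up the ramp, or stay put if the soak has not elapsed.
export function advance(
  state: DeploymentState,
  nowMs: number,
  soakMs: number = MIN_SOAK_MS
): DeploymentState {
  switch (state.kind) {
    case "idle":
      return { kind: "canary_1", enteredAtMs: nowMs };
    case "canary_1":
      return nowMs - state.enteredAtMs >= soakMs
        ? { kind: "canary_10", enteredAtMs: nowMs }
        : state;
    case "canary_10":
      return nowMs - state.enteredAtMs >= soakMs
        ? { kind: "canary_50", enteredAtMs: nowMs }
        : state;
    case "canary_50":
      return nowMs - state.enteredAtMs >= soakMs
        ? { kind: "promoted", enteredAtMs: nowMs }
        : state;
    case "promoted":
    case "rolled_back":
      return state; // terminal until a new deployment starts from idle
  }
}

// Any active state can fall back; idle and rolled_back are left untouched.
export function rollBack(
  state: DeploymentState,
  reason: string,
  nowMs: number
): DeploymentState {
  if (state.kind === "idle" || state.kind === "rolled_back") return state;
  return { kind: "rolled_back", enteredAtMs: nowMs, reason };
}
```

Because `kind` is a closed union, writing `{ kind: "warming-up" }` anywhere fails type checking, and the exhaustive `switch` statements stop compiling if a seventh state is added without updating every transition.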
Failure modes I've actually hit
I've fallen into every trap below at least once. Naming them might keep you from doing the same.
Random-based routing: covered already, but worth restating. Math.random() produces a user who flips between versions, and that user's experience is incoherent. Always hash the user ID.
Caches that mix versions: if your CDN serves a canary response into the stable cache key, downstream readers see whichever version the cache happened to fill. Either send the version as a request header and list that header in Vary, or put the version in the URL path so it becomes part of the cache key.
Metric granularity too coarse: hourly data is far too slow. Aim for 1-minute aggregation minimum, 30-second if you can. Cloudflare Analytics Engine and Datadog's Live Tail are both within reach of solo budgets.
Computing burn-rate against the whole pipeline: a canary at 5% inside a healthy 95% will look fine in aggregate even while 30% of requests inside the canary slice are failing. Always compute against the canary cohort.
Forgetting to clear forceFlag after rollback: leave the override on and the next deployment silently routes everything to stable no matter what you set canaryPercent to. Couple the flag cleanup to the rolled-back state transition.
Watchdog not idempotent: if your Agent fires twice in quick succession it can issue conflicting decisions. Store the last decision timestamp in KV and short-circuit when the gap is too small.
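The last two items in the list lend themselves to small pure helpers. This is a sketch under assumptions: `WatchdogDecision`, `isDuplicateTick`, `clearOverride`, and the four-minute gap are hypothetical choices for illustration. Note that a read-then-write against KV is not atomic, so this narrows the double-fire window rather than closing it.

```typescript
type AgentVersion = "stable" | "canary";

interface CanaryConfig {
  canaryPercent: number;
  forceFlag?: AgentVersion;
}

interface WatchdogDecision {
  action: "rollback" | "none";
  decidedAtMs: number;
}

// Assumed minimum gap: just under the 5-minute schedule.
const MIN_GAP_MS = 4 * 60 * 1000;

// True when this tick should be skipped because another tick already
// recorded a decision too recently.
export function isDuplicateTick(
  last: WatchdogDecision | null,
  nowMs: number,
  minGapMs: number = MIN_GAP_MS
): boolean {
  if (last === null) return false;
  return nowMs - last.decidedAtMs < minGapMs;
}

// Config to write when a new deployment starts after a rollback: the
// override is dropped so canaryPercent takes effect again.
export function clearOverride(
  config: CanaryConfig,
  startPercent: number = 1
): CanaryConfig {
  return { canaryPercent: startPercent };
}
```

In a tick, the flow would be: read the last decision from KV, return early if `isDuplicateTick` is true, otherwise evaluate burn-rate and store the new decision with its timestamp.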
Combining canary with a shadow-mode rollout raises your safety floor further. Shadow mode mirrors production traffic to the new version while still returning the stable response to the user. Running a week of shadow before the 1% canary lets you compare diffs without ever exposing real users, which buys priceless calm before the first real switch.
Test your rollback before you actually need it
A rollback mechanism you have never exercised is not a rollback mechanism — it is a promise. The first time you trust your burn-rate watchdog should not be during the live incident.
I run two kinds of drills before I trust a canary system in production.
The first is a synthetic failure injection. I deploy the canary version with a deliberate failure rate injected by feature flag, using a small wrapper that returns an error for a fixed share of canary requests, and confirm that the watchdog flips the canary off within a couple of ticks once both windows hold enough samples. The rate has to be high enough to clear the thresholds: against the 98.5% SLO and the 14.4 short-window threshold used above, that means more than 21.6% failures, so a rate around 25% (one in four requests) works, whereas 3% produces a burn-rate of only 2.0 and never trips the watchdog. The point is not that the failure is realistic. The point is that the wiring between metrics, burn-rate computation, KV write, and edge propagation actually completes end to end.
```typescript
// src/lib/synthetic-failure.ts
// Feature-flag-controlled error injection for canary drills.
// Only enabled in pre-prod and only for the canary version.
import type { AgentVersion } from "./canary-router";

export async function maybeInjectFailure(
  version: AgentVersion,
  env: { FAILURE_RATE_KV: KVNamespace }
): Promise<void> {
  if (version !== "canary") return;
  const rateStr = await env.FAILURE_RATE_KV.get("drill:failure-rate");
  const rate = parseFloat(rateStr ?? "0");
  if (rate > 0 && Math.random() < rate) {
    throw new Error("synthetic_canary_failure");
  }
}

// Expected behaviour:
// When `drill:failure-rate` is set to "0.25", roughly 25% of canary
// requests throw `synthetic_canary_failure`, which the metrics layer
// records as a failure. The watchdog should fire within a couple of ticks
// once the 1-hour window also exceeds its threshold.
```
The second drill is a "shadow rollback" against a real but well-understood agent. I let the canary serve real production traffic for five minutes, then manually flip forceFlag to stable from the Antigravity Agent console, and trace every layer to confirm: KV updated, edge caches refreshed, downstream observability shows the version field flipping back, the deployment state machine logs the transition. Doing this every quarter has caught two regressions for me — once when a Vary header change broke version stickiness, and once when a CDN cache had been silently serving a six-hour-old CanaryConfig due to a misconfigured TTL.
The metric that matters here is what I call "rollback latency": elapsed time from "watchdog decides to rollback" to "every edge POP serves stable." For my workloads I aim for under 90 seconds end to end. If you cannot measure that number, you do not know your blast radius — you only think you do.
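Rollback latency is simple to compute once you have a timestamp for the decision and one for when each edge location was first seen serving stable. The sketch below assumes those observations come from some probe you already run; `PopObservation` and both function names are hypothetical.

```typescript
interface PopObservation {
  pop: string; // edge location label (illustrative)
  firstStableAtMs: number | null; // null = still serving canary
}

// Rollback latency: time from the watchdog's decision until the slowest
// edge location was first seen serving stable. Returns null while any
// location has not converged, so a partial rollback is never reported
// as complete.
export function rollbackLatencyMs(
  decidedAtMs: number,
  observations: PopObservation[]
): number | null {
  if (observations.length === 0) return null;
  let slowest = decidedAtMs;
  for (const o of observations) {
    if (o.firstStableAtMs === null) return null;
    slowest = Math.max(slowest, o.firstStableAtMs);
  }
  return slowest - decidedAtMs;
}

// 90 seconds is the end-to-end target mentioned above.
export function withinTarget(
  latencyMs: number | null,
  targetMs: number = 90_000
): boolean {
  return latencyMs !== null && latencyMs <= targetMs;
}
```

Taking the maximum rather than the average is deliberate: the blast radius is defined by the last location still serving canary, not the typical one.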
Cost as a first-class rollback signal
Latency and success rate are the obvious SLIs. Cost per request is the silent one, and it is the one that hurt me the most before I started watching it explicitly.
There was a release in early 2025 where I added a "self-correction" loop to the wallpaper agent — when the first generated image fell below a quality threshold, the agent would request a refinement. The intent was straightforward, the test results looked great, and overall failure rates stayed roughly the same. What I missed in the canary window is that the new logic was triggering the refinement path for about 40% of requests instead of the expected 5%, mostly because of an edge case in how prompts were parsed. Cost per request roughly tripled. By the time I caught it, the canary had been at 10% for six hours and I had quietly burned through several thousand yen of inference budget.
After that, cost became one of the four mandatory burn-rate streams. The thresholds I use today are deliberately stricter than the failure-rate ones, because cost regressions hurt the business directly even when nothing technically breaks.
```typescript
// src/lib/cost-burn-rate.ts
// Cost is a first-class SLO; treat sustained cost regressions like failures.
export interface CostWindow {
  totalRequests: number;
  totalCostUsd: number;
  windowSeconds: number;
}

export function computeCostRatio(
  canary: CostWindow,
  stable: CostWindow
): number {
  if (canary.totalRequests === 0 || stable.totalRequests === 0) return 1;
  const canaryAvg = canary.totalCostUsd / canary.totalRequests;
  const stableAvg = stable.totalCostUsd / stable.totalRequests;
  if (stableAvg === 0) return Number.POSITIVE_INFINITY;
  return canaryAvg / stableAvg;
}

// Expected behaviour:
// A ratio of 1.0 means cost per request is unchanged.
// A ratio of 2.5 means each canary request costs 2.5x as much as stable.
// I roll back at ratio > 1.5 sustained over a 30-minute window.
```
The interesting design choice here is to compare canary against the simultaneous stable cohort, rather than against a fixed historical baseline. That self-corrects for upstream price changes (a Gemini API rate adjustment, for example) and for genuine workload shifts that affect both versions equally. The signal you want is "canary is more expensive than stable right now," not "canary is more expensive than the same week last month."
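The "sustained over a 30-minute window" part of the cost rule needs a little state on top of the ratio itself. One way to sketch it, assuming the watchdog stores one ratio sample per 5-minute tick: `CostRatioSample`, `isCostRegressionSustained`, and the six-sample minimum are assumptions for illustration.

```typescript
interface CostRatioSample {
  atMs: number;
  ratio: number; // canary cost per request / stable cost per request
}

// True when every sample in the trailing window is above the threshold
// and there are enough samples to call the regression sustained.
export function isCostRegressionSustained(
  samples: CostRatioSample[],
  nowMs: number,
  threshold: number = 1.5,
  windowMs: number = 30 * 60 * 1000,
  minSamples: number = 6 // one per 5-minute tick across 30 minutes
): boolean {
  const recent = samples.filter(
    (s) => s.atMs > nowMs - windowMs && s.atMs <= nowMs
  );
  if (recent.length < minSamples) return false;
  return recent.every((s) => s.ratio > threshold);
}
```

Requiring every sample to be above the threshold means a single normal reading resets the clock, which keeps a brief burst of expensive requests from triggering a rollback on its own.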
Coordinating canaries with prompt versioning
Most agent releases are not "the whole agent changed." They are smaller: a single prompt was tightened, a tool description was rewritten, a system message added a guardrail. Treating those as full deployment events with their own canary process is overhead you cannot sustain at indie scale.
Instead, I treat prompt-only changes as a separate, lighter-weight track and let the full canary pipeline handle code or model changes. The split looks roughly like this.
Prompt-only changes: managed through the prompt versioning system, A/B tested without a full canary cycle. Burn-rate watchdog still observes, but with looser thresholds because the blast radius is smaller. Rollback is a single KV write reverting the active prompt version.
Code or model changes: full canary pipeline with the four-step traffic ramp, dual-window burn-rate, automated rollback, and state machine enforcement.
Drawing this line clearly in the SKILL.md keeps the team (or just your own future self) from treating every change identically and burning operational attention on changes that did not need it. The flip side is that you need to be honest with yourself when a "prompt change" is actually structural — adding a new tool reference, for example, is closer to a code change than a prompt change, even if the diff is only in the YAML file.
A lean version that solo developers can actually run
Everything above scales down. About 80% of the design is reusable on a tight budget.
One Cloudflare Analytics Engine dataset is enough: Datadog and Honeycomb are wonderful, but the fixed monthly cost adds friction for indies. Analytics Engine gives you 10M writes effectively free, which is plenty for burn-rate math.
A single Cron-triggered Worker is enough: you don't need a dedicated Durable Object for the watchdog. A 5-minute Cron trigger that runs the tick function costs tens of milliseconds of CPU per invocation.
Discord webhooks are enough for state notifications: Slack's paid plans are overkill at this stage. I pipe almost all of my indie project notifications into Discord channels and the visibility is more than enough.
Pair this with LLMOps production monitoring, plus prompt versioning and A/B testing, and you get a small but real safety net. Folding in an agent evaluation framework on top sharpens regression detection further still.
Start with one small step today
You don't need to canary-deploy every agent tomorrow. My recommendation: pick the single agent whose failures hurt users the most, and build one Cloudflare Analytics Engine dashboard with its burn-rate. Just having the dashboard transforms how fast you respond when something goes wrong. From there you can layer in canary splitting, then auto-rollback, then the state machine — each step adds value on its own.
Agents drift like living things. If you can't stop them from drifting, you can at least make sure that when they drift in the wrong direction, the system rolls itself back without waiting for a human to notice. That has been my honest answer to maintaining production quality as a solo developer, and I hope it gives you a starting point you can adapt to your own setup. Thank you for reading.