Antigravity Agent Shadow Mode Production Rollout Guide — A Safer Way to Test New Versions
How to safely roll out new versions of an Antigravity AI agent by mirroring real production traffic to the new version without affecting users — design, implementation and rollout playbook.
If your stomach tightens every time you ship a new agent version, you are not alone. A small prompt tweak quietly degrades responses for a particular user segment. A model swap suddenly triples your monthly bill. After running multi-agent systems on Antigravity for some time, I have hit this kind of "you only see it after you ship" failure more times than I can count.
A/B tests and canary releases are useful, but agents have a quirk that classical web releases don't share: the output is probabilistic, and cost is tied directly to the output. The moment a 5% canary starts misbehaving, real users see broken responses — and worse, you may not even notice for hours because the failure mode is "subtly wrong" rather than "loudly broken".
This article walks through shadow mode, a pattern I rely on heavily for shipping Antigravity agents. In shadow mode, real traffic is fanned out to the new version in parallel, but its responses are never returned to the user. The single biggest benefit is that the new version's failures stay invisible to users. You get production-grade signal without putting the user experience at risk.
Why shadow mode — and how it differs from A/B and canary
There are three main strategies for rolling out an agent. They look similar from the outside but solve different problems.
A canary release sends a slice of real traffic to the new version and rolls back on failure. It catches deployment-time defects quickly, but whatever the new version returns is what users see. For chat agents and creative-output agents, a quality regression instantly becomes a UX regression.
A/B testing exists to decide which of two versions performs better, and it assumes both are already production-quality. It is the wrong tool for pre-release safety verification, and exposing users to a clearly worse experiment can also raise ethical concerns.
Shadow mode runs the new version in parallel with production but never returns its output to users. You log everything, then compare divergence, cost, latency and failure rate. You get to ask "can the new version handle real traffic?" without any user risk. For me, this is the safest option for the final pre-release check.
I prefer this approach because most agent failures aren't implementation bugs — they're unexpected behavior on inputs nobody thought to test. Unit tests cannot catch that; only real traffic can. Shadow mode is the only realistic way to expose the new version to that real traffic safely. If you've already invested in an evaluation framework — covered in Antigravity Agent Evaluation Production Framework — think of shadow mode as the live-traffic verification layer that sits on top of it.
Shadow architecture — request mirroring and the comparator
The design has four pillars. First, mirror the production request to the new version the moment it arrives. Second, never return the new version's response to the user — ship it to a side channel. Third, compare both outputs using structured, deterministic scores. Fourth, build a kill switch so a runaway new version stops itself.
In practice I keep the production agent on the synchronous request path and push the shadow agent to a background queue. That separation is non-negotiable: the new version's latency or errors must never bleed into your SLA. The general patterns for error containment in agents are covered more broadly in Agent Resilience and Error Handling for Production. With Antigravity agents, "production path = sync, shadow path = async" should be the default mental model.
The comparator design depends on what kind of agent you're shipping. For task-completion agents (code generation, classification, extraction), structural equivalence via hashes and schema validation works well. For conversational agents, semantic similarity (embedding cosine) plus auxiliary metrics (length, tone, refusal rate) is more realistic. I avoid relying on "LLM-as-a-judge" as the primary score because evaluator models drift; deterministic metrics make a much better foundation.
Implementation 1: request mirroring with waitUntil
Let's start with the smallest possible setup that fans a request out to both versions. Assuming you deploy from Antigravity to Cloudflare Workers, Hono is a good fit. The shadow enqueue must be handed to waitUntil (c.executionCtx.waitUntil in Hono) so it never delays the production response.
    // src/worker/shadow-router.ts
    import { Hono } from "hono";

    type Bindings = {
      AGENT_V1_URL: string;
      AGENT_V2_URL: string;
      SHADOW_QUEUE: Queue;
      SHADOW_ENABLED: string; // "true" | "false"
    };

    const app = new Hono<{ Bindings: Bindings }>();

    app.post("/api/agent/run", async (c) => {
      const body = await c.req.json();
      const requestId = crypto.randomUUID();
      const startedAt = Date.now();

      // 1. Production agent on the sync path: this is the critical path.
      let primaryRes: Response;
      try {
        primaryRes = await fetch(c.env.AGENT_V1_URL, {
          method: "POST",
          headers: { "content-type": "application/json" },
          body: JSON.stringify({ ...body, requestId, version: "v1" }),
        });
      } catch (err) {
        console.error("primary_failed", { requestId, err: String(err) });
        return c.json({ error: "agent_unavailable" }, 503);
      }
      const primaryJson = await primaryRes.clone().json<unknown>();

      // 2. Shadow goes to a background queue via waitUntil: never blocks the response.
      if (c.env.SHADOW_ENABLED === "true") {
        c.executionCtx.waitUntil(
          enqueueShadow(c.env.SHADOW_QUEUE, {
            requestId,
            startedAt,
            request: body,
            primaryResponse: primaryJson,
          }).catch((err) => {
            // Swallow shadow enqueue errors: production must not be impacted.
            console.error("shadow_enqueue_failed", { requestId, err: String(err) });
          }),
        );
      }

      return new Response(primaryRes.body, {
        status: primaryRes.status,
        headers: primaryRes.headers,
      });
    });

    async function enqueueShadow(queue: Queue, payload: unknown): Promise<void> {
      await queue.send(payload, { contentType: "json" });
    }

    export default app;
Three details matter here. waitUntil ensures the shadow path never delays the response. Swallowing shadow enqueue failures keeps a queue outage from surfacing as a production error. And SHADOW_ENABLED is the first stage of the kill switch: flip it and shadow stops as soon as the new value propagates.
Expected behavior: production responses still come back in 100–400 ms, while shadow jobs accumulate in the queue and are drained asynchronously by a separate worker. Critically, no shadow trace may leak into response headers or body. If you skip that rule, browser caches and analytics logs will start carrying new-version artifacts, and you will have a much harder debugging session later.
Implementation 2: structured divergence scoring
The shadow consumer worker actually calls v2 and compares its output to the captured production output. To keep the comparator general I always emit four metrics: schema validity, exact-string match, embedding cosine similarity, and auxiliary indicators (length difference, tone classifier).
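The four metrics can be sketched as a single pure scoring function. This is a minimal illustration under assumptions, not the article's consumer: the `DivergenceScore` shape, the `readAnswer` contract (an object with a string `answer` field), and the function names are all hypothetical, the embeddings are assumed to be computed elsewhere, and the tone classifier is left out.

```typescript
// Hypothetical score shape; the tone classifier is left out of this sketch.
type DivergenceScore = {
  schemaValid: boolean; // does v2's output satisfy the expected contract?
  exactMatch: boolean;  // identical answer text
  cosine: number;       // embedding cosine similarity
  lengthDiff: number;   // v2 length minus v1 length, in characters
};

// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("embeddings must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / Math.sqrt(normA * normB);
}

// Assumed contract: an object with a string `answer` field.
function readAnswer(output: unknown): string | null {
  if (typeof output !== "object" || output === null) return null;
  const answer = (output as { answer?: unknown }).answer;
  return typeof answer === "string" ? answer : null;
}

function scoreDivergence(
  v1Answer: string,
  v2Output: unknown,
  v1Embedding: number[],
  v2Embedding: number[],
): DivergenceScore {
  const v2Answer = readAnswer(v2Output);
  if (v2Answer === null) {
    // Contract broken: report it and skip the text-level metrics.
    return { schemaValid: false, exactMatch: false, cosine: 0, lengthDiff: 0 };
  }
  return {
    schemaValid: true,
    exactMatch: v1Answer === v2Answer,
    cosine: cosineSimilarity(v1Embedding, v2Embedding),
    lengthDiff: v2Answer.length - v1Answer.length,
  };
}
```

Keeping the scorer free of I/O means it can be unit-tested on recorded pairs before it ever runs inside a queue consumer.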
On the shadow call's timeout, which I start at 8s: when the production agent averages 1–2s, the timeout has to serve two competing goals. Sometimes you want "let it run long, just observe", and sometimes you want "fail fast at production-like latency". I start permissive at 8s to capture the new version's true latency profile, then tighten toward the production p95 once it stabilizes. Tightening too early throws away the very evidence you need.
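One way to apply a permissive deadline while still recording how long the call took is a small race helper. This is a sketch under assumptions: `withDeadline` and `TimedResult` are hypothetical names, and the helper only stops waiting; it does not cancel the underlying request, which would need an abort signal passed to the actual call.

```typescript
type TimedResult<T> =
  | { status: "ok"; value: T; elapsedMs: number }
  | { status: "timeout"; elapsedMs: number };

// Race the work against a timer and report elapsed time either way.
async function withDeadline<T>(work: Promise<T>, timeoutMs: number): Promise<TimedResult<T>> {
  const started = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const tagged = work.then((value) => ({ status: "ok" as const, value }));
  const deadline = new Promise<{ status: "timeout" }>((resolve) => {
    timer = setTimeout(() => resolve({ status: "timeout" }), timeoutMs);
  });

  try {
    const winner = await Promise.race([tagged, deadline]);
    return { ...winner, elapsedMs: Date.now() - started };
  } finally {
    clearTimeout(timer); // avoid keeping the runtime alive after a fast result
  }
}
```

A shadow consumer could wrap its v2 call as `withDeadline(callV2(job), 8000)` (with `callV2` being whatever function issues the request) and log both outcomes, so timeouts show up in the latency profile rather than disappearing.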
I'm strict about schema validation for one reason: contract drift on LLM outputs (a new version emits an extra field, or wraps strings differently) is one of the most common production breaks, and it shows up nowhere in unit tests. If shadow mode catches even just this single class of regression, it has paid for itself.
Implementation 3: cost, latency and error-rate watchdog with auto kill switch
Pairing shadow with an existing observability stack — see Antigravity OpenTelemetry Observability Pipeline — lets you fold shadow-specific metrics into the dashboards your team already watches.
Shadow doubles your API spend while it runs, so leaving it on indefinitely is dangerous. A misbehaving v2 can amplify that cost too. I run a Cloudflare Cron Trigger every minute to summarize recent samples and flip SHADOW_ENABLED to false whenever a threshold is breached.
The watchdog refuses to act on fewer than 50 samples (a samples.length < 50 guard), and that floor exists because I learned the hard way. One late night, traffic was low, three samples produced a meanCosine of 0.4 by sheer chance, the watchdog killed shadow, and I spent the next morning debugging a phantom. A minimum-sample-size guard is mandatory in any rollout-decision automation. If you have low traffic at off hours, hold the verdict until you have enough data.
Expected behavior: if meanCosine drops below 0.7, p95 exceeds 6s, or schema-violation rate exceeds 10%, shadow stops within a minute and your alerting channel pings. This structural protection goes a long way toward preventing the "I woke up to a shocking bill" scenario.
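The decision logic described above fits in a pure function that the cron handler can call. This is a sketch, not the article's watchdog: `ShadowSample`, `watchdogVerdict`, and the nearest-rank percentile are assumptions, while the thresholds (0.7, 6s, 10%, 50 samples) come from the text.

```typescript
type ShadowSample = { cosine: number; latencyMs: number; schemaValid: boolean };
type WatchdogVerdict = "insufficient_data" | "healthy" | "kill";

const MIN_SAMPLES = 50;

// Nearest-rank percentile; assumes a non-empty array.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function watchdogVerdict(samples: ShadowSample[]): WatchdogVerdict {
  // Never render a verdict on a handful of samples.
  if (samples.length < MIN_SAMPLES) return "insufficient_data";

  const meanCosine = samples.reduce((sum, s) => sum + s.cosine, 0) / samples.length;
  const p95LatencyMs = percentile(samples.map((s) => s.latencyMs), 95);
  const violationRate = samples.filter((s) => !s.schemaValid).length / samples.length;

  if (meanCosine < 0.7 || p95LatencyMs > 6000 || violationRate > 0.1) return "kill";
  return "healthy";
}
```

On a "kill" verdict the caller would flip SHADOW_ENABLED and alert; on "insufficient_data" it does nothing and waits for the next run.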
Promotion criteria — from shadow to 10%, 50% and 100% of real traffic
Once shadow has earned trust, it's time to graduate to real traffic. I move in four explicit stages: shadow 100% → real 10% → 50% → 100%. Each step has numeric — not vibes-based — criteria.
Going from shadow to real 10% I require: at least 24 hours of continuous shadowing, at least 10,000 samples, meanCosine ≥ 0.85, schema-violation rate < 1%, p95 latency ≤ 1.3× of production, cost ≤ 1.5× of production. Until all six are green, I don't promote.
Going from 10% to 50% I additionally require that product KPIs (CTR, conversation continuation rate, or whatever matters in your product) show no statistically significant regression versus the production baseline. From 50% to 100%, I require 48 hours of clean operation with zero incidents.
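The shadow-to-10% gate can be written as data plus a check that names whatever is still red. This is an illustrative sketch: the `ShadowStageStats` shape and `failedPromotionChecks` are hypothetical names, and only the six thresholds come from the criteria above.

```typescript
type ShadowStageStats = {
  hoursObserved: number;
  sampleCount: number;
  meanCosine: number;
  schemaViolationRate: number; // 0.01 means 1%
  p95LatencyRatio: number;     // v2 p95 divided by production p95
  costRatio: number;           // v2 cost divided by production cost
};

// Returns the names of the criteria that are not yet green.
function failedPromotionChecks(s: ShadowStageStats): string[] {
  const checks: Array<[string, boolean]> = [
    ["hoursObserved >= 24", s.hoursObserved >= 24],
    ["sampleCount >= 10000", s.sampleCount >= 10000],
    ["meanCosine >= 0.85", s.meanCosine >= 0.85],
    ["schemaViolationRate < 0.01", s.schemaViolationRate < 0.01],
    ["p95LatencyRatio <= 1.3", s.p95LatencyRatio <= 1.3],
    ["costRatio <= 1.5", s.costRatio <= 1.5],
  ];
  return checks.filter(([, passed]) => !passed).map(([name]) => name);
}
```

Promotion proceeds only when the returned list is empty; a non-empty list doubles as the message for the rollout log.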
The crucial point is that you also need explicit "go back" criteria, not just "go forward". If meanCosine falls below 0.7 mid-rollout, traffic to v2 must auto-revert to shadow within minutes. I keep these criteria in version control as code, so changing them goes through review. Treat the criteria themselves as part of your infrastructure.
A subtle thing I learned: the criteria for moving forward should always be more conservative than the criteria for staying in place. If you require meanCosine ≥ 0.85 to advance, set the rollback threshold at meanCosine < 0.80 rather than < 0.85. This 5-point hysteresis band prevents oscillation, where the rollout flips back and forth as the metric jitters around a single threshold. Any control system without hysteresis will chatter; rollout pipelines are no exception.
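A two-threshold decision makes the hysteresis band concrete. The function and constant names below are hypothetical; the 0.85 and 0.80 values are the ones from the example above.

```typescript
type RolloutAction = "advance" | "hold" | "rollback";

const ADVANCE_AT = 0.85;     // required to move forward
const ROLLBACK_BELOW = 0.8;  // required to move backward

// Values inside the band [0.80, 0.85) neither advance nor roll back.
function rolloutAction(meanCosine: number): RolloutAction {
  if (meanCosine >= ADVANCE_AT) return "advance";
  if (meanCosine < ROLLBACK_BELOW) return "rollback";
  return "hold";
}
```

With a single threshold at 0.85, a metric jittering between 0.84 and 0.86 would alternate between advancing and rolling back; with the band, the 0.84 readings simply hold.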
Another lesson: always include a wall-clock minimum at every stage, not just a sample-count minimum. Even if you accumulate 10,000 samples in three hours, hold for the full 24 because user behavior has daily seasonality. A new version that looks great during business hours can quietly fall apart on weekend traffic patterns or international users hitting it at 3 a.m. local time. Time is part of the signal.
Cost economics: how I think about the price of running shadow
Shadow doubles your inference spend while it runs, so it's worth modeling the cost like an experiment budget rather than infrastructure overhead. I treat each shadow campaign as a discrete project with a cap, a duration, and a target sample size.
The simple formula I use is: expected_cost = qps × campaign_duration_hours × 3600 × cost_per_call_v2. For a service running at 5 qps, a 48-hour shadow campaign at $0.002 per v2 call costs roughly $1,728. That feels expensive in isolation, but it's an order of magnitude cheaper than a single user-visible regression that prompts refunds and a public apology.
To make the spend feel less open-ended, I split the budget by stage: 60% for shadow 100%, 25% for the 10% canary, 10% for the 50% canary, and 5% reserved for unplanned rollback investigations. Treating budget like a versioned config and reviewing it every campaign keeps surprise charges down. If a stage exhausts its share early, that's a signal to pause and investigate rather than silently overspend.
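The formula and the stage split are simple enough to keep as reviewed code next to the rollout criteria. The function names and the `STAGE_SHARES` keys are hypothetical; the formula and percentages are the ones given above.

```typescript
// expected_cost = qps × campaign_duration_hours × 3600 × cost_per_call_v2
function expectedShadowCostUsd(qps: number, campaignHours: number, costPerCallV2Usd: number): number {
  return qps * campaignHours * 3600 * costPerCallV2Usd;
}

// Budget shares per stage; they sum to 1.
const STAGE_SHARES = {
  shadow: 0.6,
  canary10: 0.25,
  canary50: 0.1,
  rollbackReserve: 0.05,
};
type BudgetStage = keyof typeof STAGE_SHARES;

function splitBudget(totalUsd: number): Record<BudgetStage, number> {
  return {
    shadow: totalUsd * STAGE_SHARES.shadow,
    canary10: totalUsd * STAGE_SHARES.canary10,
    canary50: totalUsd * STAGE_SHARES.canary50,
    rollbackReserve: totalUsd * STAGE_SHARES.rollbackReserve,
  };
}

// True when a stage has spent its share and the campaign should pause.
function isStageOverBudget(stage: BudgetStage, spentUsd: number, totalUsd: number): boolean {
  return spentUsd >= splitBudget(totalUsd)[stage];
}
```

For the worked example, `expectedShadowCostUsd(5, 48, 0.002)` reproduces the roughly $1,728 figure: 5 × 48 × 3600 = 864,000 calls at $0.002 each.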
Keep an eye on token consumption too, not just API request count. A new version that returns longer responses can be quietly twice as expensive on token-priced models even if the request count looks identical. I always emit output_tokens alongside latency in the shadow consumer; it has caught silent cost regressions for me more than once.
What I look at on the dashboard each morning
Once shadow is running, having the right small set of charts becomes the difference between catching problems in 30 minutes versus a week. I try to keep the dashboard to a single screen, no scrolling, with five panels.
The first panel is divergence over time — meanCosine plotted as a 5-minute rolling average. A healthy rollout shows a flat line; a downward trend means the new version is drifting away from production behavior on a consistent class of inputs.
The second is schema-violation rate. This one is binary in spirit: anything above 0.5% needs immediate attention, because schema breaks become user-visible the moment you promote. I draw a horizontal threshold line at 1%, the promotion gate, as a visual alarm.
The third is p95 latency, plotted side by side for v1 and v2. Looking at the ratio rather than absolute numbers makes it easier to see whether v2 is meaningfully slower under the same load conditions.
The fourth is cost per request, again v1 vs v2. This is what catches "the new prompt is producing 3× longer outputs" within hours rather than at the end of the month.
The fifth is divergence broken down by request type or user segment. Aggregate scores can hide a 95th-percentile-only problem — for instance, the new version might be fine for 95% of inputs but fail badly on a specific intent. Slicing by intent (or user tier, or input language) is what surfaces those tail problems before they become churn.
When NOT to use shadow mode
Shadow mode is not free, and there are situations where it's the wrong tool. I try to think hard about whether to skip it before adding the operational complexity.
If the new version's only change is a non-functional refactor — purely renaming files or extracting modules with no prompt or model changes — shadow adds cost without learning anything new. Unit and integration tests are the right gate.
If the agent's responses contain non-determinism that can't be normalized (timestamps, random IDs, model-side temperature variation), naive divergence metrics will flag false positives. You either need to normalize before comparing, or accept that semantic similarity is your only honest signal.
If your product has very low traffic (under 100 requests per day), shadow won't accumulate enough samples for the watchdog to make decisions. In that regime, manual eval against a curated golden set tends to give faster feedback. I move to shadow mode once traffic crosses roughly a few thousand requests per day, where the statistical signal becomes meaningful.
Finally, if the new version is intended to be deliberately different — say, a fundamentally new persona or a new task type — divergence from v1 is the goal, not the warning. Forcing a shadow comparator there will just generate noise. In that case, lean on direct human evaluation and skip the divergence dashboards entirely.
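For the non-determinism case above, normalizing before comparing can be as small as masking the volatile tokens. This is a sketch under assumptions: `normalizeForComparison` is a hypothetical helper, and it only handles ISO 8601 timestamps and UUIDs; real outputs will likely need more patterns.

```typescript
// ISO 8601 timestamps such as 2024-05-01T10:15:30Z or 2024-05-02T23:59:59.123+09:00
const ISO_TIMESTAMP = /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?/g;
// Canonical 8-4-4-4-12 hex UUIDs
const UUID_PATTERN = /\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b/gi;

// Replace volatile tokens with placeholders and collapse whitespace.
function normalizeForComparison(text: string): string {
  return text
    .replace(ISO_TIMESTAMP, "<TS>")
    .replace(UUID_PATTERN, "<UUID>")
    .replace(/\s+/g, " ")
    .trim();
}
```

Exact-match and hash comparisons then run on the normalized strings, so two responses that differ only in a timestamp or request ID no longer count as divergent.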
Common pitfalls
A few traps come up over and over. These five cost me time, so they're worth flagging.
Side-effects in shadow. If v2 calls real payment or email APIs, you can silently double-charge users. Shadow must run against sandbox endpoints or use idempotency keys you reset. I always thread a runtime: "shadow" flag through the agent's context so tools can suppress side effects on the shadow path.
Hidden cost from doubled API calls. Shadow doubles spend while it runs. Issue v2 its own API key so the bill is visible in isolation, and put a hard monthly cap on it. Without that, you'll find out from the month-end invoice rather than your dashboard.
Over-relying on LLM-as-a-judge. It's tempting to ask another model "which is better", but evaluator models drift; the same pair of outputs can rate differently from one day to the next. Anchor on deterministic metrics (schema, hash, length, embedding cosine), and use LLM-as-a-judge only as a tie-breaker.
Synchronous shadow calls leaking into production latency. If anyone awaits the shadow call on the request path, your p95 goes up immediately. Bake the rule "no await on shadow" into your code review checklist; a single missed waitUntil is enough to undo the whole design.
Verdicts on tiny samples. Without a minimum-sample-size guard, low-traffic windows will whipsaw your rollout decisions. Always require a floor (I use 50) before rendering any verdict.
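For the side-effects pitfall, the runtime flag can be enforced in one wrapper that every side-effecting tool goes through. This is a sketch: `AgentContext`, `runSideEffect`, and the result shape are hypothetical; only the idea of threading runtime: "shadow" through the context comes from the text.

```typescript
type AgentRuntime = "production" | "shadow";
type AgentContext = { requestId: string; runtime: AgentRuntime };

type SideEffectResult<T> =
  | { executed: true; value: T }
  | { executed: false; reason: string };

// Runs the effect on the production path and suppresses it on the shadow path.
function runSideEffect<T>(ctx: AgentContext, label: string, effect: () => T): SideEffectResult<T> {
  if (ctx.runtime === "shadow") {
    return { executed: false, reason: `suppressed ${label} on shadow path` };
  }
  return { executed: true, value: effect() };
}
```

A tool such as an email sender would call `runSideEffect(ctx, "send_email", () => ...)`, so the shadow run still exercises the agent's decision to send while nothing leaves the system.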
Building the comparator beyond cosine similarity
Embedding cosine similarity is a fine baseline, but it has a known weakness: two responses can share most of the meaning while disagreeing on the one fact that matters. "The user's order will arrive on Tuesday" and "The user's order will arrive on Thursday" can have a cosine score above 0.9. To catch this class of regression I layer a few extra checks on top.
The first is span-level entity extraction. For agents that emit dates, prices, IDs, or contact details, I run a small extractor on both responses and compare the entities directly. If the entity sets disagree, I flag it as a high-severity divergence regardless of cosine score.
The second is a refusal-pattern classifier. New prompts sometimes cause the model to refuse where the previous version answered, or vice versa. Both are interesting signals, and a binary classifier (refusal yes/no) is cheap to run on every response. A spike in refusal rate often correlates with a too-strict new system prompt.
The third is a tool-call diff. If your agent uses tool calling, compare the sequence of tool invocations rather than just the final message. A new version that takes five steps where the old one took two might still produce a similar final answer, but the cost and latency profile is fundamentally different — and worth catching.
    // src/worker/comparator-extras.ts
    type ToolCall = { name: string; args: Record<string, unknown> };

    export function diffToolSequences(a: ToolCall[], b: ToolCall[]): {
      shapeMatch: boolean;
      lengthDiff: number;
      firstDivergeIdx: number;
    } {
      const minLen = Math.min(a.length, b.length);
      let firstDivergeIdx = -1;
      for (let i = 0; i < minLen; i++) {
        if (a[i].name !== b[i].name) {
          firstDivergeIdx = i;
          break;
        }
      }
      return {
        shapeMatch: a.length === b.length && firstDivergeIdx === -1,
        lengthDiff: Math.abs(a.length - b.length),
        firstDivergeIdx,
      };
    }

    export function refusalLikelihood(text: string): number {
      // Tiny heuristic; replace with a real classifier in production.
      const cues = [
        /i (cannot|can't|won't)/i,
        /not able to/i,
        /unable to comply/i,
        /against my (guidelines|policy)/i,
      ];
      let hits = 0;
      for (const c of cues) if (c.test(text)) hits++;
      return Math.min(1, hits / 2);
    }
These small layered comparators are cheap and catch concrete failure modes that aggregate similarity scores miss. I add new layers as I encounter new failure modes, and over time the comparator becomes a kind of institutional memory of "things that have gone wrong before".
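The span-level entity check described earlier in this section can start as a few regular expressions. This is a deliberately crude sketch: `extractEntities` and `entitiesDisagree` are hypothetical names, and the patterns cover only weekday names, ISO dates, and dollar prices. A production version would use a proper extractor.

```typescript
const ENTITY_PATTERNS: RegExp[] = [
  /\b(?:mon|tues|wednes|thurs|fri|satur|sun)day\b/gi, // weekday names
  /\b\d{4}-\d{2}-\d{2}\b/g,                           // ISO dates
  /\$\d+(?:\.\d{2})?/g,                               // dollar prices
];

// Returns the unique entities found in the text, lowercased and sorted.
function extractEntities(text: string): string[] {
  const found: string[] = [];
  for (const pattern of ENTITY_PATTERNS) {
    for (const match of text.match(pattern) ?? []) {
      found.push(match.toLowerCase());
    }
  }
  return Array.from(new Set(found)).sort();
}

// True when the two responses mention different entity sets.
function entitiesDisagree(v1Text: string, v2Text: string): boolean {
  return extractEntities(v1Text).join("|") !== extractEntities(v2Text).join("|");
}
```

The Tuesday/Thursday pair from above is flagged by this check even though its cosine score would look healthy, which is the reason to treat an entity disagreement as high severity regardless of similarity.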
Storing shadow data: KV is fine to start, but plan the upgrade path
The examples in this article use Cloudflare KV for simplicity, but KV is best thought of as a short-term buffer: it gives you key lookups and prefix listing, not queries. Once you have more than a few thousand samples per day and want to slice by user segment or input intent, scanning KV becomes painfully slow.
The upgrade path I recommend is to treat KV as a buffer and fan out to a columnar analytics store. R2 + ClickHouse and BigQuery are both good landing spots; the choice depends on which you already pay for. Even a daily cron that reads recent KV keys and appends to a Parquet file in R2 is enough to unblock ad-hoc analysis.
The data model I find useful has one row per shadow request with these columns: request_id, ts, request_hash, user_segment, input_intent, cosine, exact_hash_match, length_diff, tool_calls_v1_count, tool_calls_v2_count, latency_v1_ms, latency_v2_ms, cost_v1_usd, cost_v2_usd, refusal_v1, refusal_v2, schema_valid_v1, schema_valid_v2. With this layout a single SQL query can answer "is the new version cheaper for users on the free tier but more expensive for users on the pro tier?" — exactly the segmented view that surfaces problems aggregate scores hide.
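The row layout can be pinned down as a type, with the free-versus-pro question expressed as a small aggregation. This is a sketch: the field types (for example, `ts` as an ISO string and refusals as booleans) and the `costRatioBySegment` helper are assumptions; the column names are the ones listed above. In a real analytics store the aggregation would be a SQL GROUP BY.

```typescript
type ShadowRow = {
  request_id: string;
  ts: string; // assumed to be an ISO 8601 timestamp
  request_hash: string;
  user_segment: string;
  input_intent: string;
  cosine: number;
  exact_hash_match: boolean;
  length_diff: number;
  tool_calls_v1_count: number;
  tool_calls_v2_count: number;
  latency_v1_ms: number;
  latency_v2_ms: number;
  cost_v1_usd: number;
  cost_v2_usd: number;
  refusal_v1: boolean;
  refusal_v2: boolean;
  schema_valid_v1: boolean;
  schema_valid_v2: boolean;
};

type CostFields = Pick<ShadowRow, "user_segment" | "cost_v1_usd" | "cost_v2_usd">;

// Total v2 cost divided by total v1 cost, per user segment.
function costRatioBySegment(rows: CostFields[]): Record<string, number> {
  const totals = new Map<string, { v1: number; v2: number }>();
  for (const row of rows) {
    const t = totals.get(row.user_segment) ?? { v1: 0, v2: 0 };
    t.v1 += row.cost_v1_usd;
    t.v2 += row.cost_v2_usd;
    totals.set(row.user_segment, t);
  }
  const ratios: Record<string, number> = {};
  totals.forEach((t, segment) => {
    if (t.v1 > 0) ratios[segment] = t.v2 / t.v1; // skip segments with no v1 spend
  });
  return ratios;
}
```

A ratio below 1 for one segment and above 1 for another is exactly the kind of split that an aggregate cost chart averages away.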
I tag every row with a campaign_id (the v2 deployment hash) so I can compare campaigns side by side. Six months in, you start seeing patterns: certain prompt edits regress on certain segments. That institutional memory is worth more than any single rollout decision.
A real rollout flow I run on my own products
Finally, here's the actual flow I use on my personal products. Treat it as a starting point to adapt rather than a recipe to copy verbatim.
A new agent version is built on a feature branch, and merging the PR auto-deploys a v2 worker. CI sets SHADOW_ENABLED=true in Cloudflare KV, and shadow starts immediately. For the first 24 hours I observe only — Grafana dashboards on meanCosine, p95, and schema-validity rate.
If after 24 hours the criteria are green, the routing logic in /api/agent/run flips to "10% of real traffic gets v2 as primary". From here it follows a normal canary release. If anything looks off, flipping a KV flag reverts everything to v1 in seconds.
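One way to implement the "10% gets v2 as primary" flip is deterministic bucketing, so the same caller always lands on the same version and raising the percentage only adds callers to v2. This is an assumption about the routing, not the article's code: `rolloutBucket` and `pickPrimaryVersion` are hypothetical, the key is assumed to be a stable user ID, and the hash is an FNV-1a-style hash over UTF-16 code units.

```typescript
// Maps a stable key to a bucket in 0..99 using an FNV-1a-style 32-bit hash.
function rolloutBucket(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100;
}

// v2Percent is read from the rollout flag: 0 means all v1, 100 means all v2.
function pickPrimaryVersion(userId: string, v2Percent: number): "v1" | "v2" {
  return rolloutBucket(userId) < v2Percent ? "v2" : "v1";
}
```

Because the bucket depends only on the key, setting the flag back to 0 reverts every caller to v1 without any per-user state.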
Since adopting this flow, my "rollback right after release" incidents have effectively gone to zero. Problems get caught in shadow, before users see them. The psychological cost of releases dropped sharply, which let me ship more often. Shadow mode genuinely changed my deployment culture — and I think it can change yours.
Operating agents in production is materially harder than running a normal web app. Output uncertainty, cost, latency and external API dependencies all conspire to make the old playbooks insufficient. But this kind of problem is rarely solvable without observing real traffic, and shadow mode is the most practical answer I've found for observing real traffic safely.
The smallest first step you can take today is adding a runtime flag to your agent context. That single line of plumbing puts you halfway to shadow mode whenever you decide to flip it on.