When Your Antigravity Agent Eval Gate Keeps Flickering — Build Notes on Pass/Fail That Survives Non-Determinism
Same code, yet the eval passes in the morning and fails by noon. The first thing that breaks when you put agent evaluation into CI on Antigravity is the stability of the verdict. Here's how I separate noise from real regression and lock down pass/fail in code.
On the first day I put agent evaluation into CI, I started doubting my own eval code. I hadn't changed a single character, yet the morning build was green and the noon build was red. The same test case passed and failed at random.
My first instinct was "let me fix this flaky test." But this wasn't a bug to fix. The real mistake was trying to measure a probabilistic system with a binary pass/fail in the first place.
As an indie developer writing agents on Antigravity, I've found that making the evaluation trustworthy turns out to be far harder than the evaluation itself. Here I want to share, with the texture of the actual code, how I quiet a flickering eval gate so that only genuine regressions turn it red.
Don't "fix" the variance — measure it
A probabilistic agent's output differs a little every time. Even with temperature pinned to zero, behavior wobbles with tool-call ordering, external API responses, and how tightly the context window is packed.
If you use a binary passed: true / false here, the verdict oscillates right on the boundary. Put a 0.8 threshold between scores of 0.79 and 0.81, and the same agent flips pass/fail across trials. That's a loss of information: you forced a continuous quantity through a single cut point, so the variance leaks straight into the verdict as noise.
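To make the flip concrete, here is a small sketch with made-up numbers: five hypothetical scores from one unchanged agent on one case, each cut at a 0.8 threshold the way a single-run gate would cut them.

```typescript
// Hypothetical scores from five runs of one unchanged agent on one case.
const runScores = [0.81, 0.79, 0.8, 0.78, 0.82];
const THRESHOLD = 0.8;

// Single-run verdict: each CI run sees one score and cuts it at the threshold.
const verdicts = runScores.map((s) => s >= THRESHOLD);

// Count how often consecutive runs disagree even though nothing changed.
const flips = verdicts.slice(1).filter((v, i) => v !== verdicts[i]).length;

// The underlying scores barely move; only the verdict does.
const spread = Math.max(...runScores) - Math.min(...runScores);
console.log(verdicts, `flips=${flips}`, `spread=${spread.toFixed(2)}`);
```

The scores stay within a 0.04 band, yet the verdict changes on every consecutive run. All of the instability comes from where the cut point sits, not from the agent.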
There's only one healthy direction: drop the single-run verdict and judge on a distribution of scores across trials. Run the same case n times, look at the mean and the spread. A small spread means you can cut at a threshold with confidence; a large spread is itself a different signal — "this case is fundamentally unstable."
```typescript
// eval-types.ts
// Make the aggregate of multiple trials, not a single run, the unit of evaluation.
export interface TrialResult {
  score: number; // 0.0 - 1.0 (continuous score that allows partial credit)
  latencyMs: number;
  toolCallCount: number;
  failures: string[];
}

export interface CaseStats {
  caseId: string;
  trials: number;
  meanScore: number;
  stdDev: number; // spread = a measure of instability
  ciLow: number; // lower bound of the 95% CI for the mean score
  p95LatencyMs: number; // look at the tail, not the average
}

export function aggregate(caseId: string, results: TrialResult[]): CaseStats {
  const n = results.length;
  if (n === 0) throw new Error(`No trial results for case ${caseId}`);

  const scores = results.map((r) => r.score);
  const mean = scores.reduce((a, b) => a + b, 0) / n;

  // Sample variance (n - 1), since the trials are a sample of the agent's behavior.
  // A single trial carries no spread information, so report 0 in that case.
  const variance = n > 1 ? scores.reduce((a, s) => a + (s - mean) ** 2, 0) / (n - 1) : 0;
  const stdDev = Math.sqrt(variance);

  // Lower bound of the 95% CI from the standard error (z = 1.96).
  // With only a handful of trials this normal approximation is on the optimistic side.
  const stdErr = stdDev / Math.sqrt(n);
  const ciLow = mean - 1.96 * stdErr;

  // Nearest-rank p95: the smallest latency with at least 95% of trials at or below it.
  const latencies = results.map((r) => r.latencyMs).sort((a, b) => a - b);
  const p95LatencyMs = latencies[Math.max(0, Math.ceil(n * 0.95) - 1)];

  return { caseId, trials: n, meanScore: mean, stdDev, ciLow, p95LatencyMs };
}
```
The key is that I gate on ciLow, the lower bound of the confidence interval. Even if the mean is 0.85, a wide spread can drag the lower bound down to 0.7. To keep out an agent that only passes when it gets lucky, it's safer to draw the gate at the pessimistic lower bound rather than the optimistic mean.
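As a worked illustration of that point, the sketch below compares two invented score sets that share a mean of 0.85. It assumes the sample standard deviation (dividing by n - 1) and a hypothetical gate of 0.8; neither number comes from a real run.

```typescript
// Lower bound of an approximate 95% CI for the mean (normal approximation, z = 1.96).
function ciLowerBound(scores: number[]): { mean: number; ciLow: number } {
  const n = scores.length;
  if (n === 0) throw new Error("need at least one score");
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  if (n < 2) return { mean, ciLow: mean }; // one trial gives no spread estimate
  const variance = scores.reduce((a, s) => a + (s - mean) ** 2, 0) / (n - 1);
  const ciLow = mean - 1.96 * Math.sqrt(variance / n);
  return { mean, ciLow };
}

const GATE = 0.8; // hypothetical threshold

// Same mean, very different spread.
const lucky = ciLowerBound([1.0, 0.6, 1.0, 0.7, 0.95]);
const steady = ciLowerBound([0.85, 0.86, 0.84, 0.85, 0.85]);

console.log("lucky ", lucky.mean.toFixed(2), lucky.ciLow.toFixed(2), lucky.ciLow >= GATE);
console.log("steady", steady.mean.toFixed(2), steady.ciLow.toFixed(2), steady.ciLow >= GATE);
```

Both sets average 0.85, but the wide one has a lower bound of roughly 0.69 and fails the gate, while the tight one stays around 0.84 and passes. A gate on the mean alone could not tell them apart.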
Separate noise-driven failures from genuine decay
Once you can measure spread across trials, the next question arrives: "when it goes red, is that noise, or did it actually get worse?" Get this wrong and you'll be jerked around by noise while missing real decay.
What I use is baseline comparison instead of an absolute threshold. Store the latest score from the main branch as a reference, and only go red when a new change drops significantly below it. Rather than a fixed line at 0.8, you ask "did this get worse than last time?"
```typescript
// regression-gate.ts
// Detect significant decay from a baseline, not against an absolute threshold.
import type { CaseStats } from "./eval-types";

export interface GateVerdict {
  pass: boolean;
  reason: string;
  regressions: string[];
  flaky: string[];
}

export function judgeAgainstBaseline(
  current: CaseStats[],
  baseline: Record<string, { meanScore: number; stdDev: number }>,
): GateVerdict {
  const regressions: string[] = [];
  const flaky: string[] = [];

  for (const cur of current) {
    const base = baseline[cur.caseId];

    // Quarantine cases with too much spread as "unstable." Don't turn the
    // gate red; surface them separately as targets for improvement.
    if (cur.stdDev > 0.2) {
      flaky.push(`${cur.caseId}: stdDev=${cur.stdDev.toFixed(2)} (unstable, needs work)`);
      continue;
    }

    if (!base) continue; // No baseline for brand-new cases

    // Regression rule: did the CI lower bound fall below the baseline mean
    // by more than the noise band (the baseline's standard deviation)?
    const noiseBand = Math.max(0.05, base.stdDev);
    if (cur.ciLow < base.meanScore - noiseBand) {
      regressions.push(
        `${cur.caseId}: ${base.meanScore.toFixed(2)} -> ${cur.meanScore.toFixed(2)} (CI low ${cur.ciLow.toFixed(2)})`,
      );
    }
  }

  const pass = regressions.length === 0;
  return {
    pass,
    reason: pass
      ? "No significant decay from baseline"
      : `Detected ${regressions.length} regression(s)`,
    regressions,
    flaky,
  };
}
```
Two things earn their keep here. One is noiseBand: using the baseline's own spread as the noise width makes the verdict lenient on cases that were always wobbly and strict on ones that were stable. The other is routing unstable cases out of the gate into flaky. Don't fail the build because a case is unstable. Instead, make it visible as "a case whose design needs rethinking." By reserving red for genuine decay, you keep the red signal trustworthy.
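To see what noiseBand buys you, here is a small sketch that isolates the regression rule and applies it to two hypothetical cases. Both have a baseline mean of 0.90 and a new CI lower bound of 0.80; only the baseline spread differs. The numbers are invented for illustration.

```typescript
// Same rule as the gate: regression if ciLow falls below (baseline mean - noise band).
function isRegression(ciLow: number, baseMean: number, baseStdDev: number): boolean {
  const noiseBand = Math.max(0.05, baseStdDev);
  return ciLow < baseMean - noiseBand;
}

// Historically stable case: band is the 0.05 floor, so the cutoff sits at 0.85.
const stableCase = isRegression(0.8, 0.9, 0.02);

// Historically wobbly case: band is 0.12, so the cutoff sits at 0.78.
const wobblyCase = isRegression(0.8, 0.9, 0.12);

console.log({ stableCase, wobblyCase });
```

The identical drop is flagged for the stable case and tolerated for the wobbly one, which is the "strict on stable, lenient on wobbly" behavior described above. The 0.05 floor keeps a baseline with near-zero spread from making the gate fire on trivial movement.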
Pin seeds and the environment to shrink the variance
In parallel with measuring variance, kill the variance you can eliminate at the source. Not all non-determinism carries equal value.
For an eval run you generally want to fix four things: pin temperature and top_p to fixed values; replace tool calls with mocked responses, so real-API latency jitter stays out of the eval; inject a seed wherever you depend on time or randomness; and, the one most easily overlooked, pin the model version.
```typescript
// eval-config.ts
// A bundle of settings that narrow non-determinism, only during evaluation.
export const EVAL_MODEL_CONFIG = {
  // Pin the model string explicitly.
  // Never use "latest" or a default alias for evaluation (see below).
  model: "antigravity-preview-05-2026",
  temperature: 0,
  topP: 1,
  seed: 42, // on supporting models this raises determinism substantially
  maxOutputTokens: 4096,
} as const;

// Swap tool calls for mocks during evaluation to remove external jitter.
export function buildEvalTools(fixtures: Record<string, unknown>) {
  return {
    async callTool(name: string, args: unknown) {
      const key = `${name}:${JSON.stringify(args)}`;
      if (key in fixtures) return fixtures[key];
      // Make an unregistered call fail immediately, so you can detect
      // that "the evaluation is leaking into the real API."
      throw new Error(`Unregistered tool call: ${key}`);
    },
  };
}
```
Throwing on an unregistered fixture is deliberate. Without it, an "evaluation" can hit the real API, and latency, cost, and output all start to wobble. Designing it so "you notice the moment it leaks" keeps your eval environment pure.
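One practical wrinkle with keying fixtures on JSON.stringify(args): the string depends on property order, so an agent that emits the same arguments in a different order would miss the fixture and throw. The sketch below is one possible way around that, not part of the original setup. stableStringify, fixtureKey, and buildMockTools are hypothetical helper names, and the tool names are made up.

```typescript
// Serialize with object keys sorted, so property order cannot change the fixture key.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value) ?? "null"; // undefined has no JSON form
}

function fixtureKey(name: string, args: unknown): string {
  return `${name}:${stableStringify(args)}`;
}

function buildMockTools(fixtures: Map<string, unknown>) {
  return {
    async callTool(name: string, args: unknown): Promise<unknown> {
      const key = fixtureKey(name, args);
      if (fixtures.has(key)) return fixtures.get(key);
      // Same fail-fast behavior: an unknown call must never reach a real API.
      throw new Error(`Unregistered tool call: ${key}`);
    },
  };
}

// Register one fixture, then call it with the arguments in a different order.
const fixtures = new Map<string, unknown>([
  [fixtureKey("search", { q: "docs", limit: 3 }), { hits: ["a", "b", "c"] }],
]);
const tools = buildMockTools(fixtures);

tools.callTool("search", { limit: 3, q: "docs" }).then((r) => console.log("hit:", r));
tools.callTool("send_email", { to: "x" }).catch((e: Error) => console.log("blocked:", e.message));
```

The first call resolves from the fixture despite the reordered arguments; the second rejects immediately because nothing was registered for it.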
Regression detection when the default model upgrades on its own
This is the part that bites in 2026. With Antigravity's Managed Agents and the Gemini-family APIs, it's increasingly common to run against a preview string like antigravity-preview-05-2026 or a default alias. Convenient — but it means the model can be swapped underneath you and behavior can change, without you touching a single line of your code.
That's nasty. Since the cause of the regression isn't in your commits, a baseline comparison against the tip of main won't catch all of it. Something that was green yesterday turns red today because of a model-side update, and you'll hunt for the offending commit and never find it.
The countermeasure is to run a canary evaluation that re-takes the baseline on a schedule, independent of code changes. Antigravity has made scheduled launches easier to work with since v2.1.4, so run the agent nightly against the same dataset and record the score over time. When a delta appears, make it possible to tell whether it came from a commit or from the model.
```typescript
// canary-track.ts
// Re-take the baseline daily and catch model-update regressions over time.
import * as fs from "node:fs";

interface CanaryEntry {
  date: string;
  modelLabel: string; // identifier of the model that actually responded, if available
  meanScore: number;
  worstCases: string[]; // top cases whose score dropped sharply
}

export function recordCanary(entry: CanaryEntry, path = "canary-history.json") {
  const history: CanaryEntry[] = fs.existsSync(path)
    ? JSON.parse(fs.readFileSync(path, "utf-8"))
    : [];
  history.push(entry);

  // Compare the last two points; warn if it dropped without a code change.
  if (history.length >= 2) {
    const [prev, cur] = history.slice(-2);
    const drop = prev.meanScore - cur.meanScore;
    if (drop > 0.05) {
      const modelChanged = prev.modelLabel !== cur.modelLabel;
      console.warn(
        `Score drop ${drop.toFixed(2)} (${prev.date} -> ${cur.date})` +
          (modelChanged
            ? ` / model changed ${prev.modelLabel} -> ${cur.modelLabel}. Likely a model-side regression`
            : ` / model identifier unchanged. Suspect data quality or an external dependency`),
      );
    }
  }

  // Keep only the most recent 90 entries.
  fs.writeFileSync(path, JSON.stringify(history.slice(-90), null, 2));
}
```
If you can log the identifier of the model that responded, always do. When a change in modelLabel coincides with a score drop, you can quickly conclude it's a model-side regression. If the identifier stays the same and only the score falls, suspect your dataset or an external dependency instead. Automating the root-cause split is your best defense against decay that creeps in quietly overnight.
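The split gets sharper if each canary entry also records which commit it ran against, so a drop can be attributed to the commit, the model, or neither. The sketch below assumes a commitSha field that the canary entry above does not have, and the model labels are placeholders rather than real identifiers.

```typescript
interface CanaryPoint {
  date: string;
  modelLabel: string;
  commitSha: string; // hypothetical field: the commit the canary ran against
  meanScore: number;
}

type DropCause = "none" | "model" | "code" | "model-or-code" | "data-or-external";

function classifyDrop(prev: CanaryPoint, cur: CanaryPoint, minDrop = 0.05): DropCause {
  if (prev.meanScore - cur.meanScore <= minDrop) return "none";
  const modelChanged = prev.modelLabel !== cur.modelLabel;
  const codeChanged = prev.commitSha !== cur.commitSha;
  // Both moved at once: ambiguous until the old commit is rerun on the new model.
  if (modelChanged && codeChanged) return "model-or-code";
  if (modelChanged) return "model";
  if (codeChanged) return "code";
  return "data-or-external";
}

const monday: CanaryPoint = { date: "day-1", modelLabel: "model-build-A", commitSha: "abc123", meanScore: 0.9 };
const tuesday: CanaryPoint = { date: "day-2", modelLabel: "model-build-B", commitSha: "abc123", meanScore: 0.8 };

console.log(classifyDrop(monday, tuesday));
```

Here the commit is unchanged and the model label moved, so the drop is attributed to the model. When both change on the same night the function deliberately refuses to guess, since rerunning the previous commit against the new model is the only way to separate the two.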
Balancing trial count against cost
Multiple trials work, but they aren't free. Running one case five times multiplies API cost by five, and wall-clock time too unless the trials run in parallel. Run all 50 cases five times on every commit (250 agent runs) and CI gets so heavy nobody waits for it.
The split I settled on is this: at the pull-request stage, run stable cases once and only the cases in the flaky bucket five times. Stable cases don't wobble, so multiple trials add little value there. Decide the trial count dynamically from each case's past spread rather than a fixed number, and you concentrate cost where it's needed.
In the nightly canary, do the opposite: run every case five-plus times with time to spare. Nobody is waiting here, so it's the place to favor precision. Vary the trial count between "evaluation people wait on" and "evaluation nobody waits on." That single move let me get the perceived speed of CI back while keeping the evaluation trustworthy. As an indie developer, CI time is literally my own waiting time, so this allocation pays off especially well.
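A minimal sketch of that allocation, under a few assumptions of mine: the flaky cutoff reuses the 0.2 standard deviation from the gate, the nightly run uses exactly five trials, and a case with no history is treated as unstable until measured. The 50-case suite with six flaky cases is invented to show the arithmetic.

```typescript
type EvalMode = "pr" | "nightly";

const FLAKY_STD_DEV = 0.2; // same cutoff the gate uses for the flaky bucket
const FULL_TRIALS = 5;

// Decide the trial count from a case's past spread instead of a fixed number.
function trialsFor(pastStdDev: number | undefined, mode: EvalMode): number {
  if (mode === "nightly") return FULL_TRIALS; // nobody waits: favor precision
  if (pastStdDev === undefined) return FULL_TRIALS; // assumption: new cases get measured properly
  return pastStdDev > FLAKY_STD_DEV ? FULL_TRIALS : 1; // PR: spend trials only where it wobbles
}

// Hypothetical suite: 50 cases, 6 of them historically unstable.
const pastStdDevs = Array.from({ length: 50 }, (_, i) => (i < 6 ? 0.3 : 0.03));
const totalRuns = (mode: EvalMode) =>
  pastStdDevs.reduce((sum, sd) => sum + trialsFor(sd, mode), 0);

console.log(`pr=${totalRuns("pr")} nightly=${totalRuns("nightly")}`);
```

With these numbers a pull request costs 74 agent runs instead of 250. One design consequence is that the past spread has to come from the nightly runs, because a case that only ever runs once on pull requests never produces a spread estimate of its own.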
Where to start
Trying to perfect the whole evaluation at once usually ends with you running out of steam halfway. The realistic order is to first build only "multiple trials + gate on the CI lower bound" over your local 10–20 cases. That alone visibly cuts the time you waste chasing flaky reds.
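As a starting point, that first step can be as small as the sketch below: run each case a few times, compute the CI lower bound, and gate on it. The agent is replaced by a stub that replays canned scores, and the case names, scores, trial count, and 0.8 threshold are all placeholders to swap for your own.

```typescript
// Runs one trial of one case and returns a score in [0, 1].
type RunOnce = (caseId: string, trial: number) => Promise<number>;

interface CaseReport {
  caseId: string;
  mean: number;
  ciLow: number;
  pass: boolean;
}

async function runGate(
  caseIds: string[],
  runOnce: RunOnce,
  trials = 5,
  threshold = 0.8,
): Promise<CaseReport[]> {
  const reports: CaseReport[] = [];
  for (const caseId of caseIds) {
    const scores: number[] = [];
    for (let t = 0; t < trials; t++) scores.push(await runOnce(caseId, t));
    const n = scores.length;
    const mean = scores.reduce((a, b) => a + b, 0) / n;
    const variance = n > 1 ? scores.reduce((a, s) => a + (s - mean) ** 2, 0) / (n - 1) : 0;
    const ciLow = mean - 1.96 * Math.sqrt(variance / n);
    reports.push({ caseId, mean, ciLow, pass: ciLow >= threshold });
  }
  return reports;
}

// Stub agent for illustration: replays canned scores instead of calling a model.
const canned: Record<string, number[]> = {
  "refund-flow": [0.9, 0.92, 0.91, 0.9, 0.93],
  "multi-tool-plan": [1.0, 0.5, 0.95, 0.6, 1.0],
};
const stubAgent: RunOnce = async (caseId, trial) => {
  const scores = canned[caseId];
  if (!scores) throw new Error(`Unknown case: ${caseId}`);
  return scores[trial % scores.length];
};

runGate(Object.keys(canned), stubAgent).then((reports) => {
  for (const r of reports) {
    console.log(r.caseId, r.mean.toFixed(2), r.ciLow.toFixed(2), r.pass ? "PASS" : "FAIL");
  }
});
```

In this made-up data the second case has a mean just above 0.8 but fails, because its lower bound sits near 0.6. That is exactly the kind of case a single lucky run would have waved through.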
From there, add baseline comparison and the canary. You don't need to build all the way to production shadow evaluation from the start. It's better to run something small while watching which cases sit in the flaky bucket — that's how the prompts and tool designs worth fixing surface fastest.
Measure the probabilistic as probabilistic. It sounds obvious, yet it's a hard reframe for a mind trained on binary verdicts. For me, only once I could trust the eval gate was I able to focus on actually improving the agent. If you're stuck flickering in the same place, I hope this gives you a thread to pull.