Antigravity Lab
Agents & Manager · 2026-06-16 · Advanced

When Your Antigravity Agent Eval Gate Keeps Flickering — Build Notes on Pass/Fail That Survives Non-Determinism

Same code, yet the eval passes in the morning and fails by noon. The first thing that breaks when you put agent evaluation into CI on Antigravity is the stability of the verdict. Here's how I separate noise from real regression and lock down pass/fail in code.

Tags: antigravity, ai-agent, evaluation, ci-cd, testing, reliability

On the first day I put agent evaluation into CI, I started doubting my own eval code. I hadn't changed a single character, yet the morning build was green and the noon build was red. The same test case passed and failed at random.

My first instinct was "let me fix this flaky test." But this wasn't a bug to fix. The real mistake was trying to measure a probabilistic system with a binary pass/fail in the first place.

As an indie developer writing agents on Antigravity, I've found that making the evaluation trustworthy is far harder than writing the evaluation itself. Here I want to share, with actual code, how I quiet a flickering eval gate so that only genuine regressions turn it red.

Don't "fix" the variance — measure it

A probabilistic agent's output differs a little every time. Even with temperature pinned to zero, behavior wobbles with tool-call ordering, external API responses, and how tightly the context window is packed.

If you use a binary passed: true / false here, the verdict oscillates right on the boundary. Put a 0.8 threshold between scores of 0.79 and 0.81, and the same agent flips pass/fail across trials. That's a loss of information: you forced a continuous quantity through a single cut point, so the variance leaks straight into the verdict as noise.
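To make the flicker concrete, here is a minimal sketch. The function names and the five sample scores are illustrative assumptions, not part of the eval code in this article. It only shows how a fixed cut at 0.8 turns nearly identical scores into alternating verdicts.

```typescript
// Illustrative only: hypothetical scores from five runs of the same unchanged agent.
const sampleScores: number[] = [0.79, 0.81, 0.8, 0.78, 0.82];

// Apply a single-run binary gate to each trial independently.
function singleRunVerdicts(trialScores: number[], threshold: number): boolean[] {
  return trialScores.map((s) => s >= threshold);
}

// Count how often consecutive runs disagree with each other.
function countFlips(verdicts: boolean[]): number {
  let flips = 0;
  for (let i = 1; i < verdicts.length; i++) {
    if (verdicts[i] !== verdicts[i - 1]) flips++;
  }
  return flips;
}

const verdicts = singleRunVerdicts(sampleScores, 0.8);
console.log(verdicts, countFlips(verdicts));
```

The scores span only 0.04, yet the verdict changes three times in five runs. Nothing about the agent changed; the cut point simply converts the spread into flips.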

There's only one healthy direction: drop the single-run verdict and judge on a distribution of scores across trials. Run the same case n times, look at the mean and the spread. A small spread means you can cut at a threshold with confidence; a large spread is itself a different signal — "this case is fundamentally unstable."

// eval-types.ts
// Make the aggregate of multiple trials, not a single run, the unit of evaluation.
 
export interface TrialResult {
  score: number;        // 0.0 - 1.0 (continuous score that allows partial credit)
  latencyMs: number;
  toolCallCount: number;
  failures: string[];
}
 
export interface CaseStats {
  caseId: string;
  trials: number;
  meanScore: number;
  stdDev: number;        // spread = a measure of instability
  ciLow: number;         // lower bound of the 95% CI for the mean score
  p95LatencyMs: number;  // look at the tail, not the average
}
 
export function aggregate(caseId: string, results: TrialResult[]): CaseStats {
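  const n = results.length;
  if (n === 0) throw new Error(`aggregate: no trials for case ${caseId}`);
 
  const scores = results.map((r) => r.score);
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  // Sample variance (divide by n - 1): the trials are a sample of the agent's
  // behavior, so dividing by n would understate the spread at small n.
  const variance = n > 1 ? scores.reduce((a, s) => a + (s - mean) ** 2, 0) / (n - 1) : 0;
  const stdDev = Math.sqrt(variance);
 
  // Lower bound of the 95% CI from the standard error (z = 1.96)
  const stdErr = stdDev / Math.sqrt(n);
  const ciLow = mean - 1.96 * stdErr;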
 
  const latencies = results.map((r) => r.latencyMs).sort((a, b) => a - b);
  // Nearest-rank p95: the smallest latency with at least 95% of trials at or below it.
  const p95LatencyMs = latencies[Math.max(0, Math.ceil(n * 0.95) - 1)];
 
  return { caseId, trials: n, meanScore: mean, stdDev, ciLow, p95LatencyMs };
}

The key is that I gate on ciLow, the lower bound of the confidence interval. Even if the mean is 0.85, a wide spread can drag the lower bound down to 0.7. To keep out an agent that only passes when it gets lucky, it's safer to draw the gate at the pessimistic lower bound rather than the optimistic mean.
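A small worked example may help here. The scores, the function name meanAndCiLow, and the 0.8 threshold are all assumptions made for illustration. The sketch uses the sample standard deviation (dividing by n - 1) and the normal approximation z = 1.96, which tends to be optimistic with only a handful of trials.

```typescript
// Hypothetical scores from five trials of one case (the mean is 0.85).
const trialScores: number[] = [1.0, 0.9, 0.9, 0.8, 0.65];

// Mean and 95% CI lower bound, using the sample standard deviation
// (divide by n - 1) and the normal approximation z = 1.96.
function meanAndCiLow(scores: number[]): { mean: number; ciLow: number } {
  const n = scores.length;
  if (n < 2) throw new Error("need at least two trials to estimate spread");
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  const variance = scores.reduce((a, s) => a + (s - mean) ** 2, 0) / (n - 1);
  const stdErr = Math.sqrt(variance) / Math.sqrt(n);
  return { mean, ciLow: mean - 1.96 * stdErr };
}

const { mean, ciLow } = meanAndCiLow(trialScores);
const threshold = 0.8; // illustrative absolute threshold

console.log(`mean=${mean.toFixed(3)} passes=${mean >= threshold}`);
console.log(`ciLow=${ciLow.toFixed(3)} passes=${ciLow >= threshold}`);
```

With these numbers the mean of 0.85 clears 0.8, while the lower bound lands near 0.73 and does not. Gating on the mean would wave this case through; gating on the lower bound asks for either a higher score or a tighter spread.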

Separate noise-driven failures from genuine decay

Once you can measure spread across trials, the next question arrives: "when it goes red, is that noise, or did it actually get worse?" Get this wrong and you'll be jerked around by noise while missing real decay.

What I use is baseline comparison instead of an absolute threshold. Store each case's latest mean score and standard deviation from the main branch as a reference, and only go red when a new change drops significantly below it. Rather than a fixed line at 0.8, you ask "did this get worse than last time?"
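How the baseline gets stored is not spelled out here, so the following is only one possible sketch. It assumes the baseline is a plain JSON snapshot keyed by case ID that CI saves after each main build; the type names, the toBaseline helper, and the case IDs are hypothetical.

```typescript
// Minimal shape of the per-case stats this sketch needs.
interface CaseSummary {
  caseId: string;
  meanScore: number;
  stdDev: number;
}

type Baseline = Record<string, { meanScore: number; stdDev: number }>;

// Build a baseline snapshot from stats measured on main.
function toBaseline(stats: CaseSummary[]): Baseline {
  const baseline: Baseline = {};
  for (const s of stats) {
    baseline[s.caseId] = { meanScore: s.meanScore, stdDev: s.stdDev };
  }
  return baseline;
}

// Hypothetical stats from the latest main build.
const mainStats: CaseSummary[] = [
  { caseId: "refund-flow", meanScore: 0.9, stdDev: 0.03 },
  { caseId: "search-tool", meanScore: 0.82, stdDev: 0.12 },
];

// Serialize so CI can keep it as an artifact and reload it on the next run.
const serialized = JSON.stringify(toBaseline(mainStats), null, 2);
const reloaded: Baseline = JSON.parse(serialized);
console.log(reloaded["refund-flow"]);
```

Keeping the standard deviation next to the mean matters: the comparison needs to know how noisy each case already was on main, not just where it scored.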

// regression-gate.ts
// Detect significant decay from a baseline, not against an absolute threshold.
 
import type { CaseStats } from "./eval-types";
 
export interface GateVerdict {
  pass: boolean;
  reason: string;
  regressions: string[];
  flaky: string[];
}
 
export function judgeAgainstBaseline(
  current: CaseStats[],
  baseline: Record<string, { meanScore: number; stdDev: number }>,
): GateVerdict {
  const regressions: string[] = [];
  const flaky: string[] = [];
 
  for (const cur of current) {
    const base = baseline[cur.caseId];
 
    // Quarantine cases with too much spread as "unstable." Don't turn the
    // gate red; surface them separately as targets for improvement.
    if (cur.stdDev > 0.2) {
      flaky.push(`${cur.caseId}: stdDev=${cur.stdDev.toFixed(2)} (unstable, needs work)`);
      continue;
    }
 
    if (!base) continue; // No baseline for brand-new cases
 
    // Regression rule: did the CI lower bound fall below the baseline mean
    // by more than the noise band (the baseline's standard deviation)?
    const noiseBand = Math.max(0.05, base.stdDev);
    if (cur.ciLow < base.meanScore - noiseBand) {
      regressions.push(
        `${cur.caseId}: ${base.meanScore.toFixed(2)} -> ${cur.meanScore.toFixed(2)} (CI low ${cur.ciLow.toFixed(2)})`,
      );
    }
  }
 
  const pass = regressions.length === 0;
  return {
    pass,
    reason: pass ? "No significant decay from baseline" : `Detected ${regressions.length} regression(s)`,
    regressions,
    flaky,
  };
}

Two things earn their keep here. One is noiseBand: using the baseline's own spread as the noise width (with a floor of 0.05) makes the verdict lenient on cases that were always wobbly and strict on ones that were stable. The other is routing unstable cases out of the gate into flaky. Don't fail the build because a case is unstable. Instead, make it visible as "a case whose design needs rethinking." By reserving red for genuine decay, you keep the red signal trustworthy.
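The effect of the noise band is easiest to see with two baselines that share a mean but differ in spread. This sketch reuses the same rule as the gate; the regressionLine helper and all the numbers are illustrative assumptions.

```typescript
type BaselineEntry = { meanScore: number; stdDev: number };

// Same rule as the gate: the regression line sits one noise band below the
// baseline mean, and the band never shrinks below 0.05.
function regressionLine(base: BaselineEntry): number {
  const noiseBand = Math.max(0.05, base.stdDev);
  return base.meanScore - noiseBand;
}

// Two hypothetical baselines with the same mean but different spread.
const stableCase: BaselineEntry = { meanScore: 0.9, stdDev: 0.01 };
const wobblyCase: BaselineEntry = { meanScore: 0.9, stdDev: 0.15 };

const currentCiLow = 0.8; // hypothetical CI lower bound from a new change

const stableRegressed = currentCiLow < regressionLine(stableCase); // line at 0.85
const wobblyRegressed = currentCiLow < regressionLine(wobblyCase); // line at 0.75
console.log({ stableRegressed, wobblyRegressed });
```

The same lower bound of 0.8 counts as a regression for the stable case and as ordinary noise for the wobbly one. The 0.05 floor keeps a case with near-zero baseline spread from tripping the gate on a trivial dip.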
