Agents & Manager · 2026-06-18 · Advanced

Handing Visual Regression to a Parallel Agent in Antigravity 2.0

A design for running a dedicated headless visual-regression agent alongside your main implementation agent using Antigravity 2.0's parallel orchestration — with a working harness and the reproducibility traps I hit in production.

Every time I add one more iPhone resolution, I get nervous that a layout is off by half a pixel somewhere. The wallpaper apps I run as an indie developer accumulate more screenshots as device support grows, and I used to end up comparing them one by one by eye. When I added the iPhone Air at 420×912 and the iPhone 17 Pro at 402×874, I missed a grid that was misaligned in exactly one spot until right before the App Store release.

Today, June 18, Gemini CLI folded into Antigravity CLI, and Antigravity 2.0's parallel orchestration became the working assumption for real projects. The third example Google gives for parallel execution is "another agent runs a headless visual regression test." For someone like me who had been relying on the human eye to the very end, that quiet capability turns out to matter. This article walks through the minimal harness for building that regression agent, plus the traps I hit running it in production.

Why pull visual regression into a separate agent

Visual regression testing — capture screenshots, compare them against approved baselines, stop when a diff appears — runs on a different clock than implementation. Run it while you are still writing code and you stall waiting; run it all at once afterward and you lose track of which change broke what.

What Antigravity 2.0 changes is that you can physically separate this inspection from the main implementation agent. While one agent rewrites a component, another keeps diffing against the previous commit's baseline. The time that implementation and inspection used to fight over inside one conversation now flows in parallel.

I think of this split the same way I learned to run four sites in parallel: keep the agent that generates content separate from the agent that runs the gates, and one stalling never freezes the other. Visual regression is the same. Put the "guardian of appearance" in its own process and the safety net runs continuously without dragging down implementation speed.

The minimal core: capture headless, diff against a baseline

Start with the pure core of visual regression, no parallelism or agents yet. It is a tiny harness that captures several viewports with Playwright and counts diff pixels with pixelmatch.

// vr/capture.mjs — capture a URL across multiple viewports
import { chromium } from "playwright";
import { mkdir } from "node:fs/promises";
 
// Real-device-like resolutions, using one of my app promo pages as the target
const VIEWPORTS = [
  { name: "iphone-air",       width: 420, height: 912 },
  { name: "iphone-17-pro",    width: 402, height: 874 },
  { name: "iphone-16-promax", width: 440, height: 956 },
  { name: "desktop",          width: 1280, height: 800 },
];
 
export async function capture(url, outDir) {
  await mkdir(outDir, { recursive: true });
  const browser = await chromium.launch(); // headless by default
  try {
    for (const vp of VIEWPORTS) {
      const page = await browser.newPage({
        viewport: { width: vp.width, height: vp.height },
        deviceScaleFactor: 2, // capture at Retina scale
      });
      await page.goto(url, { waitUntil: "networkidle" });
      await page.screenshot({ path: `${outDir}/${vp.name}.png`, fullPage: true });
      await page.close();
    }
  } finally {
    await browser.close();
  }
}

The diff side aligns the baseline and the new image to the same size, then compares pixels. pixelmatch returns the number of mismatched pixels, so we judge by the ratio against the total.

// vr/diff.mjs — return the diff ratio between baseline and current
import { PNG } from "pngjs";
import pixelmatch from "pixelmatch";
import { readFileSync, writeFileSync } from "node:fs";
 
export function diff(baselinePath, currentPath, diffOutPath) {
  const base = PNG.sync.read(readFileSync(baselinePath));
  const cur = PNG.sync.read(readFileSync(currentPath));
 
  // A size mismatch is itself "breakage" — fail immediately
  if (base.width !== cur.width || base.height !== cur.height) {
    return { changed: 1.0, reason: "dimension-mismatch" };
  }
 
  const { width, height } = base;
  const out = new PNG({ width, height });
  const mismatched = pixelmatch(
    base.data, cur.data, out.data, width, height,
    { threshold: 0.1, includeAA: false } // includeAA:false is the key to fighting flakiness
  );
  writeFileSync(diffOutPath, PNG.sync.write(out));
 
  const ratio = mismatched / (width * height);
  return { changed: ratio, reason: ratio > 0 ? "pixel-diff" : "identical" };
}

Wire those two together and you have the minimal regression loop: capture, compare, keep the diff image. Nothing special is needed up to here.

Designing baseline storage and updates

The part of visual regression that actually needs design is not the test code — it is how you handle baselines. Without deciding where the reference images live and when they get updated, you quickly reach the "everything is red" state where nobody looks anymore.

I keep these three states explicitly separated:

Approved baseline: images committed to the repo and treated as correct. They live in vr/baseline/.
Current capture: the temporary images a CI run or an agent produced. They go to vr/current/, outside git.
Diff image: the overlay a human reviews by eye. It goes to vr/diff/.

Updates happen only on an "intended visual change," through an explicit command. The important part is that nothing ever overwrites the baseline implicitly.

// vr/run.mjs: switch between compare and approve modes in one script
import { capture } from "./capture.mjs";
import { diff } from "./diff.mjs";
import { cpSync, existsSync, mkdirSync, readdirSync } from "node:fs";

const APPROVE = process.argv.includes("--approve");
const THRESHOLD = 0.002; // fail when the diff exceeds 0.2% of the frame
// Named TARGET_URL so it does not shadow the global URL class
const TARGET_URL = process.env.VR_URL ?? "http://localhost:3000";

await capture(TARGET_URL, "vr/current");

if (APPROVE) {
  // promote current to baseline: approving an intended change
  cpSync("vr/current", "vr/baseline", { recursive: true });
  console.log("Baseline updated");
  process.exit(0);
}

// diff() writes into vr/diff but does not create it, so make sure it exists
mkdirSync("vr/diff", { recursive: true });

let failed = false;
for (const file of readdirSync("vr/current")) {
  // A capture without an approved baseline cannot pass; report it instead of crashing
  if (!existsSync(`vr/baseline/${file}`)) {
    console.error(`FAIL ${file}: no baseline found (review, then run with --approve)`);
    failed = true;
    continue;
  }
  const r = diff(`vr/baseline/${file}`, `vr/current/${file}`, `vr/diff/${file}`);
  const pct = (r.changed * 100).toFixed(3);
  if (r.changed > THRESHOLD) {
    console.error(`FAIL ${file}: ${pct}% changed (${r.reason})`);
    failed = true;
  } else {
    console.log(`PASS ${file}: ${pct}%`);
  }
}
process.exit(failed ? 1 : 0);

Simply enforcing that the baseline moves only with --approve nearly eliminates the "the reference drifted without me noticing" accident. I recommend keeping that approval in human hands and never handing it to the agent: whoever rewrites the baseline takes final responsibility for how the app looks.
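One way to back that rule with code is a small guard that refuses --approve when the script runs inside an agent's lane. This is a minimal sketch, not part of the harness above: canApprove is a hypothetical helper, VR_AGENT is an assumed environment variable that you would set yourself wherever the regression agent runs, and the optional terminal check is an extra assumption for stricter setups.

```javascript
// Hypothetical guard: decide whether this process may run --approve.
// VR_AGENT is an assumed variable that you set yourself in the agent's lane.
function canApprove({ env, isTTY, requireTTY = false }) {
  if (env.VR_AGENT === "1") {
    return { ok: false, reason: "agent lanes may not approve baselines" };
  }
  if (requireTTY && !isTTY) {
    return { ok: false, reason: "approval requires an interactive terminal" };
  }
  return { ok: true, reason: "human-initiated run" };
}

// Possible use in vr/run.mjs, before the baseline copy:
//   const verdict = canApprove({ env: process.env, isTTY: Boolean(process.stdin.isTTY) });
//   if (APPROVE && !verdict.ok) {
//     console.error(`approve refused: ${verdict.reason}`);
//     process.exit(2);
//   }
console.log(canApprove({ env: { VR_AGENT: "1" }, isTTY: true }).ok); // → false
```

A guard like this only stops accidents. An agent with write access to the repo can still edit vr/baseline/ directly, so the written boundary and human review of baseline commits remain the real control.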

Taming antialiasing flakiness with thresholds

The first wall you hit after adopting visual regression is "a tiny diff appears every run even though I changed nothing." The usual cause is font antialiasing and subpixel rendering — when the environment or GPU changes, the few pixels along edges wobble.

I handle this in two layers. First, includeAA: false in pixelmatch (which is also the library's default) makes it detect antialiased pixels and leave them out of the mismatch count. Second, the wobble that remains is absorbed by a ratio threshold against the whole frame (0.2% in my case). A "fail if even one pixel differs" setting looks ideal, but in practice it becomes a test nobody can pass.

// How to set the threshold: capture the same page twice to measure the "noise floor"
import { capture } from "./capture.mjs";
import { diff } from "./diff.mjs";
 
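// Fall back to the same default URL as vr/run.mjs when VR_URL is unset
const url = process.env.VR_URL ?? "http://localhost:3000";
await capture(url, "vr/_a");
await capture(url, "vr/_b");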
 
// Diff of identical screens = the lower bound of environmental noise.
const r = diff("vr/_a/desktop.png", "vr/_b/desktop.png", "vr/_noise.png");
console.log(`noise floor: ${(r.changed * 100).toFixed(4)}%`);
// e.g. if this prints 0.03%, a threshold around 0.1–0.2% is realistic

Measure the noise floor of an identical screen first, then set the threshold to 3–5x that instead of guessing. This alone sharply lowers the odds of ending up with a regression test so flaky everyone ignores it.
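The 3–5x rule can be written down as a small helper so the threshold comes from measurements instead of memory. This is a sketch under stated assumptions: suggestThreshold is a hypothetical name, the default multiplier of 4 and the minimum of 0.05% are my own illustrative choices, and the samples are ratios like the changed value that diff() returns.

```javascript
// Hypothetical helper: derive a threshold from repeated noise-floor measurements.
// samples are ratios in [0, 1], e.g. the `changed` value returned by diff().
function suggestThreshold(samples, multiplier = 4, minimum = 0.0005) {
  if (samples.length === 0) {
    throw new Error("need at least one noise-floor sample");
  }
  // Use the worst observed noise rather than the average,
  // so a single noisy run does not turn the gate red.
  const worst = Math.max(...samples);
  // The minimum keeps a perfectly quiet page from producing a zero threshold.
  return Math.max(worst * multiplier, minimum);
}

// Three captures of an unchanged page measured 0.02%, 0.03%, and 0.025% noise.
const threshold = suggestThreshold([0.0002, 0.0003, 0.00025]);
console.log(`${(threshold * 100).toFixed(2)}%`); // prints "0.12%"
```

Taking several samples matters because a single pair of captures can happen to be unusually quiet.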

Wiring it into Antigravity 2.0's parallel pipeline

Now the main point. Antigravity CLI can launch non-interactively (headless) and stream results to stdout. Separate from the implementation agent, we stand up an agent dedicated solely to visual regression and run it in parallel.

The idea is simple: every time the main agent rewrites UI, the regression agent runs build, local serve, capture, and diff, then pushes the result back to the main side when a diff appears. With the CLI you can carve the inspection out as a single job like this.

#!/usr/bin/env bash
# vr/agent-check.sh: the inspection job the regression agent runs (non-interactive)
set -euo pipefail

npm run build
# start the preview server in the background
npm run preview &
SERVER_PID=$!
# stop the server however the script exits, including a failed wait or build step
trap 'kill "$SERVER_PID" 2>/dev/null || true' EXIT
# wait for the server to come up (assumes the preview server listens on port 3000)
npx wait-on http://localhost:3000

# run visual regression; on failure, emit the diff dir as the last stdout line
if ! VR_URL=http://localhost:3000 node vr/run.mjs; then
  echo "VR_FAILED diff_dir=vr/diff"
  exit 1
fi
echo "VR_PASSED"

Register this job as an Antigravity parallel task and run it on a separate lane from the implementation task. In AGENTS.md, it is safest to spell out the regression agent's duty as "judge and report diffs only — never update the baseline."

<!-- AGENTS.md (excerpt): make the regression agent's boundary explicit -->
## visual-regression agent
- Role: after a UI change, run `vr/agent-check.sh` and report whether a diff exists
- Output contract: end stdout with `VR_PASSED` on success / `VR_FAILED diff_dir=...` on failure
- Forbidden: running `--approve`, writing to `vr/baseline/`
- On failure: push the failing screen names and change ratios back to the implementation agent

Decide on a one-line "output contract" and the main agent can make its next decision by reading only the last line of the regression agent's stdout. Instead of relying on natural-language back-and-forth, you join them with the fixed signals VR_PASSED / VR_FAILED. When coordinating parallel agents, that mechanical handshake is what works.
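On the consuming side, the contract reduces to parsing one line. The following is a minimal sketch of such a parser. parseVrResult and the "unknown" status are hypothetical names of mine. How the implementation agent actually receives the other lane's stdout depends on your Antigravity setup and is not shown here.

```javascript
// Hypothetical parser for the one-line output contract.
// It looks only at the last non-empty line of the regression job's stdout.
function parseVrResult(stdout) {
  const lines = stdout
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const last = lines.length > 0 ? lines[lines.length - 1] : "";

  if (last === "VR_PASSED") {
    return { status: "passed" };
  }
  const failed = /^VR_FAILED diff_dir=(\S+)$/.exec(last);
  if (failed) {
    return { status: "failed", diffDir: failed[1] };
  }
  // A crash or truncated output breaks the contract; never read it as a pass.
  return { status: "unknown", lastLine: last };
}

console.log(parseVrResult("PASS desktop.png: 0.010%\nVR_PASSED\n"));
// → { status: 'passed' }
```

Treating anything that is not an exact match as "unknown" is deliberate: a job that died before printing its final line should block the pipeline the same way a failure does.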

Traps I hit in production

Once you actually start running it, you stumble on capture reproducibility more than on the test logic. Here are the representative traps that stopped me.

Font loading races. Even after waiting for networkidle, web fonts may not apply in time, producing a diff only on the first run. Inserting await page.evaluate(() => document.fonts.ready) before capture stabilized it.

Dynamic content bleeding in. Dates, random ordering, and animation change every run, so capturing them as-is leaves you red forever. Mask the relevant elements with visibility: hidden before capture, or stop animation with prefers-reduced-motion.
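Both of the fixes above can live in one helper that runs between page.goto() and page.screenshot(). This is a sketch, not the exact code I run: stabilizePage is a hypothetical name, the selector list is something you would maintain per project, and it relies on Playwright's page.emulateMedia, page.evaluate, and page.addStyleTag. Note that emulating reduced motion only stops animations on pages whose CSS honors prefers-reduced-motion.

```javascript
// Hypothetical helper: call between page.goto() and page.screenshot().
// `page` is a Playwright Page; maskSelectors is a list you maintain yourself.
async function stabilizePage(page, maskSelectors = []) {
  // Ask the page to render as if the user prefers reduced motion.
  await page.emulateMedia({ reducedMotion: "reduce" });

  // Wait until web fonts have finished loading and have been applied.
  await page.evaluate(async () => {
    await document.fonts.ready;
  });

  if (maskSelectors.length > 0) {
    // visibility: hidden hides the content but keeps its layout box,
    // so masking does not shift the rest of the page.
    const css = `${maskSelectors.join(", ")} { visibility: hidden !important; }`;
    await page.addStyleTag({ content: css });
  }
}

// Possible use inside capture():
//   await page.goto(url, { waitUntil: "networkidle" });
//   await stabilizePage(page, [".clock", "[data-random]"]);
//   await page.screenshot({ path: `${outDir}/${vp.name}.png`, fullPage: true });
```

Playwright's page.screenshot also accepts animations: "disabled" and a mask list of locators, which may fit better if you would rather not inject CSS into the page.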

Viewport vs. device mismatch. If you confuse CSS pixels with physical resolution, or forget deviceScaleFactor, the whole screen comes out blurry and the diff lights up everywhere. Published device resolutions are usually physical pixels, so you have to divide by the scale factor to get the logical viewport. I standardized on deviceScaleFactor: 2 and cross-checked the resulting sizes against the App Store screenshot requirements.
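The division itself is worth encoding so a mistyped size fails loudly instead of producing a slightly wrong viewport. A minimal sketch follows. toLogicalViewport is a hypothetical helper, and the example numbers are simply the iphone-17-pro viewport from capture.mjs multiplied by the deviceScaleFactor of 2 used there.

```javascript
// Hypothetical helper: convert a physical-pixel size into the logical (CSS)
// viewport that Playwright's `viewport` option expects.
function toLogicalViewport(physical, deviceScaleFactor) {
  const width = physical.width / deviceScaleFactor;
  const height = physical.height / deviceScaleFactor;
  // A fractional result usually means the wrong scale factor or a typo in the size.
  if (!Number.isInteger(width) || !Number.isInteger(height)) {
    throw new Error(
      `${physical.width}x${physical.height} is not divisible by ${deviceScaleFactor}`
    );
  }
  return { width, height };
}

// An 804x1748 capture at deviceScaleFactor 2 corresponds to a 402x874 viewport.
console.log(toLogicalViewport({ width: 804, height: 1748 }, 2));
// → { width: 402, height: 874 }
```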

Nondeterminism between CI and local. The same code renders slightly differently on Linux CI and on macOS locally. It is safest to hold one baseline set per "capture environment," and I treat only the images captured in the CI environment as the canonical baseline.
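If you do keep more than one baseline set, deriving the directory from the capture environment keeps the sets from mixing. This is a sketch of one possible layout, not the structure used by the harness above: baselineDirFor and the vr/baseline/<key> naming are assumptions, and the platform value would come from process.platform in a real script.

```javascript
// Hypothetical helper: one baseline directory per capture environment.
// The vr/baseline/<key> layout is an assumption for illustration.
function baselineDirFor({ platform, ci }) {
  const key = ci ? `ci-${platform}` : `local-${platform}`;
  return `vr/baseline/${key}`;
}

// In a script you might call:
//   baselineDirFor({ platform: process.platform, ci: Boolean(process.env.CI) })
console.log(baselineDirFor({ platform: "linux", ci: true }));   // → vr/baseline/ci-linux
console.log(baselineDirFor({ platform: "darwin", ci: false })); // → vr/baseline/local-darwin
```

With a layout like this, only the CI directory would be committed as canonical, and a local set stays a convenience for quick checks on your own machine.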

None of these are spelled out prominently in the official docs, but whether visual regression survives in production is decided almost entirely by nailing down this reproducibility.

Where to trust the agent, and where humans look

Finally, the operational line. I let the agent handle "detect a diff and stop," and I keep "is this diff a correct change?" in human hands. When the regression agent goes red, only a human holding the visual intent can decide whether it is an intended redesign or an accident.

Concretely, I limit the regression agent's authority to inspection and reporting, and I run the baseline update (--approve) myself after looking at the diff image. Letting the agent approve risks the worst accident — freezing breakage in place as "correct." Parallelism speeds up the cadence of inspection; it does not delegate the final responsibility for appearance. That division is my conclusion.

Start by dropping just vr/capture.mjs and vr/diff.mjs into a project at hand and measuring the noise floor of an identical screen. Once you have a feel for the threshold, connecting it to a parallel agent is a natural extension of that.
