Where to Put Evidence and Approval When Your Agent Self-Debugs in a Real Browser

Antigravity 2.0 launches a real Chrome mid-build, clicking buttons and taking screenshots to self-heal. It is fast, but shipping that as-is is risky. Here is how to capture evidence and draw the approval boundary.

Antigravity 2.0 opens a real Chrome partway through a build. It loads the UI it just generated, clicks buttons, types into forms, takes screenshots, and tries to find and fix defects without a human ever opening DevTools. The first time I watched it, I will admit it unsettled me a little. It is fast. Within a few minutes you get back something that looks like it works.

The problem is what that "looks like" hides. If you cannot see what the agent checked, which buttons it pressed, and where it made the fix, you cannot make a call about shipping. Trading verification transparency for speed defeats the purpose. This article lays out a concrete design for keeping evidence and drawing the approval boundary so that real-browser self-debugging can live in daily operations.

Why real-browser self-debug is fast, and why it is scary as-is

Older agents tended to stop at "this probably works" after writing code. Real-browser self-debug continues into "actually open it and confirm" as one motion. Broken layouts, buttons that do not respond to clicks, exceptions in the console — static analysis cannot catch these. An agent that can surface and repair defects that only appear when you actually touch the page is a real step forward.

But a real browser has side effects. Submitting a form fires an actual request; a link navigates to an actual page. If self-debug runs while dev and production environments are still blurred together, a run meant as a test can mutate production data. And if the repair process is not recorded, you can verify neither why it was fixed nor whether it was truly fixed.

So the need is to add two things without killing the speed: evidence, and an approval boundary.

Keep evidence in three layers

Call one execution of self-debug a "run," and cut a timestamped directory per run. Inside it, keep three kinds of proof. Screenshots let a human grasp state at a glance, DOM snapshots enable diffing, and network logs confirm what side effects occurred.

Layer	What to keep	When it helps
Screenshots	PNGs before/after each step	Spot broken layouts, blank states, error screens by eye
DOM snapshots	Formatted outerHTML text	Diff between runs to see "what changed" mechanically
Network logs	Method, URL, status, target host	Detect dangerous side effects like POSTs to production

The script below wraps a browser session with Playwright and saves all three layers each time the agent acts. The premise is that the agent's browser actions run through this thin wrapper.

// evidence-session.mjs — a thin wrapper that captures evidence around real browser actions
import { chromium } from 'playwright';
import { mkdir, writeFile, appendFile } from 'node:fs/promises';
import { join } from 'node:path';
 
const RUN_ID = new Date().toISOString().replace(/[:.]/g, '-');
const RUN_DIR = join('evidence', RUN_ID);
 
export async function openEvidenceSession() {
  await mkdir(RUN_DIR, { recursive: true });
  const browser = await chromium.launch({ headless: false });
  const context = await browser.newContext();
  const page = await context.newPage();
 
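  // Network layer: append every response, and every failed or blocked request, as one JSON line
  const logPath = join(RUN_DIR, 'network.jsonl');
  const log = (entry) =>
    appendFile(logPath, JSON.stringify({ t: Date.now(), ...entry }) + '\n').catch(console.error);
  page.on('response', (res) => {
    const req = res.request();
    log({ method: req.method(), url: res.url(), status: res.status(), host: new URL(res.url()).host });
  });
  // A request aborted by a route handler never produces a response, so record it here instead
  page.on('requestfailed', (req) => {
    const error = req.failure()?.errorText ?? null;
    log({ method: req.method(), url: req.url(), status: null, host: new URL(req.url()).host, error });
  });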
 
  let step = 0;
  // Screenshot + DOM layers, saved per step
  async function capture(label) {
    const n = String(++step).padStart(3, '0');
    await page.screenshot({ path: join(RUN_DIR, `${n}-${label}.png`) });
    const html = await page.evaluate(() => document.documentElement.outerHTML);
    await writeFile(join(RUN_DIR, `${n}-${label}.html`), html);
    return n;
  }
 
  return { browser, page, capture, RUN_DIR };
}

The point of this wrapper is not to stop the agent. Instead of a human checking each move, it quietly stacks up material you can review afterward. Speed stays; you only add verifiability.
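Reviewing the network layer by eye gets tedious once a run has a few hundred lines, so here is a minimal sketch of a filter that keeps only the entries worth a second look. The function name findRiskyEntries and the host list are my own illustrative assumptions; the field names (method, host) match the JSON lines written by the wrapper above.

```javascript
// Illustrative sketch: flag entries in network.jsonl that mutate state or touch production.
// PROD_HOSTS is an assumed example list; replace it with your project's hosts.
const PROD_HOSTS = new Set(['app.example.com', 'api.example.com']);
const MUTATING = new Set(['POST', 'PUT', 'PATCH', 'DELETE']);

// Takes the raw text of network.jsonl and returns the risky entries,
// each annotated with the reasons it was flagged.
function findRiskyEntries(jsonl) {
  return jsonl
    .split('\n')
    .filter((line) => line.trim() !== '') // skip the trailing empty line
    .map((line) => JSON.parse(line))
    .map((entry) => ({
      ...entry,
      reasons: [
        ...(MUTATING.has(entry.method) ? ['mutating'] : []),
        ...(PROD_HOSTS.has(entry.host) ? ['production'] : []),
      ],
    }))
    .filter((entry) => entry.reasons.length > 0);
}
```

To use it, read network.jsonl from the run directory with readFile from node:fs/promises and pass the text in. An empty result means the run made only non-mutating requests to non-production hosts.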

Enforce the approval boundary in code

Evidence alone cannot stop the dangerous action itself. That is where the boundary comes in. The rule is simple: reads are automatic, writes and production URLs need a human. Let GET requests and navigation on non-production hosts flow freely; block POST/PUT/PATCH/DELETE unless explicitly permitted, and block every request to a production host outright, so that reaching production takes a deliberate human change to PROD_HOSTS.

// guard.mjs — block mutating actions and requests aimed at production
const PROD_HOSTS = new Set(['app.example.com', 'api.example.com']);
const MUTATING = new Set(['POST', 'PUT', 'PATCH', 'DELETE']);
 
export function installGuard(page, { allowMutations = false } = {}) {
  return page.route('**/*', (route) => {
    const req = route.request();
    const host = new URL(req.url()).host;
    const isProd = PROD_HOSTS.has(host);
    const isMutating = MUTATING.has(req.method());
 
    if (isProd) {
      // Production is blocked unconditionally (self-debug is staging-only)
      return route.abort('accessdenied');
    }
    if (isMutating && !allowMutations) {
      // Writes pass only when the allow flag is set
      return route.abort('accessdenied');
    }
    return route.continue();
  });
}

Keeping allowMutations false by default is the crux. Even if the agent decides it wants to "confirm all the way through form submission," writes will not go through unless the developer explicitly passes true. Human approval collapses onto a single flag. Not scattering the decision is what leaves you able to explain, later, why something was allowed.
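Because everything hangs on that one flag, it can help to pull the decision out of the route handler into a pure function that is easy to unit-test and whose reason can be written to the evidence log. The sketch below is one way to do that, not the only one; the name decide and the reason strings are hypothetical, and it compares hostname rather than host on the assumption that a production host should be blocked whatever port it is reached on.

```javascript
// Hypothetical helper: the guard's decision as a pure function.
// PROD_HOSTS is an assumed example list; replace it with your project's hosts.
const PROD_HOSTS = new Set(['app.example.com', 'api.example.com']);
const MUTATING = new Set(['POST', 'PUT', 'PATCH', 'DELETE']);

function decide(method, url, { allowMutations = false } = {}) {
  // hostname drops the port, so 'api.example.com:8443' still counts as production
  const { hostname } = new URL(url);
  if (PROD_HOSTS.has(hostname)) {
    // Production is refused even when mutations are approved
    return { allow: false, reason: 'production-host' };
  }
  if (MUTATING.has(method.toUpperCase()) && !allowMutations) {
    return { allow: false, reason: 'mutation-not-approved' };
  }
  return { allow: true, reason: 'ok' };
}

// Inside a Playwright route handler this could be used as:
//   const d = decide(req.method(), req.url(), { allowMutations });
//   return d.allow ? route.continue() : route.abort('accessdenied');
```

Keeping the decision free of Playwright objects means the boundary can be tested without launching a browser, and the reason string gives the after-the-fact explanation something concrete to point at.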

Reduce non-determinism and reproduce the same fix

A real browser returns slightly different results each time due to clock, animation, and network latency. A self-debug that only "happened to pass this time" can easily fail again in production. To raise reproducibility, fix three things.

First, thread a run-id through all evidence so you can uniquely tie which screenshots and network logs belong to the same execution. Second, freeze animation and the clock. Fixing the clock with page.clock and disabling transitions in CSS greatly reduces screenshot jitter. Third, do not delete directories for failed runs. Text-diffing the DOM snapshots from before and after a repair shows, line by line, what change made it pass.

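// Determinism setup. Install the clock before navigating; it starts at this time and keeps ticking.
// Use page.clock.setFixedTime() instead if Date.now() must stay constant.
await page.clock.install({ time: new Date('2026-07-04T00:00:00Z') });
// addStyleTag affects only the current document, so re-apply it after each navigation
await page.addStyleTag({
  content: `*,*::before,*::after{transition:none!important;animation:none!important}`,
});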
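For the third point, diffing snapshots only works if the HTML is broken into lines first, since outerHTML often comes back as one very long line. Here is a minimal sketch of a line diff over two snapshots. The helper names splitTags and diffLines are hypothetical, the tag splitting is a crude regex rather than a real HTML formatter, and the LCS table is quadratic in the number of lines, so for large pages an external diff tool is probably the better choice.

```javascript
// Hypothetical helpers for comparing two saved DOM snapshots line by line.

// Put each tag on its own line so a one-line outerHTML dump becomes diffable.
function splitTags(html) {
  return html
    .replace(/>\s*</g, '>\n<')
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean);
}

// LCS-based diff: returns lines prefixed with '- ' (only in before) or '+ ' (only in after).
function diffLines(before, after) {
  const a = splitTags(before);
  const b = splitTags(after);
  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  const out = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      out.push('- ' + a[i++]);
    } else {
      out.push('+ ' + b[j++]);
    }
  }
  while (i < a.length) out.push('- ' + a[i++]);
  while (j < b.length) out.push('+ ' + b[j++]);
  return out;
}
```

Feeding it the .html files from the failing run and the passing run narrows a repair down to the handful of tags that actually changed, which is usually enough to tell a real fix from a coincidence.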

With these three in place, I could finally read and judge self-debug results for myself. Before, I had no choice but to trust a green checkmark; now I open the run directory and see the screenshots lined up in click order, alongside a record of which dangerous requests were blocked.

As an indie developer, I run daily automation across several of my own sites by handing work to agents, and what always scared me about delegating was "proceeding while blind." Real-browser self-debug is the feature where that anxiety surfaces most. That is exactly why I settled on this shape — adding only evidence and approval without shaving off the speed. Fast, and explainable after the fact. At a scale one person operates, this balance feels just right.

Your next step

Take one of your existing self-debug targets and run it once with the evidence directory above. Just opening the run directory and looking at the sequence of screenshots and network.jsonl starts to reveal what the agent actually touched. From there, adding your project's production hosts to PROD_HOSTS and tightening the approval boundary one notch at a time is a realistic way forward.
