When Your Antigravity-Written Tests Only Look Green: Measuring Effectiveness with Mutation Testing
Tests written by an Antigravity agent can pass and still fail to catch the bug that matters. Here is how to measure their real effectiveness with mutation testing and only adopt tests after they kill the surviving mutants — with working code.
As an indie developer, while reworking the AdMob ad-display logic in one of my apps, I asked an Antigravity agent to "write tests for this module." Back came a dozen-plus cases, all green. I shipped with confidence — and a bug slipped through where, under one specific condition, the interstitial fired twice.
Every test had passed.
Reading them back, the generated tests mostly checked that "the function doesn't throw" or that "the return value is truthy." Not one of them asserted the behavior right at the boundary, where the logic actually changes.
The more you delegate tests to an agent, the more your whole workflow hinges on spotting these pass-only tests. This article walks through using mutation testing — which measures the effectiveness of the tests themselves — to filter them before you adopt them.
Why a green test alone can't be trusted
A passing test fails to distinguish two completely different states. One is "it passes because the implementation is correct." The other is "it passes because the test verifies nothing."
Coverage has the same trap. 100% line coverage only guarantees that a line ran. If a line executes but nobody checks the result, bugs walk straight through.
AI-written tests lean toward exactly this kind of weak assertion. Unless you say otherwise in the prompt, the agent takes the shortest path to "a test that doesn't fail." Even tautological assertions — recomputing a result with the same expression and comparing it to itself — pass without complaint.
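To make the pattern concrete, here is a minimal sketch in plain JavaScript with no test framework. The function `discountedPrice` and its rule ("10 off for orders of 100 or more") are hypothetical and not taken from my app; the implementation contains a deliberate off-by-one bug so you can see which kind of check notices it.

```javascript
// Hypothetical function under test. Intended rule: 10 off for orders of 100 or more.
// Deliberate bug: ">" excludes an order of exactly 100.
function discountedPrice(price) {
  return price > 100 ? price - 10 : price;
}

// Tautological check: the "expected" value comes from the code under test,
// so the comparison can never fail.
function tautologicalCheck() {
  const expected = discountedPrice(100);
  return discountedPrice(100) === expected;
}

// Weak check: only confirms that something came back.
function weakCheck() {
  return discountedPrice(100) !== undefined;
}

// Meaningful check: the expected value is written down independently, from the rule.
function specCheck() {
  return discountedPrice(100) === 90;
}

console.log(tautologicalCheck(), weakCheck(), specCheck()); // true true false
```

The first two checks stay green against the buggy implementation. Only the check that states its expectation independently turns red.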
What you need is not a measure of "did the test pass" but of "can the test catch a bug." That is what mutation testing gives you.
What mutation testing actually measures
The idea is simple. Deliberately inject a tiny bug (a mutant) into the implementation and see whether the tests notice and fail.
For example, change >= to >, + to -, return true to return false, && to ||. A tool generates many such single-point edits mechanically.
For each mutant, there are two outcomes.
| Outcome | Meaning | What it says about your tests |
| --- | --- | --- |
| killed | Some test failed because of the edit | A good test that catches that bug |
| survived | Every test stayed green despite the edit | A blind spot nobody verifies |
The share of mutants you kill is the mutation score. The surviving mutants are, quite literally, a list of behaviors your tests overlook. When I later ran mutation testing on the ad module, the double-ad bug from the opening showed up in exactly this form: the mutant that flips the double-fire guard survived.
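The whole idea fits in a few lines if you write the mutants by hand. The sketch below is a toy, not how Stryker works internally: the function `original`, the four mutants, and the tiny suite are all hypothetical, and the score here ignores the extra statuses a real tool reports (timeouts, mutants with no coverage, and so on).

```javascript
// Hypothetical function under test.
const original = (age, hasConsent) => age >= 18 && hasConsent;

// Hand-written mutants, each a single-point edit of the original.
const mutants = [
  { name: ">= to >", fn: (age, hasConsent) => age > 18 && hasConsent },
  { name: "&& to ||", fn: (age, hasConsent) => age >= 18 || hasConsent },
  { name: "body to false", fn: () => false },
  { name: "hasConsent to true", fn: (age) => age >= 18 && true },
];

// A small suite: returns true when every assertion holds for `impl`.
function suitePasses(impl) {
  return impl(30, true) === true && impl(10, true) === false;
}

// Run the suite against every mutant and compute the share that was killed.
function runMutationTest(impl, mutantList, suite) {
  if (!suite(impl)) throw new Error("the suite must pass on the original first");
  const results = mutantList.map((m) => ({
    name: m.name,
    status: suite(m.fn) ? "survived" : "killed",
  }));
  const killed = results.filter((r) => r.status === "killed").length;
  return { results, score: (killed / results.length) * 100 };
}

const report = runMutationTest(original, mutants, suitePasses);
for (const r of report.results) console.log(`${r.status.padEnd(8)} ${r.name}`);
console.log(`mutation score: ${report.score.toFixed(1)}%`);
```

Two of the four mutants survive, for a score of 50%. Each survivor names a missing test: nothing checks an age of exactly 18, and nothing checks a user without consent.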
For JavaScript / TypeScript, Stryker is easy to reach for. As a subject, here is a small module that decides whether an ad may be shown.
    // adReady.js
    export function canShowInterstitial(state) {
      // After one display since launch, wait at least 60 seconds
      if (state.lastShownAt !== null) {
        const elapsed = state.now - state.lastShownAt;
        if (elapsed < 60_000) return false;
      }
      // Never show to paying users
      if (state.isPremium) return false;
      return true;
    }
Here is what the agent wrote first.
    // adReady.weak.test.js
    import { describe, it, expect } from "vitest";
    import { canShowInterstitial } from "./adReady.js";

    describe("canShowInterstitial", () => {
      it("returns a value", () => {
        const result = canShowInterstitial({
          now: 100_000,
          lastShownAt: null,
          isPremium: false,
        });
        expect(result).toBeDefined();
      });

      it("does not throw for paying users", () => {
        expect(() =>
          canShowInterstitial({ now: 0, lastShownAt: null, isPremium: true })
        ).not.toThrow();
      });
    });
These tests are green, and coverage doesn't look bad at a glance. Let's add Stryker and measure effectiveness.
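A minimal configuration for this kind of gate might look like the sketch below. It assumes `@stryker-mutator/core` and `@stryker-mutator/vitest-runner` are installed as dev dependencies and that `adReady.js` sits in the project root. The file name `stryker.config.cjs` is my choice for an ESM project, and the `high` and `low` values are placeholders that only affect report coloring. The `break` value is the gate itself: when the score falls below it, the run exits with an error.

```javascript
// stryker.config.cjs (assumed file name; pass the path to `stryker run` if it is not picked up)
/** @type {import('@stryker-mutator/api/core').PartialStrykerOptions} */
module.exports = {
  testRunner: "vitest",                 // provided by @stryker-mutator/vitest-runner
  mutate: ["adReady.js"],               // mutate only the module under test
  reporters: ["clear-text", "progress", "json"],
  thresholds: { high: 90, low: 80, break: 70 }, // a score below `break` fails the run
};
```

Run it with `npx stryker run`. The clear-text reporter prints each surviving mutant along with the file, line, and replacement it applied.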
The resulting mutation score makes it plain: the green tests verified almost none of the logic. The run can't clear the break threshold of 70, so this commit stops in CI.
Reading the survivors
Every surviving mutant is a blueprint for a test you should write.
The fact that if (state.isPremium) return false survived being changed to return true means nothing asserts the single most important rule — "don't show ads to paying users." Structurally, it was the exact mistake I made in the opening.
Rewrite the weak tests so they kill the mutants.
    // adReady.strong.test.js
    import { describe, it, expect } from "vitest";
    import { canShowInterstitial } from "./adReady.js";

    const base = { now: 1_000_000, lastShownAt: null, isPremium: false };

    describe("canShowInterstitial", () => {
      it("shows when conditions are met", () => {
        expect(canShowInterstitial(base)).toBe(true);
      });

      it("never shows to paying users", () => {
        expect(canShowInterstitial({ ...base, isPremium: true })).toBe(false);
      });

      it("does not show within 60s of the last display", () => {
        expect(
          canShowInterstitial({ ...base, lastShownAt: base.now - 59_999 })
        ).toBe(false);
      });

      it("shows at exactly 60s elapsed (boundary)", () => {
        expect(
          canShowInterstitial({ ...base, lastShownAt: base.now - 60_000 })
        ).toBe(true);
      });

      it("shows on the first display even right after launch", () => {
        // Small `now` with no prior display: fails if the null check is bypassed,
        // because `now - null` would then count as an elapsed time under 60s.
        expect(
          canShowInterstitial({ now: 1_000, lastShownAt: null, isPremium: false })
        ).toBe(true);
      });
    });
That last "exactly 60s" case kills the mutant that swaps < for <=. One value placed right on the boundary leaves the off-by-one edit nowhere to survive. Re-run, and the score clears the gate. If any mutant still survives, treat it as the next test to write rather than assuming the suite is complete.
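To see why only the boundary value does the job, you can apply the mutant by hand and compare it with the original across a few inputs. This is a sketch for illustration: `canShowInterstitial` is copied from adReady.js without the `export`, while `lessOrEqualMutant` and `distinguishingInputs` are names I made up.

```javascript
// Copied from adReady.js (export removed so the snippet runs standalone).
function canShowInterstitial(state) {
  if (state.lastShownAt !== null) {
    const elapsed = state.now - state.lastShownAt;
    if (elapsed < 60_000) return false;
  }
  if (state.isPremium) return false;
  return true;
}

// Hand-applied mutant: "<" replaced with "<=".
function lessOrEqualMutant(state) {
  if (state.lastShownAt !== null) {
    const elapsed = state.now - state.lastShownAt;
    if (elapsed <= 60_000) return false;
  }
  if (state.isPremium) return false;
  return true;
}

// Returns the elapsed values for which the original and the mutant disagree.
function distinguishingInputs(elapsedValues) {
  const now = 1_000_000;
  return elapsedValues.filter((elapsed) => {
    const state = { now, lastShownAt: now - elapsed, isPremium: false };
    return canShowInterstitial(state) !== lessOrEqualMutant(state);
  });
}

console.log(distinguishingInputs([0, 59_999, 60_000, 60_001, 120_000]).join(", ")); // 60000
```

Every other input gives the same answer from both versions, so a test that uses any of them cannot kill this mutant.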
Wiring it into the agent loop
This is where Antigravity earns its place. Reading survivors by hand doesn't scale once there are many. The list of surviving mutants is exactly the kind of input an agent handles well.
Stryker can emit a machine-readable JSON report.
    npx stryker run --reporters json
    # → reports/mutation/mutation.json
Pull out only the entries with status "Survived", reduced to file, line, and the edit, and hand that to the agent.
    node -e '
    const r = require("./reports/mutation/mutation.json");
    const out = [];
    for (const [file, f] of Object.entries(r.files)) {
      for (const m of f.mutants) {
        if (m.status === "Survived") {
          out.push(`${file}:${m.location.start.line} ${m.mutatorName} → ${m.replacement}`);
        }
      }
    }
    console.log(out.join("\n"));
    ' > survived.txt
Then give the Antigravity chat agent an instruction like the following. The crucial part is forbidding it to touch the implementation.
survived.txt lists the mutants the current tests miss. Add only test cases that kill each mutant. Do not change the implementation in adReady.js. After adding tests, run npx stryker run and repeat until the survivor count is 0.
By freezing the implementation and pointing the agent at concrete targets — the surviving mutants — you keep it from spraying out pass-only tests. With the edits it must kill given as a list, the assertions naturally gravitate toward the core of the behavior.
In my own use, the quality of the generated tests changed visibly depending on whether I handed over this list. Without it, the agent drifts toward weak assertions; with it, it reaches for boundary values and negative paths on its own.
Why chasing 100% is a mistake
One warning up front. Pinning the mutation score to 100% actually breaks your workflow.
The cause is equivalent mutants. A mutant that doesn't change behavior at all — say a <=/< difference on a loop bound that never affects the final result, or an edit inside dead code — is impossible for any test to kill, by definition. These are not blind spots; they are mutants that are correct to leave alive.
A tool can't fully decide this equivalence. So if you force 100%, the agent starts trying to kill the unkillable, writing strange assertions that don't engage the implementation's meaning, or tests so tightly coupled to internal details that they break on every refactor. That is a different kind of decay from weak tests, but just as troublesome.
In my own use, I review surviving mutants one by one and sort each into "blind spot" or "equivalent," excluding the equivalent ones explicitly with Stryker's // Stryker disable comment. I keep the threshold at 80–90% and own the rest by hand. With that line drawn, the targets I hand to the agent narrow down to only what genuinely should be killed.
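Here is a small sketch of what such an exclusion can look like. The function `maxLength` is hypothetical. Swapping `>=` for `>` only changes which branch runs when the two lengths are equal, and both branches return the same number in that case, so no test can tell the versions apart. One caveat worth knowing: the directive silences every EqualityOperator mutant on that line, including ones a test could kill, so keep tests that exercise both branches.

```javascript
// Returns the longer of two string lengths (hypothetical example).
function maxLength(a, b) {
  // Stryker disable next-line EqualityOperator: ">" is equivalent, both branches agree when lengths are equal
  return a.length >= b.length ? a.length : b.length;
}

// The equivalent mutant, written out by hand for comparison.
function maxLengthMutant(a, b) {
  return a.length > b.length ? a.length : b.length;
}

// Check a small grid of inputs: returns the first pair that separates
// the two versions, or null when none does.
function findDifference(samples) {
  for (const a of samples) {
    for (const b of samples) {
      if (maxLength(a, b) !== maxLengthMutant(a, b)) return [a, b];
    }
  }
  return null;
}

console.log(findDifference(["", "a", "ab", "xy", "abc"])); // null
```

Writing the reason after the colon keeps the decision reviewable later, when someone wonders why the line was excluded.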
How far to take it — thresholds and diff mutation
Mutation testing is expensive. Running it across the whole repo on every change takes too long to sustain. Two practical compromises help.
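First, scope it to the diff. Note that --since is a Stryker.NET option; as far as I know, StrykerJS has no such flag. In StrykerJS, either pass only the changed files to --mutate, or use incremental mode, which reuses results from the previous run and re-tests only what changed. In CI, that means caching the incremental report file between runs.
    npx stryker run --incremental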
Running mutation only on the code a pull request touched finishes in tens of seconds. You protect the effectiveness of newly written code and accept that as the boundary.
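One way to scope a run to a pull request, whatever flags your Stryker version offers, is to compute the changed files yourself and pass them to `--mutate`, which accepts a comma-separated list of file patterns. The helper below is a hypothetical sketch: the name `toMutateArg` and the file-naming rules (`.js`/`.ts` sources, `.test.`/`.spec.` test files) are assumptions to adapt to your own repo.

```javascript
// Hypothetical helper: turns a list of changed paths into a --mutate value.
function toMutateArg(changedFiles) {
  return changedFiles
    .filter((f) => /\.(js|ts)$/.test(f))                // keep source files only
    .filter((f) => !/\.(test|spec)\.(js|ts)$/.test(f))  // never mutate the tests themselves
    .join(",");
}

// Example input, as it might come from: git diff --name-only main...HEAD
const changed = ["src/adReady.js", "src/adReady.strong.test.js", "README.md"];
console.log(toMutateArg(changed)); // src/adReady.js
```

The result goes on the command line, for example `npx stryker run --mutate "src/adReady.js"`. When the list comes back empty, skip the Stryker step rather than running it with an empty pattern.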
Second, don't use a single uniform threshold. For modules where a mistake costs real money — billing, auth, ad-display conditions — demand a high score, and relax it for places like log formatting where breakage is minor. Splitting the mutate patterns or keeping per-directory config is the manageable way to do this.
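As a sketch of the split-config approach, a second config file can hold the strict gate for the modules that matter most. The file name and every directory below are hypothetical, and the threshold values are examples rather than recommendations.

```javascript
// stryker.critical.config.cjs (hypothetical file name and paths)
/** @type {import('@stryker-mutator/api/core').PartialStrykerOptions} */
module.exports = {
  testRunner: "vitest",
  mutate: [
    "src/billing/**/*.js",
    "src/auth/**/*.js",
    "src/ads/**/*.js",
    "!src/**/*.test.js",                // leading "!" excludes test files from mutation
  ],
  reporters: ["clear-text", "json"],
  thresholds: { high: 95, low: 90, break: 90 }, // strict gate for high-cost modules
};
```

Point Stryker at it with `npx stryker run stryker.critical.config.cjs`, and keep a looser default config, or none at all, for code like log formatting.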
You don't need everything at 100%. The goal is to hold one operational brake: don't trust an agent's green tests unconditionally.
As a next step, pick the single module you least want to break, and add a Stryker run scoped to it (stryker run --incremental, or --mutate limited to that module) as a CI check there. The moment the surviving mutants print as a list, what your tests were missing becomes clear, not in words but in numbers.
Thank you for reading. I hope it serves as a first filter for anyone who, like me, has started handing tests over to AI.