ANTIGRAVITY LAB
Agents & Manager · 2026-06-29 · Advanced

When Your Antigravity-Written Tests Only Look Green: Measuring Effectiveness with Mutation Testing

Tests written by an Antigravity agent can pass and still fail to catch the bug that matters. Here is how to measure their real effectiveness with mutation testing, and adopt tests only after they kill the surviving mutants, with working code.

Tags: Antigravity, testing, mutation testing, quality assurance, agents


I'm an indie developer. While reworking the AdMob ad-display logic in one of my apps, I asked an Antigravity agent to "write tests for this module." Back came a dozen-plus cases, all green. I shipped with confidence, and a bug slipped through: under one specific condition, the interstitial fired twice.

Every test had passed.

Reading them back, the generated tests mostly checked that "the function doesn't throw" or that "the return value is truthy." Not one of them asserted the behavior right at the boundary, where the logic actually changes.

The more you delegate tests to an agent, the more your whole workflow hinges on spotting these pass-only tests. This article walks through using mutation testing — which measures the effectiveness of the tests themselves — to filter them before you adopt them.

Why a green test alone can't be trusted

A passing test cannot distinguish between two completely different states. One is "it passes because the implementation is correct." The other is "it passes because the test verifies nothing."

Coverage has the same trap. 100% line coverage only guarantees that every line ran at least once. If a line executes but nobody checks the result, bugs walk straight through.
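To make the coverage trap concrete, here is a minimal sketch. The `canShowAd` guard and both test functions are hypothetical illustrations, not the actual AdMob code from the app, and the tests use Node's built-in `assert` so the example needs no test framework.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical guard: allow an interstitial only once at least
// `minGap` screens have passed since the last one.
function canShowAd(screensSinceLastAd: number, minGap: number): boolean {
  if (screensSinceLastAd >= minGap) {
    return true;
  }
  return false;
}

// Executes every line of canShowAd (100% line coverage),
// but never looks at what comes back.
function coverageOnlyTest(): void {
  canShowAd(5, 3);
  canShowAd(1, 3);
}

// Same inputs, same coverage, but the results are actually checked.
function assertingTest(): void {
  assert.equal(canShowAd(5, 3), true);
  assert.equal(canShowAd(1, 3), false);
}
```

Both functions produce an identical coverage report. Only the second one can fail if the guard returns the wrong value.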

AI-written tests lean toward exactly this kind of weak assertion. Unless you say otherwise in the prompt, the agent takes the shortest path to "a test that doesn't fail." Even tautological assertions — recomputing a result with the same expression and comparing it to itself — pass without complaint.
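The three weak patterns mentioned so far ("doesn't throw", "truthy", and the tautology) can be sketched as follows. Everything here is an assumption made for illustration: `remainingAds` is a hypothetical frequency-cap helper, and `remainingAdsBroken` is a deliberately wrong copy used to show what each test lets through.

```typescript
import { strict as assert } from "node:assert";

type Impl = (shown: number, cap: number) => number;

// Hypothetical frequency cap: how many interstitials may still be shown.
const remainingAds: Impl = (shown, cap) => Math.max(cap - shown, 0);

// Deliberately broken copy (+ instead of -), standing in for a real bug.
const remainingAdsBroken: Impl = (shown, cap) => Math.max(cap + shown, 0);

// Weak pattern 1: only checks that nothing is thrown.
function doesNotThrowTest(impl: Impl): void {
  assert.doesNotThrow(() => impl(1, 3));
}

// Weak pattern 2: only checks that the result is truthy.
function truthyTest(impl: Impl): void {
  assert.ok(impl(1, 3));
}

// Weak pattern 3: tautology, the result is compared with itself.
function tautologyTest(impl: Impl): void {
  const expected = impl(1, 3);
  assert.equal(impl(1, 3), expected);
}

// A test that pins concrete expected values, including the cap boundary.
function pinnedValueTest(impl: Impl): void {
  assert.equal(impl(1, 3), 2);
  assert.equal(impl(3, 3), 0);
}
```

Run against `remainingAdsBroken`, the three weak tests still pass, while `pinnedValueTest` throws on its first assertion.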

What you need is not a measure of "did the test pass" but of "can the test catch a bug." That is what mutation testing gives you.

What mutation testing actually measures

The idea is simple. Deliberately inject a tiny bug (a mutant) into the implementation and see whether the tests notice and fail.

For example, change >= to >, + to -, return true to return false, && to ||. A tool generates many such single-point edits mechanically.
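As a hand-written illustration of one such edit, the sketch below places a `>=` original next to its `>` mutant and lists the inputs on which they disagree. The function names are hypothetical, and a real tool would generate the mutant itself rather than have you write it out.

```typescript
import { strict as assert } from "node:assert";

// Original (hypothetical) guard.
function canShowAd(screens: number, minGap: number): boolean {
  return screens >= minGap;
}

// Mutant: the single-point edit ">=" to ">".
function canShowAdMutant(screens: number, minGap: number): boolean {
  return screens > minGap;
}

// Inputs on which the original and the mutant return different results.
// Only a test that uses one of these inputs can detect this mutant.
function disagreements(inputs: number[], minGap: number): number[] {
  return inputs.filter(
    (screens) => canShowAd(screens, minGap) !== canShowAdMutant(screens, minGap),
  );
}
```

`disagreements([0, 1, 2, 3, 4, 5], 3)` returns `[3]`: the two versions differ only at the boundary, so a suite that never tests `screens === minGap` cannot tell them apart.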

For each mutant, there are two outcomes.

| Outcome | Meaning | What it says about your tests |
|---|---|---|
| killed | Some test failed because of the edit | A good test that catches that bug |
| survived | Every test stayed green despite the edit | A blind spot nobody verifies |

The share of mutants you kill is the mutation score: killed mutants divided by total mutants. The surviving mutants are, quite literally, a list of behaviors your tests overlook. The double-ad bug from the opening surfaced in exactly this way when I ran mutation testing afterwards: "the mutant that flips the double-fire guard survived."
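The killed/survived bookkeeping can be sketched in a few lines. This is a toy harness under stated assumptions, not how a real mutation testing tool is implemented: the guard, the four hand-written mutants, and both suites are hypothetical, and real tools typically report more states than the two covered here.

```typescript
import { strict as assert } from "node:assert";

type Guard = (screens: number, minGap: number) => boolean;
type Test = (impl: Guard) => void; // a test fails by throwing

// Hand-written mutants of a hypothetical `screens >= minGap` guard.
const mutants: { name: string; impl: Guard }[] = [
  { name: ">= to >", impl: (s, g) => s > g },
  { name: ">= to <", impl: (s, g) => s < g },
  { name: "always true", impl: () => true },
  { name: "always false", impl: () => false },
];

// A mutant is killed if at least one test fails against it.
function isKilled(mutant: Guard, tests: Test[]): boolean {
  return tests.some((test) => {
    try {
      test(mutant);
      return false; // test stayed green
    } catch {
      return true; // test noticed the edit
    }
  });
}

// Mutation score = killed mutants / total mutants.
function mutationReport(tests: Test[]): { score: number; survived: string[] } {
  const survived = mutants
    .filter((m) => !isKilled(m.impl, tests))
    .map((m) => m.name);
  const killed = mutants.length - survived.length;
  return { score: killed / mutants.length, survived };
}

// A "truthy only" suite versus one that pins both sides of the boundary.
const weakSuite: Test[] = [(impl) => assert.ok(impl(5, 3))];
const strongSuite: Test[] = [
  (impl) => assert.equal(impl(3, 3), true),
  (impl) => assert.equal(impl(2, 3), false),
];
```

With these suites, `mutationReport(weakSuite)` gives a score of 0.5, with the `>= to >` and `always true` mutants surviving, while `mutationReport(strongSuite)` kills all four. The `survived` list is the part worth reading: each entry names a behavior that no test pins down.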
