Articles/App Development

▣ App Development/2026-06-30Advanced

When Every Antigravity-Written Test Is Green but the Same Bug Comes Back — Field Notes on Measuring Hollow Assertions

Your AI-written tests all pass, coverage is high, yet the same defect returns to production. The cause is over-mocking and tautological assertions. These are field notes on using mutation testing as ground truth to measure what your tests actually protect, and to fix it operationally.

antigravity⁴⁰⁴ app-dev⁴⁴ testing¹⁶ mutation-testing vitest² stryker ci-cd¹²

✦ Premium Article

The morning green tests didn't save me

A few days after shipping a fix, a defect I had supposedly closed came back in exactly the same shape. The tests were all green, coverage sat in the low 90s, yet in production the discount was being applied to the tax-inclusive price instead of the pre-tax one. Re-running the suite locally, everything still passed.

Digging in, the cause was neither the number of tests nor the coverage figure. The tests the Antigravity agent had generated when I asked it to "write tests for this function" were passing through the code without verifying a single thing about its behaviour. They executed, so line coverage climbed. But the logic could be wrong and nobody would notice. "Executed" and "verified" are not the same thing — and a regression bug is a blunt way to relearn that.

These are field notes on making the quiet hollowness of AI-written tests visible with mutation testing, and on fixing it in day-to-day operation. The examples use Vitest and Stryker Mutator, but the reasoning carries straight over to Jest.

Why AI-generated tests pass without protecting anything

An agent reads the code you give it and generates tests that make it pass. That's the trap. The shortest path to a passing test isn't to pin down behaviour — it's to rubber-stamp the current output. The three patterns I hit over and over were these.

Pattern	Symptom	Why it can't catch regressions
Over-mocking	Every dependency mocked with fixed return values	You're testing the mock's configured values, not the implementation — the real logic never runs
Tautology	The expected value is computed with the same formula as the implementation	If the implementation is wrong, the expectation is wrong in the same way, so they always match
Snapshot rubber-stamping	First output is frozen as a snapshot and compared thereafter	A wrong output gets baked in as the correct answer

Tautology is the hardest to catch. In a test for a discount function, an agent will often write something like this.

import { describe, it, expect } from 'vitest'
import { applyDiscount } from '../src/cart/discount'
 
it('applies the discount', () => {
  const price = 1000
  const rate = 0.1
  // ❌ expectation computed with the same formula as the implementation (hollow)
  const expected = price - price * rate
  expect(applyDiscount(price, rate)).toBe(expected)
})

This goes green. But whether applyDiscount discounts the pre-tax or the tax-inclusive amount, as long as expected is derived from the same expression the two always agree. All it verifies is that writing the same formula twice yields the same number. The spec is never exercised. It should have been written like this.

it('applies a 10% discount to the pre-tax price', () => {
  // ✅ pin the expectation to a literal that doesn't depend on the implementation
  expect(applyDiscount(1000, 0.1)).toBe(900)
})
 
it('leaves the price unchanged at rate 0', () => {
  expect(applyDiscount(1000, 0)).toBe(1000)
})
 
it('reduces the price to 0 at rate 1', () => {
  expect(applyDiscount(1000, 1)).toBe(0)
})

Decouple the expectation from the implementation's formula and pin it to a constant you worked out by hand. That alone turns the test red the moment the logic drifts. When you delegate to AI, whether you hand it the constraint "expectations must be literals" up front changes the result dramatically.

✦

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN

✦How to surface coverage theater — 90% line coverage with a low mutation score — by reading Stryker's survived mutants

✦An assertion audit that spots the three ways AI-generated tests run hollow: over-mocking, tautology, and snapshot rubber-stamping

✦A CI pattern that gates mutation score on changed files only, stopping regressions without slowing the pipeline

Secure payment via Stripe · Cancel anytime

✦

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

Unlock all articles with Membership →

Change the ruler — mutation score, not coverage

Line coverage only measures whether code ran. Hollow tests run too, so coverage cheerfully clears 90%. What I came to trust instead is the mutation score.

Mutation testing deliberately injects small changes (mutants) into your source: - becomes +, < becomes <=, return true becomes return false. It then runs your existing tests. If a mutant makes a test go red, it was "killed"; if every test stays green despite the mutant, it "survived." Survived mutants are precisely where your tests fail to protect you.

A minimal Stryker Mutator config looks like this.

// stryker.config.mjs
export default {
  testRunner: 'vitest',
  coverageAnalysis: 'perTest',
  mutate: ['src/cart/**/*.ts', '!src/**/*.test.ts'],
  reporters: ['html', 'clear-text', 'json'],
  thresholds: { high: 80, low: 60, break: 50 },
}

Run it and a different landscape appears. For the tautological test above, Stryker can flip price - price * rate to price + price * rate, or price * rate to price / rate, and every test stays green — all survived. In my first real measurement, a group of functions with 91% line coverage scored only 47% on mutation. More than half the mutants slipped through. That gap is what coverage theater actually is.

How to brief Antigravity when it writes tests

An agent's output tracks the precision of your instructions. Instead of "write tests," handing it constraints that forbid hollowness up front worked well. The skeleton of my standard request is this.

@agent Write tests for applyDiscount in src/cart/discount.ts.
Constraints:
- Pin expected values to hand-computed literals; do not reuse the implementation's formula
- Always include boundaries (rate=0, rate=1, negative, price=0)
- Mock dependencies minimally; do not mock pure functions, verify with real values
- Do not use snapshots

I don't trust the generated tests as-is; I run an acceptance check with the mutation score. When mutants survive, the fastest move is to feed the mutant back to the agent.

@agent The following mutants survived under Stryker.
Add tests that kill them (expectations as literals):
- discount.ts:12 "price * rate" -> "price / rate"
- discount.ts:15 ">= 0" -> "> 0"

Given the concrete mutants, the agent fills exactly the unverified branches. Naming the survived mutants returned far less hollow tests than vaguely asking it to "raise coverage."

Audit the mock-to-assertion ratio

Another cheap static signal I use is the ratio of mock calls to assertions. A test that's all mock setup and very few expect calls is likely verifying the mock rather than the implementation. It's not a perfect metric, but it's enough to aim a review.

#!/usr/bin/env bash
# mock-assertion-audit.sh — rough count of mock setup vs assertions per test file
for f in $(find src -name '*.test.ts'); do
  mocks=$(grep -cE 'vi\.(mock|fn|spyOn)|mockReturnValue|mockResolvedValue' "$f")
  asserts=$(grep -cE 'expect\(' "$f")
  # warn on files with more mock setup than assertions
  if [ "$mocks" -gt "$asserts" ]; then
    echo "⚠️  $f : mocks=$mocks asserts=$asserts (over-mocking suspected)"
  fi
done

When I ran mutation testing first on the files this script flagged, the survived mutants were almost all concentrated there. Freeze pure logic behind mocks and the test becomes nothing more than a mirror of the implementation.

Stop regressions without slowing CI

Mutation testing is heavy. Running it across the whole codebase every time isn't realistic. I settled on scoping it to changed files and breaking on a score threshold.

name: Mutation Gate
on:
  pull_request:
    branches: [main]
 
jobs:
  mutation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci
      # collect only changed src files under test
      - name: Collect changed sources
        id: diff
        run: |
          CHANGED=$(git diff --name-only origin/main...HEAD -- 'src/**/*.ts' \
            | grep -v '.test.ts' | paste -sd, -)
          echo "files=$CHANGED" >> "$GITHUB_OUTPUT"
      - name: Run mutation on changed files
        if: steps.diff.outputs.files != ''
        run: npx stryker run --mutate "${{ steps.diff.outputs.files }}"

When the score drops below thresholds.break, Stryker exits non-zero and CI fails. Narrowing to the changed range kept runs to a few minutes even in a repo with hundreds of files. Gating the whole codebase from day one means existing debt keeps it red forever, so "protect only what you just touched" is the practical line. I keep the line-coverage gate and layer a thin mutation gate on top of it.

Fix the acceptance steps

Before I let AI-written tests into production operation, I run an acceptance check in this order. Fixing the steps keeps the verdict steady even when the agent changes.

Eye-check that the expectation doesn't reuse the implementation's formula (tautology)
Run Stryker on changed files only and read the mutation score and survived mutants
Feed the survived mutants back to the agent by name and have it add tests that kill them

Running just these three steps filled almost all the verification gaps that were hiding behind the line-coverage number. Conversely, leaving survived mutants in place while declaring "coverage is enough" is the single biggest pitfall that invites regressions.

Where to draw the line

As an indie developer handling billing alone, when discount or permission logic quietly drifts, I'm the only person who could notice. So I put a high mutation-score threshold on calculations tied to App Store payments and Stripe billing, and treat visual UI and plain CRUD as adequately covered by line coverage. Putting a strict threshold on everything out of fear of false positives keeps CI red forever on existing debt and tempts you to rip the gate out as a workaround — drawing this line is partly about resisting that temptation.

You don't need a high mutation score on every function. I raise the threshold only on logic where being wrong causes quiet damage — money calculations, permission checks, billing — and treat plain CRUD and visual UI as sufficiently covered by line coverage. Precisely because this is an era where AI can mass-produce tests, turning attention to "what is this protecting" rather than the test count is what actually reduced incidents for me.

A green suite tends to read as a symbol of safety, but it's worth confirming once, with a different ruler, what that green actually guarantees. If you write tests next, run Stryker once on your scariest piece of logic first. The list of survived mutants will hand you a map of exactly where your tests are blank.

Thank You for Reading

Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.