Antigravity Lab
App Development / 2026-06-30 / Advanced

When Every Antigravity-Written Test Is Green but the Same Bug Comes Back — Field Notes on Measuring Hollow Assertions

Your AI-written tests all pass, coverage is high, yet the same defect returns to production. The cause is over-mocking and tautological assertions. These are field notes on using mutation testing as ground truth to measure what your tests actually protect, and to fix it operationally.

Tags: antigravity, app-dev, testing, mutation-testing, vitest, stryker, ci-cd


The morning green tests didn't save me

A few days after shipping a fix, a defect I had supposedly closed came back in exactly the same shape. The tests were all green, coverage sat in the low 90s, yet in production the discount was being applied to the tax-inclusive price instead of the pre-tax one. Re-running the suite locally, everything still passed.

Digging in, the cause was neither the number of tests nor the coverage figure. The tests the Antigravity agent had generated when I asked it to "write tests for this function" were passing through the code without verifying a single thing about its behaviour. They executed, so line coverage climbed. But the logic could be wrong and nobody would notice. "Executed" and "verified" are not the same thing — and a regression bug is a blunt way to relearn that.

These are field notes on making the quiet hollowness of AI-written tests visible with mutation testing, and on fixing it in day-to-day operation. The examples use Vitest and Stryker Mutator, but the reasoning carries straight over to Jest.
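
To make "executed" versus "verified" concrete, here is a toy sketch in plain TypeScript. It is not how Stryker works internally and it uses no test framework; the names (`Impl`, `mutants`, `hollowTest`, `pinnedTest`, `mutationScore`) and the three hand-written mutants are assumptions made up for illustration. It shows the idea mutation testing relies on: inject a small bug, rerun the test, and see whether anything goes red.

```typescript
// Toy illustration of the mutation-testing idea. Hypothetical names throughout;
// a real tool such as Stryker generates and runs mutants for you.

type Impl = (price: number, rate: number) => number

// The implementation under test.
const original: Impl = (price, rate) => price - price * rate

// Hand-written "mutants": small deliberate bugs of the kind a mutation tool injects.
const mutants: Record<string, Impl> = {
  'minus -> plus': (price, rate) => price + price * rate,
  'times -> divide': (price, rate) => price - price / rate,
  'drop the discount': (price, _rate) => price,
}

// A test is modelled as a predicate: true means "green" for the given implementation.
type Test = (impl: Impl) => boolean

// Hollow: runs the code (so line coverage rises) but only checks that a number came back.
const hollowTest: Test = (impl) => typeof impl(1000, 0.1) === 'number'

// Pinned: compares against a literal worked out by hand.
const pinnedTest: Test = (impl) => impl(1000, 0.1) === 900

// A mutant is "killed" when the test goes red for it; otherwise it "survives".
function mutationScore(test: Test): { killed: number; total: number; survivors: string[] } {
  const survivors = Object.entries(mutants)
    .filter(([, impl]) => test(impl))
    .map(([name]) => name)
  const total = Object.keys(mutants).length
  return { killed: total - survivors.length, total, survivors }
}

const hollowResult = mutationScore(hollowTest)
const pinnedResult = mutationScore(pinnedTest)

console.log('hollow:', hollowResult)
console.log('pinned:', pinnedResult)
```

Both tests are green on the original, and both execute every line of it. Against these three mutants the hollow test kills none and the pinned test kills all three, which is the gap a coverage figure cannot show.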

Why AI-generated tests pass without protecting anything

An agent reads the code you give it and generates tests that pass against that code. That's the trap. The shortest path to a passing test isn't to pin down behaviour; it's to rubber-stamp the current output. The three patterns I hit over and over were these.

| Pattern | Symptom | Why it can't catch regressions |
| --- | --- | --- |
| Over-mocking | Every dependency mocked with fixed return values | You're testing the mock's configured values, not the implementation; the real logic never runs |
| Tautology | The expected value is computed with the same formula as the implementation | If the implementation is wrong, the expectation is wrong in the same way, so they always match |
| Snapshot rubber-stamping | First output is frozen as a snapshot and compared thereafter | A wrong output gets baked in as the correct answer |
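
The over-mocking row is easier to see with a small sketch. This one is framework-free so it runs on its own; in Vitest the stub would typically come from `vi.fn()` or `vi.mock()`. The function `lineTotal`, the two discount implementations, and both test predicates are hypothetical names invented for this illustration, not code from the project described here.

```typescript
// Hypothetical sketch of over-mocking: the stubbed test never runs the real discount logic.

type DiscountFn = (price: number, rate: number) => number

// Code under test: per-line total, with the discount logic injected as a dependency.
function lineTotal(price: number, rate: number, qty: number, applyDiscount: DiscountFn): number {
  return applyDiscount(price, rate) * qty
}

const correctDiscount: DiscountFn = (price, rate) => price - price * rate
const brokenDiscount: DiscountFn = (price, rate) => price + price * rate // deliberate bug

// Over-mocked: the real discount function is replaced by a stub with a canned value,
// so the `_real` argument is never called and its bugs cannot surface.
function overMockedTest(_real: DiscountFn): boolean {
  const stub: DiscountFn = () => 900
  return lineTotal(1000, 0.1, 2, stub) === 1800
}

// Integrated: the real discount logic runs, and the expectation is a hand-computed literal.
function integratedTest(real: DiscountFn): boolean {
  return lineTotal(1000, 0.1, 2, real) === 1800
}

console.log('over-mocked, correct impl:', overMockedTest(correctDiscount))
console.log('over-mocked, broken impl: ', overMockedTest(brokenDiscount))
console.log('integrated, correct impl: ', integratedTest(correctDiscount))
console.log('integrated, broken impl:  ', integratedTest(brokenDiscount))
```

The over-mocked predicate is green for both implementations because all it checks is that the stub's 900 gets multiplied by 2. Only the integrated predicate turns red when the discount itself is wrong.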

Tautology is the hardest to catch. In a test for a discount function, an agent will often write something like this.

import { describe, it, expect } from 'vitest'
import { applyDiscount } from '../src/cart/discount'
 
it('applies the discount', () => {
  const price = 1000
  const rate = 0.1
  // ❌ expectation computed with the same formula as the implementation (hollow)
  const expected = price - price * rate
  expect(applyDiscount(price, rate)).toBe(expected)
})

This goes green. But whether applyDiscount discounts the pre-tax or the tax-inclusive amount, as long as expected is derived from the same expression the two always agree. All it verifies is that writing the same formula twice yields the same number. The spec is never exercised. It should have been written like this.

it('applies a 10% discount to the pre-tax price', () => {
  // ✅ pin the expectation to a literal that doesn't depend on the implementation
  expect(applyDiscount(1000, 0.1)).toBe(900)
})
 
it('leaves the price unchanged at rate 0', () => {
  expect(applyDiscount(1000, 0)).toBe(1000)
})
 
it('reduces the price to 0 at rate 1', () => {
  expect(applyDiscount(1000, 1)).toBe(0)
})

Decouple the expectation from the implementation's formula and pin it to a constant you worked out by hand. That alone turns the test red the moment the logic drifts. When you delegate test writing to an agent, stating the constraint "expectations must be literals" up front changes the result dramatically.
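
As a rough sketch of that difference, the snippet below models the production bug from the opening: a discount taken from the tax-inclusive price. It assumes the tautological test was written by reading the already-drifted code, so its expectation is the implementation's own expression. The names, the 10% tax figure, and the framework-free predicates are all assumptions for illustration.

```typescript
// Hypothetical sketch: a mirrored expectation stays green on drifted logic, a literal does not.

type Formula = (price: number, rate: number) => number

const TAX_PERCENT = 10 // assumed tax rate, for illustration only

// Intended behaviour: discount the pre-tax price.
const preTaxDiscount: Formula = (price, rate) => price - price * rate

// Drifted behaviour: the discount is taken from the tax-inclusive price.
const taxInclusiveDiscount: Formula = (price, rate) => {
  const withTax = price + (price * TAX_PERCENT) / 100
  return withTax - withTax * rate
}

// Tautological test, as an agent tends to write it after reading the code:
// the expectation is the implementation's own expression, copied into the test.
function tautologicalTest(impl: Formula, copiedFormula: Formula): boolean {
  const price = 1000
  const rate = 0.1
  const expected = copiedFormula(price, rate)
  return impl(price, rate) === expected
}

// Literal test: the expectation comes from the spec, worked out by hand.
function literalTest(impl: Formula): boolean {
  return impl(1000, 0.1) === 900
}

const drifted = taxInclusiveDiscount
console.log('tautological on drifted:', tautologicalTest(drifted, drifted))
console.log('literal on drifted:     ', literalTest(drifted))
console.log('literal on intended:    ', literalTest(preTaxDiscount))
```

The tautological predicate can only ever agree with whatever formula it was copied from, so it stays green on the drifted code. The literal 900 encodes the spec, so it is green for the intended behaviour and red for the drifted one.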


What you'll learn

  • How to surface coverage theater (90% line coverage with a low mutation score) by reading Stryker's survived mutants
  • An assertion audit that spots the three ways AI-generated tests run hollow: over-mocking, tautology, and snapshot rubber-stamping
  • A CI pattern that gates mutation score on changed files only, stopping regressions without slowing the pipeline