Keeping Naming and Formatting Stable When Your Model Falls Back
When a model falls back mid-run, an agent's naming conventions and formatting drift quietly. Here is how I enforce a model-independent style contract plus a drift probe to keep output consistent.
Last week I had an Antigravity agent implement the settings screens for four of my apps in one long run. Partway through, I noticed the generated code started to feel different. The early files were consistent camelCase like handleSubmit, but later ones mixed in on_submit, and indentation flipped between tabs and spaces.
The cause was easy to find. Mid-run, the higher-tier model got congested and the agent fell back to a lower tier. When the model changes, its default formatting and naming habits change too. It looks trivial to a human, but review load reliably goes up and diffs bloat with noise.
In about twelve years of shipping apps as an indie developer, the reviews that ate the most of my time were exactly this kind: not missing features, but missing consistency. So instead of hoping the model stays disciplined, I built a way to keep output stable even when it falls back, and I'll walk through it in the order I actually assembled it.
Why fallback produces silent drift
A model fallback usually leaves a trace only in the logs. The agent keeps going without stopping, so only the appearance of the output shifts. That is what makes it dangerous. An error would stop the run and get noticed; drift hides inside a successful completion.
Three things drifted most in my runs:
Formatting (indentation flipping between tabs and spaces)
Naming (camelCase vs. snake_case, presence or absence of is/has prefixes on booleans)
Structure (comment granularity, how finely functions are split, preference for early returns)
Formatters absorb the formatting, but they cannot fix naming or structure. Handing those to human review means the review cost spikes every time the model changes underneath you.
The core idea — a model-independent style contract
My approach was to define a mechanical contract that the output of any generating model must pass, before trusting any of them. Stop expecting consistency from the model's goodwill, and enforce it at the exit instead.
I split the contract into three layers:
Layer 1: A formatter (Prettier) normalizes formatting unconditionally
Layer 2: A linter (ESLint) mechanically judges naming and structural rules
Layer 3: A drift probe surfaces places where conventions shifted within the same run
Layers 1 and 2 reuse existing assets. The interesting part is Layer 3, because formatters and linters only see rule violations, while drift is a relative phenomenon — the habits changed between the first half and the second half of a run.
The first step is the configuration that folds any model's output into one shape: a Prettier config, applied immediately after the agent's generation step.
Naming is judged mechanically by the linter. The @typescript-eslint/naming-convention rule reliably catches snake_case even when the model mixes it in.
That covers formatting and naming. In the agent workflow I wedge in four fixed steps: generate, then Prettier, then ESLint --fix, then an ESLint check. Auto-fixable violations get fixed automatically, and only the unfixable ones reach a human.
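The three post-generation steps can be sketched as a small Python runner. This is a minimal sketch under stated assumptions, not the workflow file used in the original setup: it assumes Prettier and ESLint are installed in the project and invoked through npx, and the names CONTRACT_STEPS and run_contract are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

# Post-generation steps, in order (generation itself is the agent's job).
# Assumption: Prettier and ESLint are installed locally and run through npx.
CONTRACT_STEPS = [
    ("prettier", ["npx", "prettier", "--write"]),
    ("eslint-fix", ["npx", "eslint", "--fix"]),
    ("eslint-check", ["npx", "eslint", "--max-warnings", "0"]),
]


def _run(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_contract(
    files: Sequence[str],
    run: Callable[[Sequence[str]], int] = _run,
) -> list:
    """Apply every step to `files`; return the names of the steps that failed."""
    if not files:
        return []
    failed = []
    for name, base in CONTRACT_STEPS:
        code = run([*base, *files])
        # `eslint --fix` still exits non-zero while unfixable errors remain.
        # That is expected here, so the final check decides the outcome.
        if code != 0 and name != "eslint-fix":
            failed.append(name)
    return failed
```

An empty return value means the files passed the contract. A non-empty list names the steps whose remaining violations need a human.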
Detecting in-run drift from the diff
The Layer 3 drift probe is the center of this design. The goal is to catch cases that pass the linter but where habits changed between the first and second half of the run — for example, a style that started with early returns shifting to nested if blocks.
Across the files a single agent run touched, measure a few simple metrics and compare the first half against the second.
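```python
# drift_probe.py: surface where writing habits drifted within a run
import re
import subprocess
import sys

THRESHOLD = 0.35      # treat a >35% relative change as drift
MIN_BASELINE = 0.001  # floor so a zero first-half rate cannot hide new drift


def metrics(path: str) -> dict:
    """Count simple style signals in one source file."""
    with open(path, encoding="utf-8") as fh:
        src = fh.read()
    lines = src.splitlines()
    return {
        # Every `return`; collected for reference, not compared below.
        "early_return": len(re.findall(r"\breturn\b", src)),
        # An `if` indented by six or more whitespace characters counts as nested.
        "nested_if": sum(1 for l in lines if re.match(r"\s{6,}if\b", l)),
        # Lowercase identifiers with one or more underscores (on_submit, on_form_submit).
        "snake_ident": len(re.findall(r"\b[a-z]+(?:_[a-z0-9]+)+\b", src)),
        "arrow_fn": len(re.findall(r"=>", src)),
        "loc": len(lines),
    }


def normalize(m: dict) -> dict:
    """Convert raw counts to per-line rates so file length does not dominate."""
    loc = max(m["loc"], 1)
    return {k: round(v / loc, 4) for k, v in m.items() if k != "loc"}


def changed_files() -> list:
    """List added or modified TypeScript files in the last commit."""
    out = subprocess.check_output(
        # --diff-filter=AM skips deleted files, which could not be opened.
        ["git", "diff", "--diff-filter=AM", "--name-only", "HEAD~1", "HEAD"],
        text=True,
    )
    return [f for f in out.splitlines() if f.endswith((".ts", ".tsx"))]


def avg(rows, key):
    return sum(r[key] for r in rows) / max(len(rows), 1)


files = changed_files()
if len(files) < 2:
    print("✅ fewer than two files changed; nothing to compare")
    sys.exit(0)

# git lists paths in sorted order, so this split only approximates run order.
half = len(files) // 2
early = [normalize(metrics(f)) for f in files[:half]]
late = [normalize(metrics(f)) for f in files[half:]]

drift = {}
for key in ("nested_if", "snake_ident", "arrow_fn"):
    a, b = avg(early, key), avg(late, key)
    if abs(b - a) / max(a, MIN_BASELINE) > THRESHOLD:
        drift[key] = (round(a, 3), round(b, 3))

if drift:
    print("⚠️ style drift detected:", drift)
    sys.exit(1)
print("✅ consistent within run")
```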
The 35% threshold is the value I settled on after running the probe for a few weeks across my four app repositories. At 10% it fired on normal file-to-file variation; at 50% it missed real drift. Tune it per project. Copied without data of your own, it becomes an alarm that fires so often nobody looks at it.
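One way to tune the threshold on your own data is to replay candidate values over changes logged from past runs. The sketch below assumes you recorded the relative change per run and that a reviewer labeled each one as real drift or not. The Observation type, the sweep function, and the sample values are all hypothetical and only illustrate the shape of the report.

```python
from typing import Iterable, NamedTuple


class Observation(NamedTuple):
    relative_change: float  # |late - early| / early for one metric in one run
    real_drift: bool        # reviewer's manual label (assumed to exist)


def sweep(observations: Iterable[Observation], thresholds: Iterable[float]) -> dict:
    """For each candidate threshold, count false alarms and missed drift."""
    obs = list(observations)
    report = {}
    for t in thresholds:
        false_alarms = sum(
            1 for o in obs if o.relative_change > t and not o.real_drift
        )
        missed = sum(1 for o in obs if o.relative_change <= t and o.real_drift)
        report[t] = {"false_alarms": false_alarms, "missed": missed}
    return report


# Made-up observations, purely to show the shape of the output.
sample = [
    Observation(0.12, False),
    Observation(0.20, False),
    Observation(0.42, True),
    Observation(0.80, True),
]

for threshold, counts in sweep(sample, [0.10, 0.35, 0.50]).items():
    print(threshold, counts)
```

A threshold worth locking in is one where both counts stay low on your own history, not on someone else's.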
What tripped me up in production, and the fixes
The first pitfall was not measuring the fallback itself. When drift appeared, I could not tell whether it came from a model swap or simply from the task's nature changing. So I started recording whether a fallback occurred as run metadata and cross-referencing it with the drift result. If the drift rate only spikes on runs where a fallback happened, the cause is the model.
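The cross-reference can be sketched as a drift rate split by whether a fallback occurred. This assumes each run's metadata already carries a fallback flag and the probe's result. How the flag is extracted from the agent's logs is not covered here, and RunRecord and drift_rate_by_fallback are hypothetical names.

```python
from typing import Iterable, NamedTuple


class RunRecord(NamedTuple):
    run_id: str
    fallback: bool  # did the agent fall back to a lower-tier model?
    drift: bool     # did the drift probe flag this run?


def drift_rate_by_fallback(runs: Iterable[RunRecord]) -> dict:
    """Return the share of drifting runs, split by whether a fallback occurred."""
    totals = {True: 0, False: 0}
    drifted = {True: 0, False: 0}
    for r in runs:
        totals[r.fallback] += 1
        drifted[r.fallback] += int(r.drift)
    return {
        "with_fallback": drifted[True] / totals[True] if totals[True] else None,
        "without_fallback": drifted[False] / totals[False] if totals[False] else None,
    }


# Made-up records, only to show how the comparison reads.
history = [
    RunRecord("run-1", fallback=True, drift=True),
    RunRecord("run-2", fallback=True, drift=False),
    RunRecord("run-3", fallback=False, drift=False),
    RunRecord("run-4", fallback=False, drift=False),
]
print(drift_rate_by_fallback(history))
```

A large gap between the two rates points at the model swap. Similar rates suggest the task itself changed.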
The second pitfall was over-trusting --fix. Prettier and ESLint auto-fix repair formatting, but a fix can occasionally fold in a semantic change as a side effect. Keeping auto-fixes in a commit separate from the generation commit lets you tell "real change" from "reformatting" at review time. That paid off during diagnosis.
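A minimal sketch of the two-commit split, assuming the agent's output sits uncommitted in the working tree and that a callable applies the auto-fixes. The function name and the commit messages are hypothetical, not taken from the original workflow.

```python
import subprocess
from typing import Callable, Sequence


def _git(args: Sequence[str]) -> int:
    """Run one git subcommand and return its exit code."""
    return subprocess.run(["git", *args]).returncode


def commit_generation_then_fixes(
    apply_fixes: Callable[[], None],
    git: Callable[[Sequence[str]], int] = _git,
) -> None:
    """Commit the raw agent output first, then the auto-fixes as their own commit."""
    git(["add", "-A"])
    git(["commit", "-m", "feat: agent-generated changes"])
    apply_fixes()  # for example Prettier followed by eslint --fix
    git(["add", "-A"])
    # With nothing to fix, git refuses the empty commit and exits non-zero;
    # that is harmless here, so the exit code is ignored.
    git(["commit", "-m", "style: automated formatting fixes"])
```

One consequence to keep in mind: drift_probe.py compares HEAD~1 with HEAD, so if it runs after both commits its diff range would need to span both of them.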
The third was how to operate the threshold. Wiring the drift probe to exit 1 as a merge gate from day one stalls development during the tuning period. For the first two weeks I kept it warning-only (log the drift but exit 0), locked the threshold on real data, and only then turned it into a gate. Standing up a gate before you measure is, in my experience, the most resented way to roll something out.
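The warning-only phase can be sketched as a switch around the probe's exit code. This is one possible shape, not the original implementation, and the DRIFT_GATE environment variable is a hypothetical name chosen for illustration.

```python
import os


def probe_exit_code(drift: dict, gate_enabled: bool) -> int:
    """Return 1 only when drift was found and the gate is switched on."""
    if not drift:
        return 0
    if gate_enabled:
        return 1
    print("warning only: drift detected but the gate is disabled:", drift)
    return 0


def gate_enabled_from_env(env=os.environ) -> bool:
    # DRIFT_GATE is a hypothetical variable name, not part of the original setup.
    return env.get("DRIFT_GATE", "0") == "1"
```

In drift_probe.py, the final sys.exit(1) could then become sys.exit(probe_exit_code(drift, gate_enabled_from_env())), so flipping one variable in CI turns the warning into a gate.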
Where machines decide, and where humans look
My recommendation is to hand formatting and naming 100% to machines. Consistency is all that matters there, and it is not worth a human's deliberation. Structural drift — function splitting, early-return preference — I detect but deliberately do not auto-fix, because structure depends on context and a blanket machine fix can hurt readability instead.
This line mattered most when touching initialization code tied directly to AdMob revenue, where a break shows up in sales immediately. When formatting noise inflates a diff, you miss the one line that actually deserves attention. Erase the noise a machine can erase, and reserve human focus for meaningful diffs. That is the quietest, most dependable foundation I have found for holding quality steady no matter how often the model swaps underneath you.
Because I run several Dolice Labs sites solo, how fast I can put a number on these quietly costly phenomena decides whether I can keep going. Fallback-induced drift was a textbook case. Measure first, fold with machines, then look at what remains. I intend to keep that order intact in the next repository I touch.