When the Default Model Changes Underneath You: Pinning and Diff-Gating Scheduled Runs
Antigravity 2.0 promoted Gemini 3.5 Flash to the default fast model. It is a welcome upgrade, but any scheduled run that leaned on the default starts producing subtly different output one morning. Here is how I pin the model explicitly, fingerprint the output, and gate drift, sized for a solo developer's pipeline.
One morning I opened a draft from a publishing job that should have run exactly as it always did, and noticed the headings were styled just slightly differently. Nothing in the body was wrong. But the rhythm of the bullet lists and the way paragraphs broke belonged to someone else compared to the week before.
The cause was neither my prompt nor my settings. Antigravity 2.0 had quietly promoted Gemini 3.5 Flash to the default fast model. On the benchmarks it is described as beating the previous-generation Pro, with 76.2% on Terminal-Bench 2.1 and 83.6% on MCP Atlas (source: Google Developers Blog). As raw capability, the change is welcome. But for a scheduled run that had been "trusting the default," it meant the character of my output had been swapped out somewhere I wasn't watching.
This article is about the design that keeps that swap from becoming an incident. It draws on my own pipeline as an indie developer, where I auto-generate drafts for several sites every day.
Why "trusting the default" is risky only for scheduled runs
When you work interactively, a default model change rarely causes trouble. You see the output, and if it looks off you resubmit on the spot. A human performs the final inspection.
Scheduled runs are different. They run overnight, and the result flows straight into the next stage with nobody inspecting it. So even a change that is "better on average" drifts silently if your downstream was built around the previous model's habits. If a formatting script waits for a particular heading marker, or a minimum-length gate assumes the old model's volume, you can end up with the perverse outcome of a gate failing precisely because quality went up.
The problem, then, is not that the new model is bad. It is that the assumptions behind your output moved at a time you weren't aware of. What deserves attention is not which model is better, but the invisibility of the change.
Pin the model explicitly first
The first move is simple. Don't let the scheduled agent lean on the default; write out the model it uses.
{ "agent": "nightly-draft", "model": "gemini-3.1-pro", "fallback": "gemini-3.5-flash", "temperature": 0.2, "note": "Model is explicit so a default change can't reach this agent. Fallback only when the pinned model is briefly unavailable."}
With model declared, this agent keeps using the model you named even when the platform's default moves. The fallback is a statement that says "drop here only when the pinned model can't be reached." That is different in meaning from following the default automatically. The former is a choice you made between two options; the latter is one option the platform chose for you.
Pinning, though, only buys time. Leave it pinned forever and the old model's support eventually ends. The real purpose of pinning is to let you choose the timing of the change. For that, you need the diff gate that follows.
✦
Thank you for reading this far.
Continue Reading
What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.
WHAT YOU'LL LEARN
✦A minimal config that pins the model and declares a fallback so a default swap can't quietly change your output
✦A complete diff gate that normalizes and fingerprints output, failing with exit 1 only on changes you didn't sanction
✦A migration procedure for welcoming upgrades on your own terms: snapshot updates and promoting one agent at a time
Secure payment via Stripe · Cancel anytime
✦
Unlock This Article
Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.
Fingerprint the output and stop only the changes you can't accept
Once pinning has bought you time, add a mechanism that mechanically catches "what changed and how" when you do raise the model. I use a small gate: save the output for representative inputs as snapshots, and from then on compare normalized fingerprints.
#!/usr/bin/env python3"""Snapshot output and detect model-swap changes as a diff gate."""import jsonimport sysimport hashlibimport difflibfrom pathlib import PathSNAP_DIR = Path("snapshots")def fingerprint(text: str) -> str: # Strip meaningless noise (trailing spaces, final newlines) before fingerprinting normalized = "\n".join(line.rstrip() for line in text.strip().splitlines()) return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]def check(case_id: str, produced: str) -> bool: SNAP_DIR.mkdir(exist_ok=True) ref_path = SNAP_DIR / f"{case_id}.txt" # On the first run, save as the baseline and treat it as a pass if not ref_path.exists(): ref_path.write_text(produced, encoding="utf-8") print(f"[seed] {case_id}: created baseline snapshot") return True reference = ref_path.read_text(encoding="utf-8") if fingerprint(reference) == fingerprint(produced): print(f"[ok] {case_id}: output matches the baseline") return True # On a mismatch, print the diff and stop. A human decides if the change is welcome diff = "\n".join(difflib.unified_diff( reference.splitlines(), produced.splitlines(), fromfile="snapshot", tofile="current", lineterm="", )) print(f"[drift] {case_id}: output drifted from the baseline\n{diff}") return Falseif __name__ == "__main__": # Read {"case_id": "...", "output": "..."} from stdin payload = json.load(sys.stdin) ok = check(payload["case_id"], payload["output"]) sys.exit(0 if ok else 1)
The important part is that I normalize before fingerprinting. Turn meaningless noise such as trailing spaces and final newlines into diffs and the gate cries so often nobody reads it. Normalize too aggressively and you absorb changes you wanted to catch, like a shift in paragraph structure. I keep it to "strip trailing whitespace" and "strip leading and trailing whitespace," letting headings and paragraph breaks land in the fingerprint as-is. The boundary between a change that should ring and noise that shouldn't is set by how strong the normalization is.
This gate only tells you whether the output changed. It does not judge better or worse. I deliberately leave that to a person. The gate's job is to make you aware that something changed, not to guarantee correctness.
A procedure for accepting or rejecting a change once it rings
When a diff appears, I handle it in this order. Having a procedure means the version of me awake at midnight doesn't have to deliberate.
Stage
What to do
Decision criterion
1. Isolate
Decide whether the diff comes from the model or the input data
If it reproduces on the same input, it's the model
2. Evaluate
Judge by eye whether the change is an improvement, equivalent, or a regression
Does downstream formatting or gating break?
3. Accept
If the change is welcome, update the snapshot
Save the new output as the new baseline
4. Reject
If it's a regression, revert to the pinned model or shore up the prompt
Hold until downstream assumptions are met
Updating a snapshot is just overwriting the file and committing it to version control. Because "I accepted this change" stays in the history, you can later trace when the character changed. I make a point of writing both the old pinned model name and the new model name in that update commit message. For the version of me half a year later, that is the single best clue.
Production pitfalls I hit, and how I worked around them
While wiring this gate into real scheduled runs, I hit a few pitfalls. Here is what I noticed in production, along with the workarounds, so that anyone setting up the same mechanism can avoid tripping where I did.
Don't register so many snapshots that the gate becomes a formality
At first I greedily registered dozens of representative inputs. But natural variation in the input data set off diffs too, and before long nobody read them. I narrowed it to "inputs that genuinely hurt if downstream breaks" and went back to keeping 5 to 7. Holding down how often it rings is the shortest path to keeping the gate alive. Trying to watch everything, and ending up watching nothing, was the real failure.
Lower the temperature before you pin
Even with the model pinned, a high temperature makes the same input wobble run to run and trips the diff. I recommend pulling the temperature low for the inputs you snapshot. Build the foundation of reproducibility first, then lay the diff gate on top. If the foundation wobbles, the gate keeps detecting your own noise rather than a model change.
Route diffs to a failure log, not a notification
Early on I piped diffs into a notification channel, but as the volume grew the diffs worth seeing got buried. I switched to leaving diffs only in the exit code and a failure log, and reviewing the failed runs together the next morning. What I want from a notification is just "a run failed"; the contents belong somewhere I can read calmly. That one change alone stopped my diff handling from slipping to the back of the queue.
Re-shaping operations to welcome the default upgrade
Once these pieces are in place, your stance toward default model upgrades changes. They stop being something to brace against and avoid, and become something you absorb on your own schedule.
Aspect
No pin, following the default
Pinned, with a diff gate
How the upgrade feels
Changes one morning without your knowing
Changes on the day you raise it, by the amount you raised
Spotting the change
You notice only when downstream breaks
The gate tells you first, as a diff
Rolling back
Starts with finding the cause
Just revert to the pinned model
Adopting the gains
Passive, all or nothing
Verify on representative inputs, then roll out
When you adopt a new model, don't switch every agent at once. Raise just one agent that holds representative inputs from the pinned model to the new one, and run the diff gate once. If you confirm the change is welcome, follow the rest along in order. This "promote one first" approach is the same thinking as the steps for raising a CLI version. You keep the room for breakage down to one agent's worth, always.
As a solo developer, you can't add inspectors. That is exactly why it's worth investing in small mechanisms that make change visible. In my own draft generation for Dolice Labs, once I put this diff gate in place, the number of mornings I opened a draft feeling puzzled dropped noticeably.
Build with the assumption that the default will move, and the next day a model is swapped becomes not an event to panic over but simply "the day I raise it." Start by picking the single scheduled run you'd least want to break, and taking a snapshot of its output. Thank you for reading.
Share
Thank You for Reading
Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.