When the Default Model Changes Underneath You: Pinning and Diff-Gating Scheduled Runs

Antigravity 2.0 promoted Gemini 3.5 Flash to the default fast model. It is a welcome upgrade, but any scheduled run that leaned on the default starts producing subtly different output one morning. Here is how I pin the model explicitly, fingerprint the output, and gate drift, sized for a solo developer's pipeline.

One morning I opened a draft from a publishing job that should have run exactly as it always had, and noticed the headings were styled slightly differently. Nothing in the body was wrong. But the rhythm of the bullet lists and the placement of paragraph breaks read like someone else's writing compared with the week before.

The cause was neither my prompt nor my settings. Antigravity 2.0 had quietly promoted Gemini 3.5 Flash to the default fast model. On the benchmarks it is described as beating the previous-generation Pro, with 76.2% on Terminal-Bench 2.1 and 83.6% on MCP Atlas (source: Google Developers Blog). As raw capability, the change is welcome. But for a scheduled run that had been "trusting the default," it meant the character of my output had been swapped out somewhere I wasn't watching.

This article is about the design that keeps that swap from becoming an incident. It draws on my own pipeline as an indie developer, where I auto-generate drafts for several sites every day.

Why "trusting the default" is risky only for scheduled runs

When you work interactively, a default model change rarely causes trouble. You see the output, and if it looks off you resubmit on the spot. A human performs the final inspection.

Scheduled runs are different. They run overnight, and the result flows straight into the next stage with nobody inspecting it. So even a change that is "better on average" drifts silently if your downstream was built around the previous model's habits. If a formatting script waits for a particular heading marker, or a minimum-length gate assumes the old model's volume, you can end up with the perverse outcome of a gate failing precisely because quality went up.
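To make that failure mode concrete, here is a minimal sketch of two downstream checks of the kind just described. The "## " heading marker, the 400-character minimum, and both sample outputs are hypothetical values chosen only for illustration; they are not taken from my pipeline.

```python
"""Two downstream gates built around one model's habits (illustrative only)."""

HEADING_MARKER = "## "  # assumption: the formatter waits for this marker
MIN_CHARS = 400         # assumption: sized to the old model's typical volume


def passes_downstream(draft: str) -> bool:
    """Pass only if the draft has a marked heading and meets the length floor."""
    has_heading = any(line.startswith(HEADING_MARKER) for line in draft.splitlines())
    long_enough = len(draft) >= MIN_CHARS
    return has_heading and long_enough


# Old-style output: verbose, with "## " headings
old_style = "## Overview\n" + "Filler sentence about the topic. " * 15
# New-style output: tighter prose and a different heading convention
new_style = "**Overview**\n" + "A tighter sentence. " * 10

print(passes_downstream(old_style))  # True
print(passes_downstream(new_style))  # False: fails on style and length, not on quality
```

Neither check says anything about whether the draft is good. Both only encode what the previous model happened to produce.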

The problem, then, is not that the new model is bad. It is that the assumptions behind your output moved at a time you weren't aware of. What deserves attention is not which model is better, but the invisibility of the change.

Pin the model explicitly first

The first move is simple. Don't let the scheduled agent lean on the default; write out the model it uses.

{
  "agent": "nightly-draft",
  "model": "gemini-3.1-pro",
  "fallback": "gemini-3.5-flash",
  "temperature": 0.2,
  "note": "Model is explicit so a default change can't reach this agent. Fallback only when the pinned model is briefly unavailable."
}

With "model" declared, this agent keeps using the model you named even when the platform's default moves. The "fallback" entry says "drop here only when the pinned model can't be reached." That is different in meaning from following the default automatically. The former is a choice you made between two models you named; the latter is a choice the platform made for you.
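One way to enforce this is to have the runner refuse to start when a config has no explicit model. The sketch below assumes the config is the JSON shown above. `resolve_model` and the `is_available` callback are hypothetical names for illustration, not part of any Antigravity API; how you actually probe availability depends on your own setup.

```python
import json
from typing import Callable


def resolve_model(config: dict, is_available: Callable[[str], bool]) -> str:
    """Return the pinned model, or the declared fallback if the pin is unreachable."""
    pinned = config.get("model")
    if not pinned:
        # Never fall through to the platform default
        raise ValueError(f"agent {config.get('agent')!r} has no explicit model")
    if is_available(pinned):
        return pinned
    fallback = config.get("fallback")
    if fallback and is_available(fallback):
        return fallback
    raise RuntimeError(f"neither {pinned!r} nor fallback {fallback!r} is reachable")


config = json.loads("""
{"agent": "nightly-draft", "model": "gemini-3.1-pro",
 "fallback": "gemini-3.5-flash", "temperature": 0.2}
""")

# Hypothetical availability probes, hard-coded for the example
print(resolve_model(config, lambda name: True))                      # gemini-3.1-pro
print(resolve_model(config, lambda name: name != "gemini-3.1-pro"))  # gemini-3.5-flash
```

A missing "model" raises instead of running, so a config that silently leans on the default cannot reach the schedule.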

Pinning, though, only buys time. Leave it pinned forever and the old model's support eventually ends. The real purpose of pinning is to let you choose the timing of the change. For that, you need the diff gate that follows.

Fingerprint the output and stop only the changes you can't accept

Once pinning has bought you time, add a mechanism that mechanically catches what changed, and how, when you do move to a newer model. I use a small gate: save the output for a few representative inputs as snapshots, then compare normalized fingerprints on every later run.

#!/usr/bin/env python3
"""Snapshot output and detect model-swap changes as a diff gate."""
import difflib
import hashlib
import json
import sys
from pathlib import Path

SNAP_DIR = Path("snapshots")


def normalize(text: str) -> str:
    # Drop noise that carries no meaning: whitespace at the very start and end
    # of the text, and trailing whitespace on each line. Blank lines inside the
    # text are kept, so heading and paragraph structure still reach the hash.
    return "\n".join(line.rstrip() for line in text.strip().splitlines())


def fingerprint(text: str) -> str:
    # A short SHA-256 prefix is enough to compare a handful of snapshots
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()[:16]


def check(case_id: str, produced: str) -> bool:
    SNAP_DIR.mkdir(exist_ok=True)
    ref_path = SNAP_DIR / f"{case_id}.txt"

    # On the first run, save as the baseline and treat it as a pass
    if not ref_path.exists():
        ref_path.write_text(produced, encoding="utf-8")
        print(f"[seed]  {case_id}: created baseline snapshot")
        return True

    reference = ref_path.read_text(encoding="utf-8")
    if fingerprint(reference) == fingerprint(produced):
        print(f"[ok]    {case_id}: output matches the baseline")
        return True

    # On a mismatch, print the diff and stop. A human decides if the change is welcome.
    # Diff the normalized text so the report shows only what changed the fingerprint.
    diff = "\n".join(difflib.unified_diff(
        normalize(reference).splitlines(),
        normalize(produced).splitlines(),
        fromfile="snapshot",
        tofile="current",
        lineterm="",
    ))
    print(f"[drift] {case_id}: output drifted from the baseline\n{diff}")
    return False


if __name__ == "__main__":
    # Read {"case_id": "...", "output": "..."} from stdin
    payload = json.load(sys.stdin)
    ok = check(payload["case_id"], payload["output"])
    sys.exit(0 if ok else 1)

The important part is that I normalize before fingerprinting. If meaningless noise such as trailing spaces and final newlines shows up as a diff, the gate fires so often that nobody reads it. Normalize too aggressively and you absorb changes you wanted to catch, like a shift in paragraph structure. I keep it to two steps: strip trailing whitespace from each line, and strip leading and trailing whitespace from the text as a whole. Headings and paragraph breaks land in the fingerprint as-is. The strength of the normalization sets the boundary between a change that should ring and noise that shouldn't.
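A quick way to see where that boundary sits is to fingerprint a few hand-made variants. The sketch below repeats the gate's normalization so it runs on its own; the three sample strings are made up for illustration.

```python
import hashlib


def fingerprint(text: str) -> str:
    # Same normalization as the gate: trim both ends of the text,
    # then strip trailing whitespace from each line
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


baseline = "# Title\n\nFirst paragraph.\n\nSecond paragraph.\n"
trailing_noise = "# Title  \n\nFirst paragraph.\t\n\nSecond paragraph.\n\n\n"
merged_paragraphs = "# Title\n\nFirst paragraph.\nSecond paragraph.\n"

print(fingerprint(baseline) == fingerprint(trailing_noise))     # True: noise is absorbed
print(fingerprint(baseline) == fingerprint(merged_paragraphs))  # False: structure change rings
```

If you ever strengthen the normalization, for example by collapsing blank lines, the second comparison is the one that would flip to a match, which is exactly the kind of change I want to keep catching.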

This gate only tells you whether the output changed. It does not judge better or worse. I deliberately leave that to a person. The gate's job is to make you aware that something changed, not to guarantee correctness.

A procedure for accepting or rejecting a change once it rings

When a diff appears, I handle it in this order. Having a procedure means the version of me awake at midnight doesn't have to deliberate.

Stage	What to do	Decision criterion
1. Isolate	Decide whether the diff comes from the model or the input data	If it reproduces on the same input, it's the model
2. Evaluate	Judge by eye whether the change is an improvement, equivalent, or a regression	Does downstream formatting or gating break?
3. Accept	If the change is welcome, update the snapshot	Save the new output as the new baseline
4. Reject	If it's a regression, revert to the pinned model or shore up the prompt	Hold until downstream assumptions are met

Updating a snapshot is just overwriting the file and committing it to version control. Because "I accepted this change" stays in the history, you can later trace when the character changed. I make a point of writing both the old pinned model name and the new model name in that update commit message. For the version of me half a year later, that is the single best clue.
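The accept step can be scripted so the commit message always carries both model names. This is a sketch under assumptions: `accept_snapshot` is a hypothetical helper, the directory layout mirrors the gate's `snapshots/` folder, the case and model names are examples, and the commit itself is left to whatever version-control command you already use.

```python
import tempfile
from pathlib import Path


def accept_snapshot(snap_dir: Path, case_id: str, produced: str,
                    old_model: str, new_model: str) -> str:
    """Overwrite the baseline and return a commit message naming both models."""
    snap_dir.mkdir(parents=True, exist_ok=True)
    (snap_dir / f"{case_id}.txt").write_text(produced, encoding="utf-8")
    return f"snapshot({case_id}): accept output change, {old_model} -> {new_model}"


# Demonstration in a throwaway directory so nothing real is overwritten
with tempfile.TemporaryDirectory() as tmp:
    message = accept_snapshot(Path(tmp), "weekly-digest", "# New draft\n",
                              "gemini-3.1-pro", "gemini-3.5-flash")
    print(message)
```

Returning the message instead of committing keeps the helper free of side effects beyond the file write, so you can review the diff once more before it enters the history.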

Production pitfalls I hit, and how I worked around them

While wiring this gate into real scheduled runs, I hit a few pitfalls. Here is what I noticed in production, along with the workarounds, so that anyone setting up the same mechanism can avoid tripping where I did.

Don't register so many snapshots that the gate becomes a formality

At first I greedily registered dozens of representative inputs. But natural variation in the input data set off diffs too, and before long I stopped reading them. I narrowed the set to "inputs that genuinely hurt if downstream breaks" and cut it back to 5 to 7. Holding down how often the gate rings is the shortest path to keeping it alive. Trying to watch everything, and ending up watching nothing, was the real failure.

Lower the temperature before you pin

Even with the model pinned, a high temperature makes the same input wobble run to run and trips the diff. I recommend pulling the temperature low for the inputs you snapshot. Build the foundation of reproducibility first, then lay the diff gate on top. If the foundation wobbles, the gate keeps detecting your own noise rather than a model change.
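Before registering an input as a snapshot, it can help to check that the pinned agent gives the same fingerprint on repeated runs. The sketch below assumes a `generate` callable that wraps your own agent call; `is_stable`, `steady`, and `wobbly` are hypothetical names, and the two stand-in generators exist only to make the example runnable.

```python
import hashlib
from typing import Callable


def fingerprint(text: str) -> str:
    # Same normalization as the gate
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def is_stable(generate: Callable[[str], str], prompt: str, runs: int = 3) -> bool:
    """True if repeated runs on the same input yield a single fingerprint."""
    prints = {fingerprint(generate(prompt)) for _ in range(runs)}
    return len(prints) == 1


# Stand-in generators; replace with a call to your pinned agent
def steady(prompt: str) -> str:
    return f"# Draft\n\n{prompt}\n"


counter = iter(range(1000))


def wobbly(prompt: str) -> str:
    # Simulates run-to-run variation by changing the heading each call
    return f"# Draft {next(counter)}\n\n{prompt}\n"


print(is_stable(steady, "weekly digest"))  # True
print(is_stable(wobbly, "weekly digest"))  # False
```

An input that fails this check even at a low temperature is probably a poor snapshot candidate, because the gate would ring on it regardless of any model change.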

Route diffs to a failure log, not a notification

Early on I piped diffs into a notification channel, but as the volume grew the diffs worth seeing got buried. I switched to leaving diffs only in the exit code and a failure log, and reviewing the failed runs together the next morning. What I want from a notification is just "a run failed"; the contents belong somewhere I can read calmly. That one change alone stopped my diff handling from slipping to the back of the queue.
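The split between "a run failed" and "what the diff says" might look like the sketch below. The JSON Lines format, the file path, and the `record_failure` name are assumptions for illustration; any append-only file you can read the next morning would do.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def record_failure(log_path: Path, case_id: str, diff: str) -> str:
    """Append the full diff to a JSON Lines log and return a one-line notice."""
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "case_id": case_id,
        "diff": diff,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    # The notification carries only the fact of failure, not the contents
    return f"drift gate failed: {case_id} (details in {log_path})"


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "logs" / "drift.jsonl"
    notice = record_failure(log, "weekly-digest", "-## Overview\n+**Overview**")
    print(notice)
    print(len(log.read_text(encoding="utf-8").splitlines()))  # 1
```

Only the returned notice goes to the notification channel. Because `json.dumps` escapes the newlines inside the diff, each failure stays on one log line and the morning review can read the file entry by entry.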

Re-shaping operations to welcome the default upgrade

Once these pieces are in place, your stance toward default model upgrades changes. They stop being something to brace against and avoid, and become something you absorb on your own schedule.

Aspect	No pin, following the default	Pinned, with a diff gate
How the upgrade feels	Changes one morning without your knowing	Changes on the day you raise it, by the amount you raised
Spotting the change	You notice only when downstream breaks	The gate tells you first, as a diff
Rolling back	Starts with finding the cause	Just revert to the pinned model
Adopting the gains	Passive, all or nothing	Verify on representative inputs, then roll out

When you adopt a new model, don't switch every agent at once. Move just one agent, one that has representative inputs snapshotted, from the pinned model to the new one, and run the diff gate once. If you confirm the change is welcome, move the rest over in order. This "promote one first" approach follows the same thinking as a staged CLI version upgrade. The room for breakage stays limited to one agent's worth at all times.
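As a rough sketch, "promote one first" can be written as a loop that stops at the first rejection. `promote_one_first` is a hypothetical helper, the agent names other than nightly-draft are made up, and `change_accepted` stands in for running the diff gate plus your own review of the result.

```python
from typing import Callable, Dict


def promote_one_first(agents: Dict[str, str], canary: str, new_model: str,
                      change_accepted: Callable[[str, str], bool]) -> Dict[str, str]:
    """Return the agent-to-model mapping after a one-at-a-time promotion."""
    if canary not in agents:
        raise KeyError(f"unknown canary agent: {canary}")
    plan = dict(agents)
    # The canary goes first, then the remaining agents in their original order
    order = [canary] + [name for name in agents if name != canary]
    for name in order:
        if not change_accepted(name, new_model):
            break  # stop here; agents not yet promoted stay on their pinned model
        plan[name] = new_model
    return plan


agents = {
    "nightly-draft": "gemini-3.1-pro",
    "weekly-digest": "gemini-3.1-pro",
    "release-notes": "gemini-3.1-pro",
}

# Hypothetical review outcome: release-notes regresses on the new model
result = promote_one_first(agents, "nightly-draft", "gemini-3.5-flash",
                           lambda name, model: name != "release-notes")
print(result["nightly-draft"], result["weekly-digest"], result["release-notes"])
# gemini-3.5-flash gemini-3.5-flash gemini-3.1-pro
```

If the canary itself is rejected, the loop breaks on the first iteration and the mapping comes back unchanged.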

As a solo developer, you can't add inspectors. That is exactly why it's worth investing in small mechanisms that make change visible. In my own draft generation for Dolice Labs, once I put this diff gate in place, the number of mornings I opened a draft feeling puzzled dropped noticeably.

Build with the assumption that the default will move, and the next day a model is swapped becomes not an event to panic over but simply "the day I raise it." Start by picking the single scheduled run you'd least want to break, and taking a snapshot of its output. Thank you for reading.
