◈ Agents & Manager / 2026-07-05 / Advanced

Protecting Your Agent Stack's Known-Good State with a Single Lockfile — Change-Budget Design for an Era of Simultaneously Moving Parts

When the IDE build, CLI, model, and dependencies all move at once, you can no longer tell which one caused a regression. Here is a change-budget design that pins your known-good state to one lockfile, with working code and operational logs.

On July 1st, Antigravity shipped v2.2.1, v2.1.4, and v2.0.11 on the same day — a setup where you can pick between a stable line and a feature line. Around the same time, my local CLI moved from the Gemini CLI to the Antigravity CLI, and the default model quietly updated.

As an indie developer, I run daily article generation for four technical blogs, and I hand that work to agents. One morning, the tables in a generated article came out broken. I went to investigate, and my hands froze over the keyboard.

Between the last known-good run and that morning, too many parts had moved. The IDE build had bumped, the CLI had swapped to a different implementation, the model had refreshed, and two dependency packages had gone up. Which one broke it? There was nothing to bisect from.

This article is the change-budget design I rebuilt out of that morning's regret. It is not about flashy automation. It is about the quiet foundation that lets you find the cause of a regression quickly, in an environment where the number of moving parts keeps growing.

When moving parts multiply, you lose track of what broke

An agent development stack becomes, almost without notice, a collection of independently moving axes. At minimum, these four update on their own separate schedules.

Moving axis	Who updates it	Felt cadence
Antigravity IDE build	Official releases (stable and feature lines in parallel)	Weekly
Antigravity CLI	Updates tied to tool consolidation and migration	Irregular, bursty during migration
Default model	Platform-side swaps	Quiet, without notice
Dependency packages	Your own `npm update` or auto-updates	Anytime

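The trouble is that these matter in combination. The number of possible stack states is the product of the versions each axis can take, so it grows exponentially with the number of axes. Even if each axis moves just once in a week, your current state is already four changes away from the last known-good run, and any one of the four, or an interaction between them, could be responsible.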
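To put a rough number on that, here is a minimal sketch that counts the candidate explanations for a regression, assuming the cause may be a single moved axis or an interaction between several. The axis names follow the table above, with `deps` standing in for the dependency packages; the function name is illustrative.

```python
from itertools import combinations

# Axis names follow the table above; "deps" stands in for the dependency packages.
AXES = ["antigravity_ide", "antigravity_cli", "default_model", "deps"]


def suspect_sets(moved: list[str]) -> list[frozenset[str]]:
    """Every non-empty subset of the moved axes is a possible explanation."""
    return [
        frozenset(combo)
        for size in range(1, len(moved) + 1)
        for combo in combinations(moved, size)
    ]


if __name__ == "__main__":
    for n in range(1, len(AXES) + 1):
        count = len(suspect_sets(AXES[:n]))
        print(f"{n} axis/axes moved -> {count} candidate explanations")
```

With one axis moved there is a single candidate. With all four moved there are 2^4 - 1 = 15.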

When a regression hits, we try to reconstruct what changed. But when four axes of change are stacked on top of each other at once, the very starting point for a bisection is gone. There are too many suspects, and you fall back on intuition.

In unattended runs, this gets heavier still. If a pipeline you ran overnight produces broken output, and you never captured a snapshot of the environment at that moment, then by morning all you have is "the environment as it is now." The state at the moment it broke is no longer reproducible.

Pin the known-good state to a single lockfile

The first thing to do is write down the environment of the last run that worked, in one file. Take the same idea that an application's package-lock.json uses to pin dependencies, and extend it across the whole stack.

I keep one file named stack.known-good.yml at the root of the repository.

# stack.known-good.yml
# Records the last state that ran to completion correctly with "this combination"
known_good:
  captured_at: "2026-07-01T09:00+09:00"
  note: "Article generation across 4 sites finished with JA=EN parity and no broken tables"
  axes:
    antigravity_ide: "2.0.11"
    antigravity_cli: "0.4.2"
    default_model: "gemini-3.5-flash-2026-06"
    node: "22.14.0"
  deps:
    "@antigravity/sdk": "2.3.1"
    "yaml": "2.6.1"
# Change budget: the max number of axes you may move at once from the last known-good state
change_budget: 1

Three things matter here.

First, record each axis's version in a form a human can read. A hash alone will not help you remember "which model was that." Add a date and a short note.

Second, write the condition that declares "good" in the note. What counts as "good" differs per operation. For me it is "the article counts match across Japanese and English, and it finished without broken tables." Putting that criterion into words keeps your judgment steady the next time you update.

Third, keep the change budget in the same file. When the budget lives alongside the environment, it is visible at a glance during review.
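These three points can be checked mechanically before the file is trusted. The following is a minimal sketch, assuming the manifest has already been loaded into a plain dict (for example with `yaml.safe_load`); `validate_manifest` is a hypothetical helper, not part of any tool. It also flags version values that are not strings, because an unquoted YAML value such as `2.3` is read as a float and would then never compare equal to a version string.

```python
def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the file looks usable."""
    problems: list[str] = []
    known_good = manifest.get("known_good") or {}

    # Point 1: a date so a human can place the snapshot in time.
    if not str(known_good.get("captured_at") or "").strip():
        problems.append("known_good.captured_at is missing")

    # Point 2: the note must state what "good" meant for this run.
    if not str(known_good.get("note") or "").strip():
        problems.append("known_good.note is missing (write down the 'good' criterion)")

    axes = known_good.get("axes") or {}
    if not axes:
        problems.append("known_good.axes is empty")
    for axis, value in axes.items():
        if not isinstance(value, str) or not value.strip():
            problems.append(f"axis {axis!r} should be a quoted version string, got {value!r}")

    # Point 3: the budget lives in the same file and must be a non-negative integer.
    budget = manifest.get("change_budget")
    if isinstance(budget, bool) or not isinstance(budget, int) or budget < 0:
        problems.append("change_budget must be a non-negative integer")

    return problems
```

Running this in review or CI would be one way to keep the lockfile honest. Whether to fail hard on a missing note is a judgment call for your own operation.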

The change budget — move only one axis at a time

A change budget is the number of axes you allow yourself to move in a single update, measured from the last known-good state. I set it to 1 as a rule.

The reasoning is simple. If you only ever move one axis at a time, then when a regression appears the suspect is always narrowed to one. You do not even need to bisect. You just revert the axis you moved last.

You confirm this mechanically, right before an unattended run. A preflight reads the current environment, counts the axes that differ from the known-good state, and stops the run if the budget is exceeded.

#!/usr/bin/env python3
# preflight.py — run this right before an unattended job; halt if the change budget is exceeded
import subprocess
import sys
import yaml
 
 
def read_current_axes() -> dict:
    """Read each axis's version from the current stack. Axes that can't be read become 'unknown'."""
    def sh(cmd: str) -> str:
        try:
            out = subprocess.run(
                cmd, shell=True, capture_output=True, text=True, timeout=20
            )
            return out.stdout.strip() or "unknown"
        except Exception:
            return "unknown"
 
    return {
        "antigravity_ide": sh("agy --ide-version 2>/dev/null"),
        "antigravity_cli": sh("agy --version 2>/dev/null | head -1"),
        "default_model": sh("agy config get default_model 2>/dev/null"),
        "node": sh("node --version 2>/dev/null | tr -d 'v'"),
    }
 
 
def diff_axes(known: dict, current: dict) -> list:
    """Return the axes that disagree between known-good and current."""
    drifted = []
    for axis, good_value in known.items():
        now = current.get(axis, "unknown")
        if now != good_value:
            drifted.append((axis, good_value, now))
    return drifted
 
 
def main() -> int:
    with open("stack.known-good.yml", encoding="utf-8") as f:
        manifest = yaml.safe_load(f)
 
    budget = manifest.get("change_budget", 1)
    known = manifest["known_good"]["axes"]
    current = read_current_axes()
    drifted = diff_axes(known, current)
 
    if not drifted:
        print("preflight: matches known-good. Allowing the unattended run.")
        return 0
 
    print(f"preflight: {len(drifted)} axis/axes drifted from known-good (budget {budget})")
    for axis, good, now in drifted:
        print(f"  - {axis}: {good} -> {now}")
 
    if len(drifted) > budget:
        print("preflight: change budget exceeded. Aborting the unattended run.")
        print("  -> Verify axes one at a time, then update known-good once confirmed.")
        return 1
 
    print("preflight: within budget. Allowing the run, but recording this axis as today's suspect.")
    return 0
 
 
if __name__ == "__main__":
    sys.exit(main())

The goal of this script is not to prevent regressions. It is to keep the suspect list down to one when a regression does happen. You move exactly one axis within budget, run, and if the result is good you update known-good and move to the next. If it breaks, reverting that one axis takes you back to the known-good state.
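The "update known-good and move to the next" step can be sketched as a small function. This is an illustration under my own assumptions, not the author's tooling: `promote_known_good` is a hypothetical name, it works on the manifest as a plain dict, and writing the result back to `stack.known-good.yml` is left to the caller.

```python
import copy
from datetime import datetime, timezone


def promote_known_good(manifest: dict, current_axes: dict, note: str, now=None) -> dict:
    """Return a new manifest whose known-good state is the current, verified stack."""
    known = manifest["known_good"]["axes"]
    budget = manifest.get("change_budget", 1)

    # Never record an axis we could not read as part of a known-good state.
    unreadable = [a for a in known if current_axes.get(a, "unknown") == "unknown"]
    if unreadable:
        raise ValueError(f"cannot record unreadable axes as known-good: {unreadable}")

    # Promotion is only meaningful if the run stayed within the change budget.
    moved = [a for a in known if current_axes[a] != known[a]]
    if len(moved) > budget:
        raise ValueError(f"{len(moved)} axes moved, budget is {budget}: {moved}")

    updated = copy.deepcopy(manifest)  # leave the caller's dict untouched
    stamp = (now or datetime.now(timezone.utc)).isoformat(timespec="minutes")
    updated["known_good"].update(
        axes={a: current_axes[a] for a in known},
        captured_at=stamp,
        note=note,
    )
    return updated
```

Call it only after you have confirmed the run against the criterion in `note`. The function cannot judge "good" for you; it only refuses promotions that would blur the suspect list.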

Retrieval commands like agy --ide-version may be named differently in your version. Axes that cannot be read are treated as unknown and counted as drifted. Rather than optimistically assuming an unreadable axis is "probably the same," it is safer to put it explicitly onto the suspect list.
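The preflight above compares only `known_good.axes`, so the `deps` section of the lockfile is never consulted. One way to fold dependencies in as a single axis is sketched below. It assumes an npm `package-lock.json` with `lockfileVersion` 2 or 3, where installed packages appear under a `packages` map keyed by `node_modules/<name>`. The function names are hypothetical, and treating all dependencies as one axis is my reading of the table of moving axes, not something the script does today.

```python
def installed_versions(package_lock: dict, names) -> dict:
    """Look up installed versions in a parsed package-lock.json (lockfileVersion 2 or 3)."""
    packages = package_lock.get("packages", {})
    return {
        name: packages.get(f"node_modules/{name}", {}).get("version", "unknown")
        for name in names
    }


def drifted_deps(known_deps: dict, package_lock: dict) -> list[str]:
    """Return the pinned packages whose installed version differs from known-good."""
    current = installed_versions(package_lock, known_deps)
    return sorted(name for name, good in known_deps.items() if current[name] != good)


if __name__ == "__main__":
    known = {"@antigravity/sdk": "2.3.1", "yaml": "2.6.1"}
    # A hand-written stand-in for json.load(open("package-lock.json")).
    lock = {"packages": {"node_modules/yaml": {"version": "2.6.1"}}}
    changed = drifted_deps(known, lock)
    # Count the whole dependency set as one drifted axis if anything changed.
    print("deps axis drifted:", bool(changed), changed)
```

A package missing from the lockfile is reported as `unknown` and therefore as drifted, which matches how the preflight treats unreadable axes.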

The bisection procedure when a regression appears

If you keep to the budget, bisection is unnecessary. But there are moments when several axes unavoidably move at once — a migration, for instance. July's CLI migration was exactly that. For those moments, decide the isolation procedure in advance.

Step	Action	Purpose
1	Enumerate the diff between all current axes and known-good	Fix the set of suspects
2	Revert half of the drifted axes back to known-good	Split the search space
3	Run once with the smallest reproducing case	Judge good vs. regression
4	If good, the culprit is in the half you reverted; if still regressed, it is in the half left drifted	Halve the range
5	Repeat 2–4 until one axis remains	Confirm the culprit axis

The key is the "smallest reproducing case" in step 3. If you judge by running all four sites, each trial takes too long, and the bisection will not finish in a practical number of rounds. I keep a separate reproducing case that takes about thirty seconds — "generate a single article and look at whether the tables break." Isolating the cause and running the heavy production job are best kept as separate concerns.
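The procedure can be sketched as a short loop. This is a minimal illustration that assumes exactly one drifted axis causes the regression; interactions between axes would need a different search. `bisect_axes` and the `is_good_with_reverted` callback are hypothetical names: the callback is expected to revert the given axes to their known-good versions, run the smallest reproducing case, and return `True` when the output is good.

```python
from typing import Callable, Sequence


def bisect_axes(
    drifted: Sequence[str],
    is_good_with_reverted: Callable[[frozenset], bool],
) -> str:
    """Narrow a list of drifted axes down to the single culprit."""
    suspects = list(drifted)
    if not suspects:
        raise ValueError("nothing drifted, so there is nothing to bisect")

    while len(suspects) > 1:
        middle = len(suspects) // 2
        reverted, kept = suspects[:middle], suspects[middle:]
        if is_good_with_reverted(frozenset(reverted)):
            # Reverting this half fixed the output, so the culprit is inside it.
            suspects = reverted
        else:
            # Still broken with this half reverted, so the culprit is in the other half.
            suspects = kept
    return suspects[0]


if __name__ == "__main__":
    drifted = ["antigravity_ide", "antigravity_cli", "default_model", "deps"]
    culprit = "default_model"  # pretend answer, used only to simulate the trial run
    found = bisect_axes(drifted, lambda reverted: culprit in reverted)
    print("culprit axis:", found)
```

With four drifted axes this takes two trial runs, which is why a thirty-second reproducing case matters more than the loop itself.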

To make the revert reliable, note the pinning method for each axis too: the IDE build is pinned by installing an explicit stable version, the CLI by a pinned version, the model by setting it explicitly in config, and dependencies by the lockfile. If there is an axis you "cannot revert," the bisection breaks down there, so whether a revert path exists is precisely what you should check beforehand.
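That check can be written down next to the lockfile. The sketch below is an assumption-laden illustration: the `REVERT_PATHS` map and the function name are hypothetical, and the descriptions simply restate the pinning methods named in the paragraph above.

```python
# Hypothetical map from axis name to how you would pin it back to known-good.
REVERT_PATHS = {
    "antigravity_ide": "install the explicit stable version",
    "antigravity_cli": "install the pinned CLI version",
    "default_model": "set the model explicitly in config",
    "deps": "restore the package lockfile",
}


def axes_without_revert_path(axes, revert_paths: dict) -> list[str]:
    """Return the axes that have no documented way back to known-good."""
    return sorted(a for a in axes if not revert_paths.get(a, "").strip())


if __name__ == "__main__":
    tracked = ["antigravity_ide", "antigravity_cli", "default_model", "node", "deps"]
    missing = axes_without_revert_path(tracked, REVERT_PATHS)
    print("axes with no revert path:", missing)
```

With the axes tracked in `stack.known-good.yml`, this flags `node`, since no pinning method is named for it above. Any axis on that list should get a revert path before it is allowed on the budget.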

Wiring it into unattended runs — let the preflight stop it

Put the preflight at the very start of the scheduled run. If the budget is exceeded, exit without running the main job, and notify only on failure. In unattended operation, "silently producing broken output" is the worst outcome; "we skipped today because the budget was exceeded" is far easier to handle.

#!/usr/bin/env bash
# nightly.sh — the entrance to an unattended pipeline
set -euo pipefail
 
if ! python3 preflight.py; then
  echo "Skipped today's unattended run because the change budget was exceeded." >&2
  # Hook your failure-only notification in here (stay silent when healthy)
  exit 0
fi
 
# Run the main job only when the preflight passes
python3 generate_articles.py

As a result, the state of an unattended run resolves into three cases.

State	Preflight verdict	What you have by morning
Healthy	No diff, or one axis within budget	The artifacts plus a record of the suspect axis you moved
Skipped	Aborted on budget overrun	A list of drifted axes and a notification
Needs review	Passed, but the artifacts are wrong	The one axis moved just before — a near-certain culprit

The "needs review" row is the biggest payoff of this design. As long as you kept to the budget, even when something goes wrong the suspect is narrowed to the single axis you moved last. The starting point for the investigation becomes a log, not a hunch.
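The three cases in the table can be expressed as a small classifier for the morning report. This is a sketch under my own naming: `RunState` and `classify_run` are hypothetical, and `artifacts_ok` stands for whatever check implements your own "good" criterion.

```python
from enum import Enum
from typing import Optional


class RunState(Enum):
    HEALTHY = "healthy"
    SKIPPED = "skipped"
    NEEDS_REVIEW = "needs review"


def classify_run(drifted_count: int, budget: int, artifacts_ok: Optional[bool]) -> RunState:
    """Map a night's preflight result and artifact check onto the three states."""
    if drifted_count > budget:
        # The preflight aborted, so the main job never ran and there are no artifacts.
        return RunState.SKIPPED
    if artifacts_ok:
        return RunState.HEALTHY
    # Within budget but the output is wrong: the axis moved last is the prime suspect.
    return RunState.NEEDS_REVIEW


if __name__ == "__main__":
    print(classify_run(drifted_count=0, budget=1, artifacts_ok=True).value)
    print(classify_run(drifted_count=3, budget=1, artifacts_ok=None).value)
    print(classify_run(drifted_count=1, budget=1, artifacts_ok=False).value)
```

Storing the state together with the drifted axes printed by the preflight gives you the log that the investigation starts from.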

What running it taught me

Before I put this foundation in place, diagnosing a regression often burned half a day. Too many axes had moved, so all I could do was revert them one by one and try again.

After narrowing the budget to one and updating known-good every time, diagnosis became mostly just "revert the axis you moved last." In my own records, isolation time dropped from the order of tens of minutes to the order of a few minutes. There is nothing flashy about it, but the sense of calm each morning changed.

There is a cost, though: updates get slower. Because you can only move one axis at a time, catching up on all four takes several days, with a good-state confirmation between each step. That is squarely at odds with the urge to try a new feature right away.

So I keep the experimental environment separate from the production unattended one. In the experimental environment I try the latest without worrying about budget, and if it is good I promote it into production "one axis at a time." Exploring the moving parts and keeping revenue-critical unattended operation stable are simply different goals.
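Promotion "one axis at a time" can be planned up front. The sketch below is illustrative only: `promotion_steps` is a hypothetical helper, and the order of promotion (the order in which axes appear in the production lockfile) is an arbitrary choice of mine. It turns the gap between the production known-good state and the experimental environment into a sequence of single-axis steps.

```python
def promotion_steps(production: dict[str, str], experimental: dict[str, str]) -> list[dict]:
    """Plan single-axis moves that take production from known-good to the experimental state."""
    steps = []
    state = dict(production)
    for axis, current in production.items():
        target = experimental.get(axis, current)
        if target == current:
            continue  # this axis already matches, nothing to promote
        state = {**state, axis: target}  # a fresh dict per step, differing in one axis
        steps.append({"axis": axis, "from": current, "to": target, "state_after": state})
    return steps


if __name__ == "__main__":
    production = {"antigravity_ide": "2.0.11", "antigravity_cli": "0.4.2", "node": "22.14.0"}
    experimental = {"antigravity_ide": "2.2.1", "antigravity_cli": "0.4.2", "node": "22.14.0"}
    for step in promotion_steps(production, experimental):
        print(f"move {step['axis']}: {step['from']} -> {step['to']}, then confirm and update known-good")
```

Each step corresponds to one preflight-approved run followed by a known-good update, so a plan with three steps means at least three confirmed runs.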

Another lesson was to only put "revertible axes" on the budget. An axis with no revert path breaks the premise of bisection. Keeping things in a form you can reliably revert — version pins and lockfiles — is what supports the whole idea of a change budget.

In an era where the moving parts shift all at once, you cannot freeze everything. What you can do is order the changes one at a time, and always leave a road back to the last known-good state. It is quiet preparation, but the wider you let agents run unattended, the more the presence or absence of this foundation begins to tell.

I hope this helps with your own setup. Thank you for reading.
