Instruction Drift in Scheduled Agents — A Three-Layer Design for Keeping Definitions, Docs, and Reality Aligned
Scheduled agents keep logging success even after their instructions diverge from reality. Here is the three-layer drift-detection design — definition, documentation, reality — I built after silent failures in my own operations.
I was reading back through the logs of my overnight scheduled runs when something caught my eye. As an indie developer I have Antigravity agents run recurring jobs against several of my apps — dependency updates, crash-report triage, that sort of thing — and one task whose runbook said "runs twice a week" was, according to the scheduler, running every single day.
I had changed the frequency myself a few weeks earlier. I updated the scheduler. I forgot to update the runbook. And because the task printed a cheerful success log every night, nothing ever prompted me to look.
That particular mismatch was harmless. But once I started digging, I found two uglier ones. A data file the runbook referenced had been silently reading as empty ever since a folder rename. And one "zombie task" was still running on schedule even though its instruction document had been deleted in a reorganization.
A scheduled agent does not stop when its instructions diverge from reality. Now that Antigravity 2.0 has made scheduled and background agents an everyday tool, I have come to treat this "instruction drift" as a design problem that needs detection machinery — not a writing problem that careful documentation will solve.
Instructions start aging the moment you write them
After running these pipelines for a while, I noticed that drift arrives through essentially three paths.
1. Stranded definition changes. You change a schedule's frequency, time, or enabled state, but the runbook or AGENTS.md keeps describing the old behavior. The point of the change was the behavior itself, so syncing the docs gets deferred, and in my experience deferred syncs almost never happen.
2. Moved or renamed references. A data file or sub-document the runbook reads gets relocated during a refactor. The scary part is that many shell-based procedures do not stop on an empty read. In a pipeline, the shell reports the exit status of the last command unless an option such as bash's pipefail is set, so a failed cat feeding a later stage still lets the whole job "succeed." And agents lean toward completing the work with whatever information is at hand rather than reporting that something is missing.
3. Vanished documents. You consolidate instruction documents and an old schedule definition survives on its own — a zombie task. The mirror image also appears: orphan documents that no definition references anymore. Each instance is small, but uninventoried they accumulate until you can no longer say how trustworthy the operation is.
What all three share is this: a success log proves nothing about alignment. An agent does its best with the situation it is given, so it produces plausible output even with missing references and stale instructions. "It's running, so it's fine" is exactly the assumption this problem exploits.
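The second path is easy to reproduce on your own machine. The following is a minimal sketch, assuming a POSIX `sh` with `cat` and `wc` on the PATH; the file path and the `run_pipeline` helper are hypothetical stand-ins for a runbook step whose reference file was renamed away.

```python
import shlex
import subprocess

# Hypothetical path standing in for a reference file that no longer exists.
MISSING = "data/renamed-away/crash-thresholds.yaml"


def run_pipeline(path: str) -> subprocess.CompletedProcess:
    """Run `cat <path> | wc -l` the way a naive runbook step might."""
    return subprocess.run(
        f"cat {shlex.quote(path)} | wc -l",
        shell=True,           # uses /bin/sh on POSIX systems
        capture_output=True,
        text=True,
    )


result = run_pipeline(MISSING)

# In a plain sh pipeline the exit status is that of the last command (wc),
# so the failure of cat never reaches the scheduler.
print("exit status:", result.returncode)
print("lines read: ", result.stdout.strip())
print("stderr:     ", result.stderr.strip())
```

The only trace of the problem is the message on stderr, which a scheduled run typically discards. The exit status, the one signal the scheduler looks at, reports success.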
Definition, documentation, reality — a three-layer model
To design countermeasures, I split the operation into three layers.
Layer 1: Definition — what the scheduler actually runs, when, and whether it is enabled
Layer 2: Documentation — what AGENTS.md, runbooks, and procedures claim
Layer 3: Reality — what execution logs and artifacts show
Integrity checking decomposes into the three pairwise comparisons: definition versus documentation, documentation versus reality, definition versus reality. My "twice a week versus daily" was a definition–documentation mismatch; the empty reads were documentation–reality; the zombie task was a missing link between definition and documentation.
Before comparing anything, one design decision matters: decide which layer is canonical. I treat the definition (layer 1) as truth and documentation as subordinate. The scheduler is what actually runs; it does not lie. Without a declared source of truth, every discovered mismatch triggers a debate about which side to fix.
Then make the canonical layer machine-readable in one place. If frequencies exist only inside a scheduler's admin UI, cross-checking cannot be automated, so I keep a tasks.yaml in the repository as the single source of truth.
```yaml
# tasks.yaml: the canonical record of scheduled runs (frequencies live here and nowhere else)
tasks:
  - name: nightly-dependency-update
    schedule: "0 3 * * *"      # daily at 03:00
    doc: docs/agents/nightly-dependency-update.md
    refs:
      - data/allowlist.json
      - docs/shared/update-policy.md
  - name: crash-triage
    schedule: "0 6 * * 1,4"    # Mon & Thu at 06:00
    doc: docs/agents/crash-triage.md
    refs:
      - data/crash-thresholds.yaml
```
The important fields are doc and refs. Recording which document a task obeys and which files it reads lets the checker script walk all three layers from a single starting point.
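Because every later check starts from this file, it may be worth validating the entries themselves before trusting them. Here is a minimal sketch, assuming the field names from the tasks.yaml shown above; `validate_task` and `paths_to_check` are hypothetical helpers, and the data is written as plain Python so the example runs without a YAML parser.

```python
# Mirrors the two tasks.yaml entries as plain Python data.
TASKS = [
    {
        "name": "nightly-dependency-update",
        "schedule": "0 3 * * *",
        "doc": "docs/agents/nightly-dependency-update.md",
        "refs": ["data/allowlist.json", "docs/shared/update-policy.md"],
    },
    {
        "name": "crash-triage",
        "schedule": "0 6 * * 1,4",
        "doc": "docs/agents/crash-triage.md",
        "refs": ["data/crash-thresholds.yaml"],
    },
]

REQUIRED = ("name", "schedule", "doc")


def validate_task(task: dict) -> list:
    """Return a list of human-readable problems with one task entry."""
    errors = []
    label = task.get("name", "<unnamed>")
    for key in REQUIRED:
        if not task.get(key):
            errors.append(f"{label}: missing required field '{key}'")
    schedule = task.get("schedule", "")
    if schedule and len(schedule.split()) != 5:
        errors.append(f"{label}: schedule is not a 5-field cron expression")
    if not isinstance(task.get("refs", []), list):
        errors.append(f"{label}: refs must be a list of paths")
    return errors


def paths_to_check(task: dict) -> list:
    """Every file the checker must visit for this task: its doc, then its refs."""
    return [task["doc"], *task.get("refs", [])]


for task in TASKS:
    print(task["name"], validate_task(task) or "ok", paths_to_check(task))
```

`paths_to_check` is the "single starting point" in practice: one entry yields every file the documentation and reference checks need to visit.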
The drift that took me longest to discover was the empty reference read. When I reorganized a data folder and renamed a directory, every cat path inside the runbook went stale — yet the nightly job never stopped, and the agent kept producing plausible-looking reports without the reference data. It took me two weeks to notice the output had gone thin.
Writing "abort and report if the output is empty" into the instructions is not enough, because whether the agent honors that sentence is itself probabilistic. Instead, route every reference read through a wrapper that records failures.
```bash
#!/usr/bin/env bash
# require_ref.sh: log missing or empty reference files to a drift log
DRIFT_LOG="${DRIFT_LOG:-$HOME/.agent-drift.log}"

require_ref() {
  local path="$1"
  if [ ! -s "$path" ]; then
    printf '%s missing-or-empty %s\n' "$(date -Is)" "$path" >> "$DRIFT_LOG"
    echo "WARNING missing reference: $path" >&2
    return 1
  fi
  cat "$path"
}

# In runbooks, always go through this instead of bare cat
require_ref "data/crash-thresholds.yaml" || exit 1
```
I use -s because it rejects both "does not exist" and "exists but empty" in one test. And the failure is appended to a log file, not just printed to stderr: stderr from a scheduled run scrolls away forever, while the log file is picked up by the next morning's check.
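If some runbook steps are Python rather than shell, the same idea carries over. The following is a sketch of a Python counterpart under my own assumptions: the `MissingReference` exception and this `require_ref` signature are hypothetical, not part of the setup described here, and the check uses `is_file()` plus a size test, which is slightly stricter than `-s` because it also rejects directories.

```python
import datetime
import pathlib
import tempfile


class MissingReference(Exception):
    """Raised when a reference file is absent or empty."""


def require_ref(path, drift_log):
    """Return the file's text, or log the failure and raise."""
    p = pathlib.Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        stamp = datetime.datetime.now().astimezone().isoformat(timespec="seconds")
        # Append to the drift log so the failure outlives this run's stderr.
        with open(drift_log, "a", encoding="utf-8") as log:
            log.write(f"{stamp} missing-or-empty {path}\n")
        raise MissingReference(path)
    return p.read_text(encoding="utf-8")


# Demo in a scratch directory: one good file, one empty file, one missing file.
with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    log = tmp / "drift.log"
    good = tmp / "thresholds.yaml"
    good.write_text("max_crashes: 5\n", encoding="utf-8")
    empty = tmp / "empty.yaml"
    empty.touch()

    content = require_ref(good, log)
    failures = []
    for candidate in (empty, tmp / "gone.yaml"):
        try:
            require_ref(candidate, log)
        except MissingReference as exc:
            failures.append(pathlib.Path(str(exc)).name)
    logged = log.read_text(encoding="utf-8").splitlines()

print(failures)
print(len(logged), "lines in drift log")
```

Raising an exception plays the role of `|| exit 1`: the step cannot continue with missing data unless the caller deliberately catches the error.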
Automating the definition–documentation cross-check
The core of the three-layer comparison is a single Python script that starts from tasks.yaml and verifies four things: the task's document exists (zombie detection), every reference exists and is non-empty, frequency wording in the document does not contradict the cron expression, and no orphan documents exist outside the definitions.
```python
#!/usr/bin/env python3
"""drift_check.py: detect drift between tasks.yaml, documents, and references"""
import pathlib
import sys

import yaml

ROOT = pathlib.Path(__file__).resolve().parent

# Frequency vocabulary allowed in documents, with its consistency rule against cron
FREQ_RULES = {
    "daily": lambda c: c.split()[4] == "*",
    "weekly": lambda c: c.split()[4] != "*" and "," not in c.split()[4],
    "twice a week": lambda c: c.split()[4].count(",") == 1,
}


def load_tasks():
    with open(ROOT / "tasks.yaml", encoding="utf-8") as f:
        return yaml.safe_load(f)["tasks"]


def check(tasks):
    problems = []
    all_docs = {p.resolve() for p in (ROOT / "docs" / "agents").glob("*.md")}
    used_docs = set()
    for t in tasks:
        doc = (ROOT / t["doc"]).resolve()
        used_docs.add(doc)
        if not doc.is_file():
            problems.append(f"[zombie] {t['name']}: document missing -> {t['doc']}")
            continue
        text = doc.read_text(encoding="utf-8").lower()
        for word, ok in FREQ_RULES.items():
            if word in text and not ok(t["schedule"]):
                problems.append(
                    f"[freq] {t['name']}: doc says '{word}' / definition is {t['schedule']}")
        for ref in t.get("refs", []):
            rp = ROOT / ref
            if not rp.is_file() or rp.stat().st_size == 0:
                problems.append(f"[ref] {t['name']}: reference missing or empty -> {ref}")
    for orphan in sorted(all_docs - used_docs):
        problems.append(f"[orphan] document not referenced by any definition: {orphan.name}")
    return problems


if __name__ == "__main__":
    found = check(load_tasks())
    for line in found:
        print(line)
    sys.exit(1 if found else 0)
```
The frequency matching is plain vocabulary lookup, and that may feel underpowered. It is deliberate. Trying to fully interpret free-form prose makes the checker complex enough that the checker itself stops being maintained.
Instead I pair it with a writing rule: if a document mentions frequency at all, it must use only the vocabulary in FREQ_RULES — and preferably it should not mention frequency. The canonical frequency lives in tasks.yaml, so the document can simply say "see the definition for the schedule." Checkability is far cheaper to buy by constraining the writers' vocabulary than by making the checker smarter.
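To see the vocabulary rules at work, here is a small worked example that replays my original "twice a week versus daily" mismatch. It copies FREQ_RULES from drift_check.py; the `contradictions` helper and the sample sentences are hypothetical, added only for illustration.

```python
# Same rules as in drift_check.py: each word maps to a test on the cron string.
FREQ_RULES = {
    "daily": lambda c: c.split()[4] == "*",
    "weekly": lambda c: c.split()[4] != "*" and "," not in c.split()[4],
    "twice a week": lambda c: c.split()[4].count(",") == 1,
}


def contradictions(doc_text: str, cron: str) -> list:
    """Frequency words that appear in the document but contradict the cron expression."""
    text = doc_text.lower()
    return [word for word, ok in FREQ_RULES.items() if word in text and not ok(cron)]


stale_doc = "This task runs twice a week and files a summary."

# Runbook says twice a week, scheduler runs daily: flagged.
print(contradictions(stale_doc, "0 3 * * *"))        # ['twice a week']

# Same wording against a Mon & Thu schedule: consistent.
print(contradictions(stale_doc, "0 6 * * 1,4"))      # []

# A document that defers to the definition can never drift on frequency.
print(contradictions("See the definition for the schedule.", "0 3 * * *"))  # []
```

Note that the match is a plain substring test, so a word like "biweekly" would trigger the "weekly" rule. That is one more reason to keep the permitted vocabulary short and to prefer documents that do not state a frequency at all.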
Cross-checking the reality layer — artifact freshness
Definition-versus-reality drift, where a task is supposedly active but its artifacts have gone stale, can be read straight off the modification times of the files in the artifact directory.
```bash
# List artifact freshness for every task that the definition says is active
python3 - << 'EOF'
import pathlib, time, yaml

tasks = yaml.safe_load(open("tasks.yaml", encoding="utf-8"))["tasks"]
now = time.time()
for t in tasks:
    out = pathlib.Path("outputs") / t["name"]
    files = sorted(out.glob("*"), key=lambda p: p.stat().st_mtime, reverse=True)
    if not files:
        print(f"[stale] {t['name']}: no artifacts at all")
        continue
    age_h = (now - files[0].stat().st_mtime) / 3600
    flag = "[stale]" if age_h > 48 else "[ok]   "
    print(f"{flag} {t['name']}: newest artifact {age_h:.0f}h old")
EOF
```
The 48-hour threshold is the value from my own operation: for a daily task it means two consecutive missed runs count as an anomaly. If you mix daily and weekly tasks, extend tasks.yaml with a max_age_hours field per task so the threshold comes from the definition too, keeping the check consistent.
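A per-task threshold could look like the sketch below. The `freshness` function, the 120-hour value for crash-triage, and the fixed timestamps are all assumptions chosen for illustration; the only names taken from the text are max_age_hours and the 48-hour default.

```python
DEFAULT_MAX_AGE_HOURS = 48  # the default from my own operation

# Task entries as they would look after loading an extended tasks.yaml.
TASKS = [
    # No max_age_hours: falls back to the 48-hour default.
    {"name": "nightly-dependency-update", "schedule": "0 3 * * *"},
    # Hypothetical value: Mon & Thu runs are up to 96h apart, plus one day of slack.
    {"name": "crash-triage", "schedule": "0 6 * * 1,4", "max_age_hours": 120},
]


def freshness(task: dict, newest_mtime, now: float) -> str:
    """Format one freshness line; newest_mtime is None when no artifact exists."""
    limit = task.get("max_age_hours", DEFAULT_MAX_AGE_HOURS)
    if newest_mtime is None:
        return f"[stale] {task['name']}: no artifacts at all"
    age_h = (now - newest_mtime) / 3600
    flag = "[stale]" if age_h > limit else "[ok]   "
    return f"{flag} {task['name']}: newest artifact {age_h:.0f}h old (limit {limit}h)"


now = 1_000_000.0             # fixed timestamp so the example is deterministic
sixty_hours_ago = now - 60 * 3600

for task in TASKS:
    print(freshness(task, sixty_hours_ago, now))
```

The same 60-hour-old artifact is an anomaly for the daily task and perfectly normal for the twice-weekly one, which is the reason a single global threshold stops working once schedules are mixed.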
The weekly drift review — machines enumerate, humans read the diff
Once the scripts exist, the operating routine is simple. Run drift_check.py and the freshness check every morning from the scheduler itself, appending results to a dated history file. A human looks at it once a week — and only at what changed since last week.
```bash
# Accumulate the morning check into history (this too is a scheduled run)
{ date -I; python3 drift_check.py; echo "---"; } >> drift-history.log
```
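The "only what changed since last week" step can be sketched as a set difference between two dated runs. This assumes the history format produced by the command above (a date line, then findings, then "---"); `parse_history`, `weekly_diff`, and the sample log contents are hypothetical and invented for the example.

```python
def parse_history(text: str) -> dict:
    """Map each run date to the set of findings recorded under it."""
    runs = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "---":
            current = None                 # end of one run's block
        elif current is None:
            current = line                 # first line of a block is the date
            runs.setdefault(current, set())
        else:
            runs[current].add(line)
    return runs


def weekly_diff(runs: dict, old_date: str, new_date: str):
    """Return (new findings, resolved findings) between two runs."""
    old, new = runs.get(old_date, set()), runs.get(new_date, set())
    return sorted(new - old), sorted(old - new)


# Invented sample history, one week apart.
SAMPLE = """\
2025-01-06
[ref] crash-triage: reference missing or empty -> data/crash-thresholds.yaml
---
2025-01-13
[ref] crash-triage: reference missing or empty -> data/crash-thresholds.yaml
[orphan] document not referenced by any definition: old-task.md
---
"""

runs = parse_history(SAMPLE)
added, resolved = weekly_diff(runs, "2025-01-06", "2025-01-13")
print("new:     ", added)
print("resolved:", resolved)
```

A finding that persists from week to week stays out of the "new" list, so the reviewer sees only the orphan document that appeared since the last review.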
When new lines appear, the fix direction is never in question: the definition is canonical, so documentation is amended to match it. When it turns out the definition itself was wrong, the definition fix and the documentation fix go into the same commit — split them across two commits and a scheduled run can land in between, executing against a half-corrected state.
In my operation, the first week after introducing this surfaced seven mismatches: leftovers from renames months earlier, reference paths that no longer existed, all quietly accumulated. Since then it has settled at zero or one finding per week, and the review takes under ten minutes. The biggest gain was not the numbers, though — it was no longer carrying the vague unease of wondering whether the runbooks could be trusted at all.
Start by finding one mismatch tonight
Before building any of this, try one thing before tonight's scheduled runs kick off. Put your scheduler's task list next to your runbooks and cross-check just one attribute — frequency, time, or a reference path. If your operation has been running for any length of time, you will very likely find a discrepancy.
Once you have found that first one, consolidating the source of truth into a tasks.yaml is the shortest path, even though it looks like a detour. A check cannot be automated without a canonical record, and a check that is not automated will not keep happening — that has been the clearest lesson of these past months.
If you also hand your overnight work to agents, I hope this saves you a few of the silent failures it cost me to learn.