When Background Agents Run Twice — Stopping Double Execution with Leases and Fencing Tokens
The same scheduled job fires from two machines at once and they overwrite each other's output. Here is how to stop that failure mode at the root in Antigravity 2.0 background agents, using leases and fencing tokens, with working code.
One morning, an artifact that should have been freshly generated was half-overwritten with stale content. The logs told the story: the same scheduled job had fired from two machines at nearly the same minute, and one started writing before the other had finished.
As an indie developer at Dolice Labs, I run several blog operations on background agents. The intent was redundancy — while one machine sleeps, another picks up the work. But in the instant both were awake, both grabbed the same job. This is not a "forgot to take the lock" story. It happens even when you do take the lock. Let me walk through why, and how to stop it.
Why mutual exclusion alone does not stop double execution
The intuition is simple: take a lock at the start of the job, release it at the end. In the world of background agents, that premise quietly collapses.
A process can hold a lock and then stall for a long time — a long garbage-collection pause, an OS wake from sleep, a slow model call. Meanwhile the lock's TTL expires and another machine legitimately acquires it. The first machine wakes up still believing it owns the lock and begins writing. At that moment there are two lock holders in the world.
So the problem is not acquiring exclusivity; it is being unable to guarantee continued possession. Miss this distinction and every patch — longer TTLs, more heartbeats — only lowers the probability instead of eliminating it.
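The stall scenario can be traced with a toy model. The sketch below is not part of the production scripts: `TtlLock`, `simulate_stall`, and the timestamps are hypothetical names and values chosen for illustration, and time is passed in by hand rather than read from a clock.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TtlLock:
    """A plain lock with an expiry time and no fencing token."""
    ttl: int
    holder: Optional[str] = None
    expires_at: int = 0

    def acquire(self, who: str, now: int) -> bool:
        # Free, or the previous holder's TTL has run out: hand it over.
        if self.holder is None or now >= self.expires_at:
            self.holder, self.expires_at = who, now + self.ttl
            return True
        return False


def simulate_stall() -> list:
    """Return the order of writes when machine A stalls past the TTL."""
    lock = TtlLock(ttl=120)
    writes = []

    a_thinks_it_owns = lock.acquire("A", now=0)    # A acquires at t=0
    # A stalls (sleep, GC pause, slow model call) until t=300.
    b_thinks_it_owns = lock.acquire("B", now=130)  # TTL lapsed, B acquires
    if b_thinks_it_owns:
        writes.append("B")                         # B writes at t=130
    # A wakes at t=300 and never re-checks: nothing stops its write.
    if a_thinks_it_owns:
        writes.append("A")
    return writes


print(simulate_stall())  # ['B', 'A']
```

Both machines acquired the lock legitimately, and the stale write from A still lands last. A longer TTL only moves the point at which this happens.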
The idea of a lease
Reframe the lock as a lease: time-bounded ownership that is always assumed to expire. The holder must explicitly renew it before it lapses. The moment renewal stops, ownership is considered surrendered automatically.
The key property is that each time a lease is granted, the issued fencing token increases monotonically. Every acquisition produces a strictly larger token. That lets you distinguish an "old owner" from a "new owner" right before a write, using nothing but a numeric comparison.
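As a minimal sketch of those two rules (renew or lose it, and a strictly larger token on every fresh grant), here is an in-memory model. `LeaseTable` and the host names are hypothetical and stand in for whatever shared storage holds the lease. One assumption is made explicit in the code: a renewal that arrives after expiry is refused.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    token: int
    expires_at: int


class LeaseTable:
    """In-memory stand-in for the shared lease store (hypothetical)."""

    def __init__(self, ttl: int) -> None:
        self.ttl = ttl
        self.current: Optional[Lease] = None
        self.last_token = 0  # never decreases

    def acquire(self, holder: str, now: int) -> Optional[Lease]:
        if self.current and now < self.current.expires_at:
            return None  # still held by someone else
        self.last_token += 1  # every fresh grant gets a strictly larger token
        self.current = Lease(holder, self.last_token, now + self.ttl)
        return self.current

    def renew(self, holder: str, token: int, now: int) -> bool:
        cur = self.current
        if cur is None or cur.holder != holder or cur.token != token:
            return False  # ownership moved on; the caller must stop
        if now >= cur.expires_at:
            return False  # assumption: a late renewal counts as surrendered
        cur.expires_at = now + self.ttl  # renewal keeps the same token
        return True


table = LeaseTable(ttl=120)
first = table.acquire("mac-a", now=0)
assert table.acquire("mac-b", now=60) is None      # lease still live
assert table.renew("mac-a", first.token, now=100)  # pushed out to t=220
second = table.acquire("mac-b", now=400)           # mac-a stopped renewing
print(first.token, second.token)                   # 1 2
```

Renewal never changes the token; only a fresh grant does. That is what makes the token usable as an ordering of owners.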
| Aspect | Plain mutual-exclusion lock | Lease + fencing token |
| --- | --- | --- |
| Recovery from a stall | The old holder writes anyway | The old token is rejected at the write site |
| Ownership decision | Relies on "I think I hold it" | Decided mechanically by number size |
| Effect of clock skew | Breaks if TTL judgment drifts | Order is set by the token, not the clock |
| Required assumption | Everyone behaves honestly | The write target can verify the token |
First, represent the lease on shared storage. For clarity I show a file-based version (a shared directory or a small KV). The essentials: increase the token monotonically on acquire, and keep renewing via heartbeat while you hold it.
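#!/usr/bin/env bash
# acquire_lease.sh: acquire a lease (exit non-zero if not acquired)
set -euo pipefail

LEASE_DIR="${LEASE_DIR:-/shared/leases}"
JOB="${1:?usage: acquire_lease.sh JOB}"   # job name
HOLDER="$(hostname)-$$"                   # machine + PID
TTL_SEC="${TTL_SEC:-120}"                 # lease lifetime
NOW="$(date +%s)"
LEASE="${LEASE_DIR}/${JOB}.lease"
TOKENF="${LEASE_DIR}/${JOB}.token"
MUTEX="${LEASE_DIR}/${JOB}.mutex"
mkdir -p "$LEASE_DIR"

# Serialize check-and-acquire so two machines cannot both read the same token.
# mkdir either creates the directory or fails; this assumes the shared storage
# keeps that guarantee. A hard kill inside this window leaves the directory
# behind, and it then has to be removed by hand.
mkdir "$MUTEX" 2>/dev/null || { echo "lease update in progress" >&2; exit 11; }
trap 'rmdir "$MUTEX" 2>/dev/null || true' EXIT

# Is an existing lease still alive?
if [ -f "$LEASE" ]; then
  EXP="$(awk -F= '/^expires=/{print $2}' "$LEASE")"
  if [ "${EXP:-0}" -gt "$NOW" ]; then
    echo "lease held until ${EXP}, now ${NOW}" >&2
    exit 11   # still valid -> give up acquiring
  fi
fi

# It had expired, so acquire. Increase the token monotonically (this is the point).
TOKEN="$(( $(cat "$TOKENF" 2>/dev/null || echo 0) + 1 ))"
echo "$TOKEN" > "$TOKENF"
cat > "${LEASE}.tmp.$$" << LEASEDATA
holder=${HOLDER}
token=${TOKEN}
expires=$(( NOW + TTL_SEC ))
LEASEDATA
mv -f "${LEASE}.tmp.$$" "$LEASE"   # atomic swap
echo "$TOKEN"                      # return the acquired token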
The atomic swap via mv -f and the monotonic token file are what matter. The token is "how many times the lease has been freshly granted," and it never decreases.
While holding, a separate process (or a background loop in the same one) sends heartbeats, pushing expires forward.
#!/usr/bin/env bash
# renew_lease.sh: extend the TTL only if you are still the holder
set -euo pipefail

LEASE_DIR="${LEASE_DIR:-/shared/leases}"   # same default as acquire_lease.sh
LEASE="${LEASE_DIR}/$1.lease"; HOLDER="$2"; MY_TOKEN="$3"; TTL_SEC="${TTL_SEC:-120}"

# A missing lease file means there is nothing left to renew
[ -f "$LEASE" ] || { echo "lost lease (file missing)" >&2; exit 12; }

CUR_HOLDER="$(awk -F= '/^holder=/{print $2}' "$LEASE")"
CUR_TOKEN="$(awk -F= '/^token=/{print $2}' "$LEASE")"

# If either the holder or token differs, you no longer own it
if [ "$CUR_HOLDER" != "$HOLDER" ] || [ "$CUR_TOKEN" != "$MY_TOKEN" ]; then
  echo "lost lease (holder/token changed)" >&2
  exit 12
fi

cat > "${LEASE}.tmp.$$" << LEASEDATA
holder=${HOLDER}
token=${MY_TOKEN}
expires=$(( $(date +%s) + TTL_SEC ))
LEASEDATA
mv -f "${LEASE}.tmp.$$" "$LEASE"
When renewal returns exit 12, stop the work right there. Writing an artifact after losing the lease is exactly the moment double execution corrupts output.
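For agents whose job body is Python rather than shell, the same "stop when renewal fails" rule might look like the sketch below. `run_with_heartbeat` is a hypothetical helper, and `renew` stands for any callable that returns False once the lease is lost (for example, a wrapper around renew_lease.sh).

```python
import threading
import time
from typing import Callable, List


def run_with_heartbeat(
    renew: Callable[[], bool],
    steps: List[Callable[[], None]],
    interval: float,
) -> bool:
    """Run steps in order; abandon the rest as soon as a renewal fails."""
    lost = threading.Event()
    done = threading.Event()

    def heartbeat() -> None:
        # wait() returns False on timeout and True once the job has finished
        while not done.wait(interval):
            if not renew():
                lost.set()
                return

    t = threading.Thread(target=heartbeat, daemon=True)
    t.start()
    try:
        for step in steps:
            if lost.is_set():
                return False  # lease gone: stop before doing more work
            step()
        return not lost.is_set()
    finally:
        done.set()
        t.join()


# Demo: the lease is lost during the first step, so the write never runs.
log: List[str] = []
ok = run_with_heartbeat(
    renew=lambda: False,
    steps=[lambda: time.sleep(0.2), lambda: log.append("write")],
    interval=0.05,
)
print(ok, log)  # False []
```

This only checks between steps and cannot interrupt a step that is already running, so a stalled process can still reach its write with a dead lease. The heartbeat reduces wasted work; it does not replace verification at the write.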
Verifying the fencing token at the write site
This is the heart of the design. Holding the lease does not mean "you may write." Write only when the target confirms your token is at least the largest it has seen. That reliably rejects an old holder returning from a stall.
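# fenced_write.py: write an artifact with token verification
import argparse
import fcntl
import os
import tempfile


def fenced_write(target_path: str, token: int, payload: bytes) -> bool:
    """Write only if token is >= the largest token observed so far."""
    guard = target_path + ".fence"  # records the last token written
    # Create the fence with the same 0600 mode that mkstemp gives the artifact.
    guard_fd = os.open(guard, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(guard_fd, "r+") as g:
        # Hold an exclusive lock across check and write so two writers cannot
        # interleave. Assumes the shared filesystem honors flock().
        fcntl.flock(g, fcntl.LOCK_EX)
        last = int(g.read().strip() or "0")
        if token < last:
            # Reject a write from an old holder (a machine returning from a stall)
            raise PermissionError(f"stale token {token} < last {last}; refusing write")
        # Advance the fence first, then atomically replace the body
        g.seek(0)
        g.truncate()
        g.write(str(token))
        g.flush()
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(target_path) or ".")
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp, target_path)  # atomic swap
    return True


if __name__ == "__main__":
    # Command-line entry point for the --token/--target flags used by the job
    # wrapper. The payload source (--payload-file) is an assumption of this sketch.
    parser = argparse.ArgumentParser()
    parser.add_argument("--token", type=int, required=True)
    parser.add_argument("--target", required=True)
    parser.add_argument("--payload-file", required=True)
    args = parser.parse_args()
    with open(args.payload_file, "rb") as src:
        fenced_write(args.target, args.token, src.read())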
By rejecting token < last, even a stale machine that still believes it is the holder cannot land its write. Deciding ownership at the instant of the write, using only a numeric comparison, is the strength of this pattern.
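To see the comparison do its work, here is the stall scenario replayed against an in-memory write target. `FencedStore` and the payloads are hypothetical. The rule is the same one the file-based version applies, with the filesystem removed so the trace is easy to follow.

```python
class FencedStore:
    """In-memory stand-in for a write target that checks fencing tokens."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value = b""

    def write(self, token: int, payload: bytes) -> None:
        if token < self.highest_token:
            raise PermissionError(
                f"stale token {token} < last {self.highest_token}; refusing write"
            )
        self.highest_token = token
        self.value = payload


store = FencedStore()
store.write(1, b"draft from mac-a")           # mac-a holds token 1
store.write(2, b"final from mac-b")           # mac-a stalled; mac-b got token 2
try:
    store.write(1, b"late write from mac-a")  # mac-a wakes up with token 1
except PermissionError as err:
    print("rejected:", err)
print(store.value)  # b'final from mac-b'
```

The store never asks who the caller is or what time it is. The token alone settles the question, which is why clock skew between machines does not matter here.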
Wrapping the whole job into one flow
Lining up acquire, renew, and verified write, the agent's job body wraps like this.
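export LEASE_DIR=/shared/leases   # exported so renew_lease.sh sees it too
JOB=daily-publish
TOKEN="$(./acquire_lease.sh "$JOB")" || { echo "another holder is running; exiting cleanly"; exit 0; }
# Read the holder back from the lease: acquire_lease.sh built it from its own PID, not this shell's
HOLDER="$(awk -F= '/^holder=/{print $2}' "${LEASE_DIR}/${JOB}.lease")"

# heartbeat in the background; if renewal fails, signal the main shell so the job stops
( while sleep 45; do ./renew_lease.sh "$JOB" "$HOLDER" "$TOKEN" || { kill -TERM $$; exit 1; }; done ) &
HB=$!
trap 'kill "$HB" 2>/dev/null || true' EXIT

# do the agent's real work here (generate, format); the draft file is a placeholder
DRAFT="$(mktemp)"
run_agent_job > "$DRAFT"

# the artifact write must go through token verification
python3 fenced_write.py --token "$TOKEN" --target /shared/out/today.json --payload-file "$DRAFT"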
If acquisition fails, exit cleanly (exit 0). It simply was not your turn, so there is no reason to raise an error. When routine skips trigger alerts, the noise buries the failures that are actually abnormal.
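If the scheduler needs to tell "not my turn" apart from a real failure, one option is to branch on the exit codes the scripts already use (11 from acquire, 12 from renew). The function name and the action labels below are hypothetical; this is a sketch of the policy, not something the scripts require.

```python
EXIT_LEASE_HELD = 11  # acquire_lease.sh: someone else holds a live lease
EXIT_LEASE_LOST = 12  # renew_lease.sh: ownership changed under us


def classify(returncode: int) -> str:
    """Map a lease script's exit status to what the caller should do."""
    if returncode == 0:
        return "proceed"
    if returncode == EXIT_LEASE_HELD:
        return "skip"   # not our turn: exit 0, no alert
    if returncode == EXIT_LEASE_LOST:
        return "abort"  # stop work now and do not write
    return "alert"      # anything else is a genuine failure


for code in (0, 11, 12, 2):
    print(code, classify(code))
```

Only the last category should page anyone. A skip is the system working as designed.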
A small judgment that paid off in solo operation
Running scheduled jobs across two Macs taught me how much it matters to place the verification gate as far downstream as possible. Early on I was content with a lock at job start and did nothing right before the write. Accidents always happen in the long gap between start and write.
The other lesson: keep the fence file (.fence) in the same place and with the same permissions as the artifact. Put it elsewhere and one side may sync while the other does not, rewinding the token memory and turning the verification itself into a lie. In the Dolice Labs setup, things stabilized once I always co-located artifact, fence, and lease on the same shared target, with a small rule to exclude only the fence from backups.
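A small preflight check can catch a fence that has drifted away from its artifact. The sketch below assumes the `<artifact>.fence` naming used by fenced_write.py. The function name is hypothetical, and comparing device, mode bits, and owner is one reasonable reading of "same place and same permissions".

```python
import os
import stat


def fence_matches_artifact(artifact: str) -> bool:
    """Check that <artifact>.fence sits beside the artifact with equal access."""
    fence = artifact + ".fence"
    if not (os.path.exists(artifact) and os.path.exists(fence)):
        return False
    a, f = os.stat(artifact), os.stat(fence)  # follows symlinks
    same_fs = a.st_dev == f.st_dev            # catches a fence symlinked elsewhere
    same_mode = stat.S_IMODE(a.st_mode) == stat.S_IMODE(f.st_mode)
    same_owner = (a.st_uid, a.st_gid) == (f.st_uid, f.st_gid)
    return same_fs and same_mode and same_owner
```

Running this once at job start, and refusing to continue when it returns False, turns a silent token rewind into a visible failure.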
Deciding how far to take it
Not every job needs this. For idempotent read-only jobs, or ones where "last writer wins" is fine, a lease is overkill. The cost is worth it when three conditions overlap: artifacts that corrupt if written halfway, schedules that can fire from two or more machines, and devices where stalls and sleep happen daily.
Conversely, when all three are present and you run on a plain lock, the accident is only a matter of time. Rather than lowering the probability, make old writes structurally unreachable — that, I believe, is the role of leases and fencing tokens.
As a first step, add fence verification to just the single artifact whose corruption hurts the most. Run it overnight and compare the number in .fence with the number of scheduled runs: if the token climbed more often than the job was scheduled, or a stale-token rejection shows up in the logs, double execution had been happening all along.