Letting a Background Agent Work Overnight Without Regretting It by Morning — Guardrails for Unattended Runs

When you hand overnight refactoring to Antigravity's Background Agent, the morning brings as much anxiety as convenience. From three angles — blast radius, completion criteria, and detecting silent regressions — here are the guardrails that let me run unattended jobs with confidence.

The first time overnight automation burned me, the agent was being helpful. It found two near-duplicate utility functions and merged them into one. The trouble was that one version assumed callers swallowed errors, while the other threw exceptions. Every test stayed green, the diff read cleanly, and the commit message was polite. Trusting all of that, morning-me merged it — and that afternoon a production job quietly stopped.

The danger of unattended work isn't that it fails loudly. It's that it looks like it succeeded. If a human is in the review loop, that kind of mix-up surfaces in conversation. But a Background Agent running overnight makes confident mistakes at an hour when nobody is watching and nobody can stop it. Below are the design choices that, across roughly thirty unattended nights with Antigravity's Background Agent, actually reduced the damage. This isn't a story about impressive automation — it's a record of the unglamorous mechanisms that keep you from regretting things by morning.

Decide the blast radius before the scope of work

When designing a nightly task, it's tempting to start from "what should it do." I had the order backwards. The first thing to settle is how much can break if it fails — the blast radius.

When you maintain several repositories alone as an indie developer, you start wanting to hand the agent, overnight, the upkeep you never get around to. The principle I landed on is simple: fence the agent in three ways. First, never let it touch main; it always works on a disposable branch. Second, name the files it may rewrite in an allowlist, and forbid writes anywhere else. Third, cap the amount of diff a single run may produce per file. All three work regardless of how clever the agent is.

I write those fences directly into the task definition.

# .antigravity/tasks/nightly-maintenance.md
 
## Goal
Improve readability and type safety of allowlisted files WITHOUT changing behavior.
 
## Where you may write (writing anywhere else is forbidden)
- src/lib/**/*.ts
- src/utils/**/*.ts
 
## Never do this
- Change any public API signature (params, return type, exception kind)
- Commit directly to main / develop
- Add or update dependencies (package.json is read-only)
- Produce more than 120 lines of diff per file
 
## Definition of done
- Existing tests stay green
- For each changed file, one line stating what changed and why
- If you can't satisfy the above, finish with "no change" and record the reason

What matters is giving "Never do this" and "Definition of done" the same weight as the goal itself. An agent optimizes toward its objective, so if you leave the constraints fuzzy, it will sacrifice the constraints to reach the goal. After getting a public API changed out from under me, I now spend an explicit line forbidding signature changes.

Fence it at launch too — don't over-trust the prompt

Constraints in the task definition are still only a request. Since this runs unattended, I also keep mechanically enforceable fences in the launch script. It wakes a session through the Antigravity CLI and cuts a working branch before handing anything over.

#!/usr/bin/env bash
# scripts/nightly-agent.sh — invoked by cron, one task per launch
set -euo pipefail
 
TASK_FILE="$1"                       # e.g. .antigravity/tasks/nightly-maintenance.md
DATE="$(date +%Y%m%d)"
BRANCH="agent/nightly-${DATE}-$(basename "$TASK_FILE" .md)"
 
# Disposable working branch. main stays a clean starting point.
git fetch origin main --quiet
git switch -c "$BRANCH" origin/main
 
# Start the session; receive the result as JSON when done.
SESSION_ID="$(antigravity sessions create \
  --task "$TASK_FILE" \
  --sandbox isolated \
  --timeout 45m \
  --max-output-tokens 64000 \
  --format json | jq -r '.session_id')"
 
echo "started ${SESSION_ID} on ${BRANCH}"
 
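antigravity sessions wait "$SESSION_ID" --timeout 50m || {
  echo "session did not finish cleanly: ${SESSION_ID}" >&2
  # Throw away half-done work, branch and all. --discard-changes drops any
  # uncommitted edits that would otherwise block the switch back to main.
  git switch --discard-changes main && git branch -D "$BRANCH"
  exit 0
}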

Two things matter here. Always set a timeout. And discard, without inspection, the output of any session that didn't finish cleanly. The time it takes to pick through half-processed code in the morning costs more than just re-running the job. Early on I'd salvage partial work because throwing it away felt wasteful; it was almost never usable, and all it left me was decision fatigue.

Don't make "tests are green" the completion gate

As the opening failure shows, test results alone won't catch a quiet regression. The thinner the tests in an area, the more freely the agent can behave there. So before deciding whether a branch is mergeable, I run it through a light inspection layer. The aim is to confirm, from an angle other than tests, that behavior really hasn't changed.

What I actually use is a gate that inspects the shape of the change.

#!/usr/bin/env python3
# scripts/diff_guard.py: mechanically inspect an overnight branch's diff
import re
import subprocess
import sys

BASE = "origin/main"
MAX_FILE_LINES = 120          # keep this in sync with the task definition
# Matches added/removed diff lines that declare an exported (public) function.
PUBLIC_SIG = re.compile(r'^[+-]\s*export\s+(async\s+)?function\s+\w+')

def changed_files(branch):
    """Return [added, removed, path] for each file changed since BASE."""
    # --no-renames reports a rename as delete + add, so every path is a real path.
    out = subprocess.run(
        ["git", "diff", "--numstat", "--no-renames", f"{BASE}...{branch}"],
        capture_output=True, text=True, check=True).stdout
    # maxsplit=2 keeps the path in one piece
    return [l.split("\t", 2) for l in out.splitlines() if l.strip()]

def main(branch):
    problems = []
    for added, removed, path in changed_files(branch):
        if added == "-" or removed == "-":          # binary
            problems.append(f"binary touched: {path}")
            continue
        if int(added) > MAX_FILE_LINES:
            problems.append(f"{path}: +{added} lines exceeds cap {MAX_FILE_LINES}")
        d = subprocess.run(["git", "diff", f"{BASE}...{branch}", "--", path],
                           capture_output=True, text=True, check=True).stdout
        sigs = [l for l in d.splitlines() if PUBLIC_SIG.match(l)]
        plus = sum(1 for l in sigs if l.startswith("+"))
        minus = len(sigs) - plus
        # A matched +/- pair is likely a rename or in-place edit. Unequal counts
        # mean a public function appeared or vanished on one side only.
        # (A plain odd/even test would let two removals slip through.)
        if plus != minus:
            problems.append(f"{path}: possible one-sided public signature change")
    if problems:
        print("BLOCK:")
        for p in problems:
            print("  -", p)
        sys.exit(1)
    print("diff_guard: ok")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: diff_guard.py <branch>")
    main(sys.argv[1])

This gate does nothing clever. It looks at two things: the size of the diff, and whether changes to public functions come in matched pairs. Even so, the kind of change that bit me in the opening — a function that disappeared from only one side — stops right here. The moment you try to reason about meaning, the check itself becomes brittle, so for unattended operation I settled on catching only anomalies that can be judged mechanically. That trade kept things stable.
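The allowlist from the task definition could be checked in the same mechanical way, since the gate as written does not look at which paths were touched. What follows is a minimal sketch, assuming the two glob patterns from the task file are copied by hand into a Python list and that `**/` should match zero or more directories. `glob_to_regex` and `outside_allowlist` are hypothetical helper names invented for this example.

```python
import re

# Assumed to mirror "Where you may write" in the task definition.
ALLOWLIST = ["src/lib/**/*.ts", "src/utils/**/*.ts"]

def glob_to_regex(pattern):
    """Translate a small glob subset: '**/' spans directories, '*' stays in one segment."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:.*/)?")    # zero or more whole directories
            i += 3
        elif pattern[i] == "*":
            parts.append(r"[^/]*")       # anything except a path separator
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")

def outside_allowlist(paths, allowlist=ALLOWLIST):
    """Return the changed paths that no allowlist pattern matches."""
    compiled = [glob_to_regex(p) for p in allowlist]
    return [p for p in paths if not any(rx.match(p) for rx in compiled)]

if __name__ == "__main__":
    changed = ["src/lib/date/format.ts", "src/utils/retry.ts", "package.json"]
    for path in outside_allowlist(changed):
        print(f"outside allowlist: {path}")
```

The input paths would come from the path column of `git diff --numstat`, and any path this returns could be appended to the gate's list of problems. The translation is hand-rolled because plain `fnmatch` lets `*` cross directory separators, which would make the patterns behave differently from how they read in the task file.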

I also watch the coverage delta. If coverage dropped after the change, the agent probably touched a poorly-tested area — or loosened a test to "make it pass." I compare vitest run --coverage against the prior night's number, and if it clearly fell, that branch goes on hold.
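One way to make "clearly fell" mechanical is to compare two coverage summaries with a tolerance. The sketch below assumes coverage is written with an Istanbul-style `json-summary` reporter, where `total.lines.pct` holds the overall line-coverage percentage. The 0.5-point tolerance and the names `line_pct`, `coverage_verdict`, and `load_summary` are illustrative assumptions.

```python
import json

TOLERANCE = 0.5   # percentage points; an illustrative threshold, tune to taste

def line_pct(summary):
    """Pull the overall line-coverage percentage out of a json-summary dict."""
    return float(summary["total"]["lines"]["pct"])

def coverage_verdict(previous, current, tolerance=TOLERANCE):
    """Return 'HOLD' when line coverage fell by more than the tolerance, else 'OK'."""
    drop = line_pct(previous) - line_pct(current)
    return "HOLD" if drop > tolerance else "OK"

def load_summary(path):
    """Read a summary file (assumed location: coverage/coverage-summary.json)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

if __name__ == "__main__":
    last_night = {"total": {"lines": {"pct": 81.4}}}
    tonight = {"total": {"lines": {"pct": 79.9}}}
    print(coverage_verdict(last_night, tonight))
```

A small tolerance keeps ordinary noise, such as a removed dead branch shifting the percentage slightly, from putting every branch on hold.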

Idempotent task definitions that survive a re-run

Nightly jobs fail: sandbox timeouts, transient network errors, a CLI update. When they fail, I want to re-run. But a naive setup will apply the same formatting twice, or split an already-split function again — double application.

The way around it is to write the task as a declaration that "brings the current state closer to the target state." Not "split this function," but "this function should have a single responsibility; if it already does, do nothing." Because the agent observes the current state before acting, you can run the same task any number of times and it converges. On the launch side, a short check that skips the run when today's branch of the same name already exists stopped cron's duplicate triggers from causing accidents.

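# add to nightly-agent.sh after BRANCH is set and before `git switch -c`
# (this checks origin, so it only sees branches that have already been pushed)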
if git ls-remote --exit-code --heads origin "$BRANCH" >/dev/null 2>&1; then
  echo "branch ${BRANCH} already exists — skip"
  exit 0
fi
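The difference between an imperative step and a converging declaration can be shown with a toy transformation. This only illustrates the idea and says nothing about how the agent works internally. The marker string and both function names are made up for the example.

```python
MARKER = "// reviewed-by-nightly-agent\n"   # hypothetical marker, for illustration only

def add_marker(source):
    """Imperative: always prepends, so a re-run applies the change twice."""
    return MARKER + source

def ensure_marker(source):
    """Declarative: observes the current state and acts only if the target is not met."""
    return source if source.startswith(MARKER) else MARKER + source

if __name__ == "__main__":
    code = "export function parse() {}\n"
    once = ensure_marker(code)
    twice = ensure_marker(once)
    print("ensure_marker converges:", once == twice)
    print("add_marker converges:", add_marker(code) == add_marker(add_marker(code)))
```

A task phrased like `ensure_marker` can be retried after a timeout without any bookkeeping about whether the previous attempt got partway through.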

Finish the morning triage in a few minutes

The payoff of unattended runs only materializes when the morning check is light. If three or four branches arrive every night, reading each one carefully isn't sustainable. I start with the machine's first verdict and reserve human attention for the branches that actually need judgment.

#!/usr/bin/env bash
# scripts/morning-review.sh — clear overnight branches at a glance
git fetch origin --quiet
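# Matches branches stamped with today's date; a run launched before midnight
# carries the previous day's date and would need that date here instead.
for b in $(git branch -r | grep "origin/agent/nightly-$(date +%Y%m%d)"); do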
  name="${b#origin/}"
  files=$(git diff --name-only origin/main..."$b" | wc -l | tr -d ' ')
  if python3 scripts/diff_guard.py "$b" >/dev/null 2>&1; then
    verdict="REVIEW"          # passed the gate; on to human judgment
  else
    verdict="REJECT"          # blocked by the gate; discard by default
  fi
  printf "%-8s %-45s files=%s\n" "$verdict" "$name" "$files"
done

I don't open REJECT branches unless there's a strong reason — going to read them pulls you toward "but it's so close" and melts your time. I look only at REVIEW ones, leaning on the per-file summaries the task definition required, and merge when convinced, discard when unsure. Making discard-when-unsure the default keeps overnight automation an asset rather than a source of decision fatigue. If a discarded change is truly needed, it gets proposed again the next night.

What thirty nights actually produced

There were no dramatic wins. Of the branches generated each night, I merge a little over half. The rest get blocked by the diff gate or discarded at triage because I can't be sure behavior is unchanged. Even so, the steady drip of "maintenance I knew I should do but kept postponing" — filled-in comments, small function splits, added type annotations — making progress on its own was a real benefit.

The biggest shift might be where I now place my trust in the codebase. I used to hope the agent would fix things cleverly. Now I design so that even when it errs cleverly, the damage stays inside the fence. What you should demand of something that runs unattended isn't a high ceiling on what it can do but a shallow floor under its failures. It took a few cold-sweat mornings before that finally sank in.

If you want to try overnight automation, start the first night with an absurdly narrow fence: one file allowed, fifty lines of diff at most. It should feel too cramped to be useful — that's about right — and as you widen the fence gradually, you'll naturally come to see which parts of your codebase you can hand to an agent. Thanks for reading.
