Three Boundaries I Draw Before Handing Work to an Antigravity 2.0 Agent

What to hand a background agent, and what to keep in your own hands. The three boundaries I actually drew while running solo-dev automation in parallel, and how to encode them so the lines hold.

On June 18, Gemini CLI stopped serving requests for individual and AI Pro / Ultra users. I had a few routine jobs running on Gemini CLI, so I spent the day moving them onto Antigravity 2.0 background agents.

The first thing I ran into wasn't "how do I move this." It was "should this run unattended at all?" Antigravity 2.0 can take a single prompt all the way from plan to implementation, test, and deploy, and it runs several agents in parallel. The wider the surface you can delegate, the more deliberately you have to mark the surface you must not — otherwise convenience quietly becomes your failure rate.

On the very first day I had a near miss. One agent, while "cleaning up artifacts that were no longer needed," listed a cache that should never have been touched as a deletion candidate. I had a human checkpoint just before execution, so nothing broke. Had that step been automatic, I would not have noticed in time. That moment convinced me to write down the three things I would not delegate before tuning how I delegate the rest. This article is that record.

Why the "don't delegate" line comes before the "how to delegate" one

The better your automation runs, the less you look at it. After ten clean runs in a row, you want to wave the eleventh through unseen. But that unseen eleventh run is exactly where an incident slips through.

So the safety of automation isn't set by its success rate on a good day. It's set by where it stops on a bad one, and by who notices. That's why the starting point of the design should be "where does this hand back to a human," not "how do I run it faster." Draw the boundaries first, and you can widen the automated surface with confidence. Do it in the wrong order and you tend to get scared after the fact and roll everything back to manual — the long way around.

Boundary 1 — Let the agent prepare irreversible actions, but keep the trigger

The first line splits actions by whether they can be undone. Running tests, building, drafting output: you can redo those if they go wrong. Pushing, deploying to production, deleting, publishing: once those run, undoing them is a separate and often expensive job.

In my setup the agent owns everything up to and including reversible actions. For irreversible ones it stops at "prepared." It builds the diff, tidies the commit message, lays out the output for review — and a human pulls the trigger. Stated as a verbal rule this always erodes, so I pin it in code.

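# Classify an action by whether it can be undone,
# and never let the agent run irreversible ones itself.
IRREVERSIBLE = {"push", "deploy", "delete", "drop", "publish", "purge"}
# Verbs known to be safe to redo (illustrative allowlist).
# Anything outside both sets is treated as ambiguous.
REVERSIBLE = {"build", "run", "test", "draft", "lint"}

def classify(action: str) -> str:
    tokens = action.lower().split()
    # Scan every token, not only the first, so "git push origin main" is caught.
    if any(t in IRREVERSIBLE for t in tokens):
        return "irreversible"
    # Empty or unrecognized commands are ambiguous, so they err to the safe side.
    if not tokens or tokens[0] not in REVERSIBLE:
        return "irreversible"
    return "reversible"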
 
def gate(action: str) -> dict:
    kind = classify(action)
    if kind == "irreversible":
        # The agent prepares; the human triggers the final run.
        return {"action": action, "auto_run": False, "needs_human": True}
    return {"action": action, "auto_run": True, "needs_human": False}
 
if __name__ == "__main__":
    for a in ["build site", "git push origin main", "run tests", "deploy production"]:
        r = gate(a)
        flag = "needs-human" if r["needs_human"] else "auto"
        print(f"{flag:12} {a}")
 
# Output:
# auto         build site
# needs-human  git push origin main
# auto         run tests
# needs-human  deploy production

The point is not to make the classifier perfect. When a verb is ambiguous, the ambiguity itself sends it to the irreversible side. Erring safe costs you one extra confirmation; erring unsafe costs you a broken production system. Those costs are not the same size. To untangle the dependencies the Gemini CLI shutdown leaves behind, I also wrote up an audit of automation dependencies that you can read alongside this migration.

Boundary 2 — If you can't state the reason for "correct" in one sentence, don't delegate it

The second line is about the nature of the task. Before handing anything off, I check whether I can write the reason a result is correct in a single sentence.

"Proceed if the tests are green" has a clear reason: there's an observable criterion, green or red. "Polish the prose so it reads well" does not — the standard lives only in my head. Hand a task with an unstateable standard to an agent and it returns something plausible, but you end up reading every output to check it against your intent. That hollows out the reason you automated in the first place.

When I can't state the reason in one sentence, I take it as a sign the standard isn't settled in my own mind yet. Rather than force the delegation, I do the task by hand a few times, put the standard into words, reduce it to observable conditions, and only then hand it over. Keep that order and the stop conditions below tend to write themselves.

Boundary 3 — If detecting failure costs more than doing the work, do it yourself

The third line is a cost comparison. Whether to delegate is a trade between the work you offload and the burden of finding its failures.

Some tasks execute quickly but take three times as long to verify. Hand off a config change that spans several files and the change itself is fast, but the review ("did it touch anything it shouldn't have?") is slow. When detection costs more than execution, doing it yourself is faster in the end.

Gut feeling makes this comparison too generous, so before delegating I ask: "if this were wrong, how many minutes until I'd notice?" Only tasks with a fast tell — a CI failure alert, a cap on diff line count, a secret-scanning check — go inside the automated loop. Tasks without that machinery stay outside it. Delegating with no detection in place is the same as automating with your eyes shut.
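
One way to make that question harder to fudge is to write the estimates down. The sketch below is a minimal illustration rather than a measured model: the `Task` fields, the example minutes, and the rule itself (delegate only when a detector exists and noticing a failure takes no longer than doing the work by hand) are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    manual_minutes: float   # estimated time to do the work by hand
    detect_minutes: float   # estimated time until a wrong result is noticed
    has_detector: bool      # is there an automated tell (CI alert, diff cap, secret scan)?

def should_delegate(t: Task) -> bool:
    # No detection machinery: the task stays outside the automated loop.
    if not t.has_detector:
        return False
    # Delegate only when noticing a failure is no slower than doing the work.
    return t.detect_minutes <= t.manual_minutes

# Hypothetical estimates, for illustration only.
tasks = [
    Task("run tests", manual_minutes=10, detect_minutes=1, has_detector=True),
    Task("multi-file config change", manual_minutes=5, detect_minutes=15, has_detector=True),
    Task("purge old artifacts", manual_minutes=5, detect_minutes=60, has_detector=False),
]

for t in tasks:
    verdict = "delegate" if should_delegate(t) else "keep"
    print(f"{verdict:9} {t.name}")
```

The numbers will be rough guesses at first. The value is in being forced to write a detection estimate at all, since a blank in that field is itself the answer.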

How the split actually came out

Here's how the three boundaries sorted the tasks I was migrating. This is the judgment of an indie developer running several sites in parallel, so the lines move with scale — but the reasoning may be useful.

Task	Degree of delegation	Reason
Drafting articles	Fully delegated	Reversible; the standard is observable via the quality gate
Running tests / builds	Fully delegated	Result is clear (green/red); failures surface instantly
Diff creation / commit shaping	Prepared only	Human checks just before push
Production deploy / publish	Human pulls trigger	Irreversible; falls under Boundary 1
Deleting / purging artifacts	Candidates only	Mis-deletion is expensive to detect; Boundary 3
Tuning prose "readability"	Not delegated	Can't state the reason in one sentence; Boundary 2

Building the table, what struck me was that the answer wasn't a binary of "fully delegated" or "not delegated." The middle ground, where the agent prepares and a human decides, was the largest group: three of the six rows. The value of an agent isn't taking the final decision away from you; it's getting everything ready right up to that decision.

Pin completion in code — don't leave "done" to the agent

Once the boundaries are drawn, take back the definition of "done." The most dangerous thing about a background agent is that the agent itself reports "finished." Take that at face value and you miss the conditions it didn't meet.

So judge completion by observable conditions, not self-report. Only when every stop condition is met is the task "done"; if even one is missing, it goes back to a human review queue.

# Don't leave "done" to the agent's self-report;
# judge it by observable conditions.
from dataclasses import dataclass
 
@dataclass
class StopConditions:
    tests_passed: bool         # is CI green?
    diff_within_budget: bool   # are changed lines within the expected range?
    no_new_secrets: bool       # no keys or tokens slipped in?
 
def is_done(c: StopConditions) -> bool:
    # only "done" when every condition holds
    return c.tests_passed and c.diff_within_budget and c.no_new_secrets
 
def review(c: StopConditions) -> str:
    if is_done(c):
        return "done"
    # if any condition is missing, hand back with the gap named
    missing = [k for k, v in vars(c).items() if not v]
    return "needs_review: " + ", ".join(missing)
 
state = StopConditions(
    tests_passed=True,
    diff_within_budget=False,
    no_new_secrets=True,
)
print(review(state))
# Output: needs_review: diff_within_budget

The nice property here is that "done" becomes a set of reproducible conditions rather than my subjective sense. Add or drop a condition, and anyone reading the code can see at a glance what counts as complete. I dig further into verifying an agent's completion report in "how to confirm a background agent is actually done," and "capping ephemeral worker cost" is a useful companion for putting a cost ceiling on disposable workers.
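
As one illustration of how a condition like `diff_within_budget` might be computed rather than asserted, the sketch below counts added and removed lines in unified-diff text. The function names, the 200-line default budget, and the sample diff are hypothetical; a real pipeline would feed in the output of its own diff tool.

```python
def changed_lines(diff_text: str) -> int:
    """Count added and removed lines in a unified diff, skipping file headers."""
    count = 0
    for line in diff_text.splitlines():
        # File header lines ("--- a/x", "+++ b/x") are not content changes.
        # Known limitation: a removed line whose content starts with "--"
        # looks like a header and is skipped too.
        if line.startswith(("+++", "---")):
            continue
        if line.startswith(("+", "-")):
            count += 1
    return count

def diff_within_budget(diff_text: str, max_lines: int = 200) -> bool:
    # True when the change is no larger than the agreed budget.
    return changed_lines(diff_text) <= max_lines

# Hypothetical sample diff: one line removed, one line added.
sample = """\
--- a/config.yml
+++ b/config.yml
@@ -1,3 +1,3 @@
 name: site
-timeout: 30
+timeout: 60
 retries: 3
"""

print(changed_lines(sample))          # 2
print(diff_within_budget(sample, 1))  # False
```

The result can be passed straight into the `diff_within_budget` field, so the condition reflects a measurement instead of the agent's own estimate.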

What surprised me in practice

After about two months on these three boundaries, three things ran against my expectations.

As "prepared only" grows, so does the number of confirmations. It's safer, but the human checkpoint becomes the bottleneck. I dealt with this by batching confirmations into two windows a day, letting the agent accumulate prepared work in between.
Tasks rejected by Boundary 2 become delegable once the standard is named. Even "readability tuning" crossed inside once I broke it into observable conditions like "keep sentences under ~20 words" and "no more than three passive constructions." The line isn't fixed; it moves as your own understanding sharpens.
Budget as much time for the detection machinery as for the automation itself. I assumed automating the execution was the whole job; in reality, building the "notice failure fast" side took longer. Underweight it and Boundary 3 simply doesn't work.
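
The two readability conditions from the second point can be reduced to checks roughly like the following. This is a deliberately crude sketch: the sentence splitter and the passive-voice pattern are naive heuristics chosen for illustration, and the thresholds simply mirror the ones quoted above.

```python
import re

MAX_WORDS = 20    # flag sentences longer than roughly 20 words
MAX_PASSIVE = 3   # allow at most three passive constructions

# Naive heuristic: a form of "to be" followed by a word ending in -ed or -en.
# It misses irregular participles and will flag some false positives.
PASSIVE = re.compile(
    r"\b(?:is|are|was|were|be|been|being)\s+\w+(?:ed|en)\b", re.IGNORECASE
)

def sentences(text: str) -> list[str]:
    # Naive split on ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def readability_report(text: str) -> dict:
    long_sentences = [s for s in sentences(text) if len(s.split()) > MAX_WORDS]
    passive_count = len(PASSIVE.findall(text))
    return {
        "long_sentences": len(long_sentences),
        "passive_count": passive_count,
        "ok": not long_sentences and passive_count <= MAX_PASSIVE,
    }

draft = "The cache was deleted by the agent. Nothing broke. The report was reviewed."
print(readability_report(draft))
# {'long_sentences': 0, 'passive_count': 2, 'ok': True}
```

A check this rough still serves the purpose: it turns "reads well" into something that returns the same answer every time, which is what makes the task eligible for delegation.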

Looking back, designing where things stop did more to widen the surface I could safely delegate than adding more agents ever did.

A next step

Start by writing your pipeline's tasks on paper and filling in three columns for each: "can it be undone," "can I state the reason for correct in one sentence," and "how many minutes until I'd notice a failure." Just filling those three reveals where the boundaries belong. Encoding them comes after.
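
When the time comes to encode the worksheet, the three columns map naturally onto a small record. The sketch below is one possible encoding, not a prescribed one: the field names, the ten-minute detection threshold, the example rows, and the mapping from answers to a delegation level are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Row:
    task: str
    undoable: bool                      # column 1: can it be undone?
    criterion: str                      # column 2: one-sentence reason for "correct" ("" if none)
    minutes_to_notice: Optional[float]  # column 3: None means no detection exists

def delegation_level(r: Row, max_minutes: float = 10.0) -> str:
    if not r.criterion.strip():
        return "not delegated"    # no stateable standard yet
    if r.minutes_to_notice is None or r.minutes_to_notice > max_minutes:
        return "prepared only"    # a failure would surface too slowly
    if not r.undoable:
        return "prepared only"    # the human keeps the trigger
    return "fully delegated"

# Hypothetical rows, for illustration only.
rows = [
    Row("run tests", True, "the test suite is green", 1),
    Row("production deploy", False, "the site responds after release", 5),
    Row("tune readability", True, "", None),
]

for r in rows:
    print(f"{delegation_level(r):16} {r.task}")
```

Checking the criterion first is a design choice: a task with no stateable standard is not ready for any degree of delegation, whatever the other two columns say.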

I'm still redrawing my own lines as I go, but I hope this gives you something concrete to weigh when you're unsure how far to automate. Thanks for reading.
