Stop Letting Antigravity Agents Self-Report 'Done' — Completion Contracts and External Verification

An Antigravity agent reporting 'done' when the work was not actually finished is a failure mode I kept hitting. Moving the completion decision out of the agent and into code fixed it. Here is the contract, a three-layer verifier, and how it holds up under unattended, scheduled runs.

Antigravity²³⁴ agent design⁵ completion verification quality assurance³ unattended ops

✦ Premium Article

I asked Antigravity to move a payment module onto a new SDK. It came back with "Done — build and tests pass." The diff did swap in the new SDK calls, and the unit tests were green. But when I ran the production-like integration suite with retries, one timeout path was still catching the old SDK's exception type, so recovery never fired.

The agent did not lie. Its definition of "done" and mine were simply different from the start. This article is about pinning the definition of completion outside the agent — a "completion contract" — and the automated verification that enforces it, with the code I actually run. As an indie developer running several sites in parallel, I lean on agents heavily; in 2026, with scheduled and unattended runs now routine, whether you have this design directly shapes operating cost.

Why "done" drifts structurally

When you let the agent judge completion, three layers quietly blur together: formal completion (files changed, functions added as instructed), functional completion (it behaves as expected at runtime), and intentional completion (the real goal is met and nothing else broke). When the instruction is vague, the agent stops at the fastest reachable point — formal completion. Build passes, tests go green, the prompt's bullet list is filled in, and that becomes the stopping point.

The real problem is that the stopping point lives inside the agent. No matter how capable the model gets, as long as the yardstick is held by the other party, the mismatch is structural. The fix is simple: take the right to declare completion away from the agent so only an external verifier can emit "done." The agent's job becomes "turn the verifier green," and I am the one who writes the verifier.

Fix the completion contract first

For each task I write what counts as done as a machine-readable contract, before starting, in the same commit. This matters: if you write it after the agent produces a diff, the contract inevitably gets loosened to match that diff.

# tasks/payment-sdk-migration/contract.yaml
task_id: payment-sdk-migration
# only this file holds the right to declare completion
formal:
  - "pnpm lint"
  - "pnpm tsc --noEmit"
  - "! grep -rn 'legacy-pay' src --include='*.ts'"   # no old SDK name left
functional:
  - "pnpm test src/payment"
  - "pnpm test:contract -- --grep 'timeout-recovery'" # the path that broke, now required
intent:
  questions:
    - id: side_effects
      ask: "Which screens outside payment relied on the old SDK types, and in which file did you confirm it"
    - id: error_paths
      ask: "For timeout, 5xx, and network-drop, which test proves recovery fires"
    - id: untouched
      ask: "Did you change any file you should not have? List all of them"
budget:
  max_files_changed: 18        # over this = possible intent drift, needs review
  forbid_paths: ["infra/", "src/auth/"]  # boundaries the agent must not cross

When I hand the contract to the agent at the top of the task, I nail down the stop condition in one line: "This task is done only when scripts/done_check.py returns exit code 0. Even if build and lint pass mid-way, do not report done until done_check passes. Answer each intent question in tasks/<id>/intent.md, one per question." That single sentence fixes what "done" means.

✦

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN

✦How to write a machine-readable completion contract that takes the 'done' decision away from the agent, with meaningful exit codes

✦A three-layer verifier (formal, functional, intent) and intent questions that force the agent to write concrete evidence

✦Running completion checks under scheduled and unattended agents: retry limits, diff audits, and quarantine on failure

Secure payment via Stripe · Cancel anytime

✦

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

Unlock all articles with Membership →

Implement the verifier in three layers

done_check.py evaluates the contract, dropping from the shallow, fast layer down: formal, then functional, then intent. I split exit codes — 0 = done, 2 = formal/functional failure, 3 = intent failure, 4 = budget (change-scope) violation — so downstream automation can branch on the reason.

# scripts/done_check.py
import subprocess, sys, yaml
from pathlib import Path
 
EXIT_OK, EXIT_CHECK, EXIT_INTENT, EXIT_BUDGET = 0, 2, 3, 4
 
def sh(cmd: str) -> int:
    print(f"::group::{cmd}")
    rc = subprocess.run(cmd, shell=True).returncode
    print("::endgroup::")
    return rc
 
def changed_files() -> list[str]:
    out = subprocess.check_output(
        "git diff --name-only origin/main...HEAD", shell=True, text=True)
    return [l for l in out.splitlines() if l.strip()]
 
def main(contract_path: str) -> int:
    c = yaml.safe_load(Path(contract_path).read_text())
    task_dir = Path(contract_path).parent
 
    # check budget first; scope drift is a strong intent-drift signal
    files = changed_files()
    budget = c.get("budget", {})
    if (m := budget.get("max_files_changed")) and len(files) > m:
        print(f"BUDGET: {len(files)} files > {m}")
        return EXIT_BUDGET
    for forbidden in budget.get("forbid_paths", []):
        if any(f.startswith(forbidden) for f in files):
            print(f"BUDGET: touched forbidden path {forbidden}")
            return EXIT_BUDGET
 
    # formal & functional; bail on first failure (do not reach intent)
    for layer in ("formal", "functional"):
        for cmd in c.get(layer, []):
            if sh(cmd) != 0:
                print(f"{layer.upper()} FAILED: {cmd}")
                return EXIT_CHECK
 
    # intent: sieve the answers for existence AND concreteness
    intent = c.get("intent", {})
    answers = task_dir / "intent.md"
    if intent.get("questions"):
        if not answers.exists():
            print("INTENT: intent.md missing")
            return EXIT_INTENT
        text = answers.read_text()
        for q in intent["questions"]:
            qid = q["id"]
            block = extract_answer(text, qid)
            if not block or len(block.strip()) < 40 or block.strip() == "Confirmed":
                print(f"INTENT: answer too thin: {qid}")
                return EXIT_INTENT
            if qid in ("side_effects", "error_paths"):
                if not any(tok in block for tok in (".ts", ".py", "test", "spec")):
                    print(f"INTENT: no concrete reference: {qid}")
                    return EXIT_INTENT
 
    print("DONE: all layers passed")
    return EXIT_OK
 
def extract_answer(md: str, qid: str) -> str:
    # grab the block directly under "## <qid>"
    buf, capture = [], False
    for ln in md.splitlines():
        if ln.startswith("## "):
            capture = qid in ln
            continue
        if capture:
            buf.append(ln)
    return "\n".join(buf)
 
if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))

What carries the weight here is failing the intent layer on the absence of concreteness, not the absence of an answer. Just a length floor (40 chars) and a required reference token (a .ts path or test name) is enough to reject a one-line "Confirmed." You do not need perfect natural-language judgment — the goal is to surface the agent's own contradictions through the act of writing specifics.

Make intent questions ask "how did you confirm it"

The intent questions are mostly decided by their design. First, never let them close on YES/NO. Instead of "is it broken," ask "in which file, in which test, did you confirm it is not broken." Second, always make the last question "did you change any file you should not have." That one is the most effective: when you later see a diff that touches files the agent said it "did not," you send it back even if formal and functional are green. That habit keeps reviews light.

Read the answer file in full, however tedious. More than the correctness of the prose, it honestly reveals how the agent understood the essence of the task. If that understanding is off, the same accident recurs on the next similar task even when every layer is green.

Completion checks under scheduled and unattended runs

Through 2026, scheduled and unattended loops became the default across vendors' agents, Antigravity included. With Managed Agents launched on cron, or isolated-sandbox runs like Gemini's antigravity-preview-05-2026, no human watches each run. This is exactly where the completion contract earns its keep — precisely because nobody is watching, the right to declare done must live in code.

Three things I always add for unattended operation. First, branch on done_check.py's exit code: EXIT_BUDGET and EXIT_INTENT stop auto-merge and divert to a quarantine branch, while only EXIT_CHECK (formal/functional) lets the agent self-correct up to two retries. Second, to stop retry runaway, record consecutive failures per task_id in a state file and cut off at three with an alert only. Third, confirm an unattended run as done in two stages: the verifier green AND a diff audit green.

# scripts/unattended_gate.py (final gate for unattended runs, excerpt)
import subprocess, sys, json
from pathlib import Path
 
STATE = Path(".agent/attempts.json")
 
def load_state():
    return json.loads(STATE.read_text()) if STATE.exists() else {}
 
def main(task_id: str, contract: str) -> int:
    st = load_state()
    rc = subprocess.run(["python", "scripts/done_check.py", contract]).returncode
    if rc == 0:
        # diff audit: catch huge deletions beyond the forbid_paths list
        numstat = subprocess.check_output(
            "git diff --numstat origin/main...HEAD", shell=True, text=True)
        big_del = [l for l in numstat.splitlines()
                   if l and l.split('\t')[1].isdigit() and int(l.split('\t')[1]) > 400]
        if big_del:
            print("AUDIT: deletion over 400 lines, quarantining")
            return 5
        st[task_id] = 0
        STATE.write_text(json.dumps(st)); return 0
 
    st[task_id] = st.get(task_id, 0) + 1
    STATE.write_text(json.dumps(st))
    if rc == 4 or st[task_id] >= 3:       # budget violation or repeated failure
        print(f"QUARANTINE: rc={rc} attempts={st[task_id]}")
        return 9                          # no auto-merge; hand to a human
    print(f"RETRY: rc={rc} attempts={st[task_id]}")
    return rc
 
if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))

The quarantine branch keeps the done_check log and intent.md together. Reading just those the next morning tells me why an unattended run stopped. The scariest thing in unattended operation is not failure itself — it is failure flowing quietly into production. Design the cut-off and the alert first, and you can hand work to an agent overnight without dread.

Scale the weight to the task

Imposing all three layers on every task is overkill. A change of one file or less needs only the formal layer; skip the intent questions. A mid-size, multi-file, single-feature change gets formal plus functional, with the task's tests written first. Only large work — directory-level migrations, cross-component refactors — carries all three layers plus the budget and unattended gates. When a task fails the intent layer three times in a row, I treat that as a sign the granularity is too large and split it.

Saying "don't let the agent declare done" sounds dramatic, but all it amounts to is one rule: never hand the definition of completion to someone else. If you give Antigravity anything today, commit even a one-line contract.yaml for it first. From the moment "turn the verifier green" is a shared goal, operations get quietly easier.

Thanks for reading — I hope this helps in your own pipeline.

Thank You for Reading

Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.