Designing Parallel Agent Changes So You Can Trace Them Later

Antigravity 2.0 became a control tower for many agents. Here is how to build an audit trail that lets you trace who changed what and why, designed from real operational failures.

Antigravity 2.0 shifted from a code editor to a "control tower that supervises many agents at once." Subagents run in parallel, and scheduled runs continue in the background. The throughput genuinely goes up.

For me, the price of that throughput arrived as "I have no idea who did what." I am an indie developer running four sites in parallel, and some mornings I find a change in the repo that I do not recognize. An agent made it, no doubt, but I cannot trace which task it came from or what reasoning led to it. That is the scariest state to be in. I fell into exactly this on a Dolice media property and burned half a day finding the cause.

I should have put a trace-it-later design in place before adding agents. This article works backward from that operational failure to show how to build the audit trail.

You lose the thread because intent is never recorded

What makes agent changes untraceable is not the diff itself. The diff stays in git. What is lost is the intent: why the change was made.

A human developer writes intent into the commit message or the PR description. An agent, left alone, tends to leave hollow messages like "Update files." Look at that commit six months later and you cannot tell what it was for.

So the first step of an audit trail is to force intent to be recorded mechanically. Record intent, not just the diff. That is the starting point.

Stamp tracking metadata into the commit trailer

What I use is a convention that always appends structured metadata to the end of the commit message. The body is the human-facing explanation; the trailer is tags for machines to grep.

Add: publish 3 premium articles (claudelab)

Three themes: CLI migration, OS delegation, audit design. JA+EN.

Agent-Task: claudelab-premium-thu
Agent-Run: 2026-06-13T20:14+09:00
Agent-Intent: premium-content-publish
Agent-Gates: article,templating,frontmatter,redirect=pass

Agent-Task tells you which scheduled task it came from, Agent-Run the run time, Agent-Intent the intent, and Agent-Gates which quality gates it passed. When you later wonder "which task was that change," this alone makes tracing instant.
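A convention like this holds up better when something rejects commits that skip it. The following is a minimal sketch of such a check, not the hook used on these sites: `missing_trailers` is a hypothetical helper, and the list of required keys is passed in by the caller. It scans the whole message with grep, which is an assumption made for simplicity; `git interpret-trailers --parse` would be the stricter way to look only at the trailer block.

```shell
# Print each required trailer key that is absent from a commit message file.
# Usage: missing_trailers <message-file> <key>...
missing_trailers() {
  msg_file=$1
  shift
  for key in "$@"; do
    # A trailer counts as present only if it has a non-empty value.
    grep -q "^${key}: ." "$msg_file" || echo "$key"
  done
}

# Possible use inside .git/hooks/commit-msg, where "$1" is the message file:
#   missing=$(missing_trailers "$1" Agent-Task Agent-Intent Agent-Gates)
#   [ -z "$missing" ] || { echo "Missing trailers: $missing" >&2; exit 1; }
```

With a check like this in place, a hollow "Update files" commit fails at commit time instead of surfacing months later.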

The extraction looks like this.

# List only commits from a specific task, in time order
git log --all --grep="Agent-Task: claudelab-premium-thu" \
  --pretty=format:"%h %ci %s"
 
# Audit for any commit that slipped past the quality gates
git log --all --grep="Agent-Gates:" --pretty=%b \
  | grep "Agent-Gates:" | grep -v "=pass" || echo "All commits passed the gates"

These two greps are my every-morning check. The first tells me which task added what last night; the second confirms no change slipped past a gate.
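One limitation of the gate audit is that it prints only the trailer text, so a failing line does not say which commit it belongs to, and a commit with no Agent-Gates trailer at all never shows up. A possible variation is sketched here, assuming a Git version that supports the `%(trailers:key=...,valueonly,separator=...)` pretty-format options. `audit_gates` is a hypothetical name.

```shell
# Reads lines of the form "<hash><TAB><Agent-Gates value>" on stdin and
# prints one line for every commit that lacks the trailer or did not pass.
audit_gates() {
  awk -F'\t' '
    $2 == ""       { print $1 " missing Agent-Gates"; next }
    $2 !~ /=pass$/ { print $1 " gates not passed: " $2 }
  '
}

# Example pipeline (needs a repository, so it is left as a comment):
#   git log --all \
#     --pretty='tformat:%h%x09%(trailers:key=Agent-Gates,valueonly,separator=%x2C)' \
#     | audit_gates
```

Run against `--all`, this also flags commits made by hand. Adding `--grep="Agent-Task:"` to the `git log` call would narrow the audit to agent commits.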

Separate what humans review from what you auto-approve

Having humans review every change is not realistic under parallel operation. Full automation, on the other hand, invites accidents. I split changes into three tiers and vary the gate strength.

Low risk (auto-approve): typo fixes in existing articles, internal link updates, and anything else with a closed blast radius. These auto-merge as long as the quality gates pass.
Medium risk (post-hoc review): new article publishes and config tweaks. These merge automatically, but a human reviews the listed diffs the next morning.
High risk (pre-approval): adding redirects, deleting articles, and anything touching pricing or auth. Nothing reaches production without human approval.

The key to this sorting is making risk mechanically decidable by the kind of change. Codify it by file path, like "touching the redirect config file means high risk," and neither the agent nor the human wavers.

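# Decide the risk tier automatically from the files changed by the last commit
# (HEAD~1 HEAD compares two commits, so uncommitted work is not mixed in)
changed=$(git diff --name-only HEAD~1 HEAD)
# Any path matching one of these patterns is treated as high risk
if echo "$changed" | grep -qE 'next\.config|pricing|gone-slugs|middleware'; then
  echo "RISK=high: human pre-approval required"
  exit 1   # stop auto-merge for high risk
fi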

I once let article deletion run automatically, and it swept up articles that should not have been removed and dropped them from production. Since then I have pinned deletion and redirects to high risk unconditionally. The cost of the accident far outweighs the cost of review.
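The path check covers the sensitive files but not deletions, and it does not separate medium from low. A rough sketch of a three-tier classifier follows, under stated assumptions: `risk_tier` is a hypothetical function, it reads `git diff --name-status` output from stdin, the high-risk path patterns are the ones from the script above, and the config-file extensions used for the medium tier are illustrative only.

```shell
# Reads "git diff --name-status" lines ("<status><TAB><path>") on stdin
# and prints the highest risk tier found: high, medium, or low.
risk_tier() {
  awk -F'\t' '
    BEGIN { tier = "low" }
    # Any deletion or sensitive path makes the whole change high risk.
    $1 == "D" || $0 ~ /next\.config|pricing|gone-slugs|middleware/ { tier = "high" }
    # New files and config-like files (assumed extensions) get a next-morning review.
    tier == "low" && ($1 == "A" || $NF ~ /\.(json|ya?ml|toml)$/) { tier = "medium" }
    END { print tier }
  '
}

# Example (needs a repository, so it is left as a comment):
#   git diff --name-status HEAD~1 HEAD | risk_tier
```

Because deletion is detected from the status column rather than the path, a removed article is flagged as high risk wherever it lives in the tree.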

A prompt convention that forces a one-line intent

To get metadata recorded, embed the convention into the instruction to the agent itself. I always put this sentence at the end of a task prompt.

"Append three trailer lines, Agent-Task / Agent-Intent / Agent-Gates, to the commit message. Agent-Intent must be 30 characters or fewer and state why this change is needed."

Compressing the "why" into 30 characters is what works. A change whose intent you cannot say in one line is usually a bad change. If the agent stalls trying to write Agent-Intent, that is a signal to reconsider the change itself. It is the same shape as being asked "what is this PR for" in a human code review and having no answer.
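The 30-character limit is easy to check mechanically as well. This is a small sketch, assuming the trailer is written exactly as `Agent-Intent: <text>` on its own line; `intent_ok` is a hypothetical helper name.

```shell
# Succeeds only if the message file has an Agent-Intent trailer whose
# value is non-empty and at most 30 characters long.
intent_ok() {
  intent=$(sed -n 's/^Agent-Intent: *//p' "$1" | head -n 1)
  [ -n "$intent" ] && [ "${#intent}" -le 30 ]
}

# Possible use in a commit-msg hook, where "$1" is the message file:
#   intent_ok "$1" || { echo "Agent-Intent missing or over 30 chars" >&2; exit 1; }
```

A rejected commit then becomes the prompt to ask whether the change can be explained in one line at all.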

Reduce false positives and match the operational cost

Machine-deciding the risk tier produces false positives at first. In my setup, in the first week after rollout about 30% of high-risk flags were actually safe changes. The file-path rules were too coarse.

Deciding "this is too strict, let me loosen it" gets it backwards. What I adopted was logging the false positives and reviewing each one on the weekend, asking whether it truly was high risk. After roughly two weeks the rules got sharper and false positives settled into single digits.
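Tracking that trend needs nothing more than a log and a ratio. The sketch below assumes a hypothetical tab-separated log with one line per high-risk flag, holding the commit hash and a verdict of `fp` (the flag was a false positive) or `tp` (the flag was justified). Both the log format and the name `fp_rate` are illustrative assumptions.

```shell
# Reads "<hash><TAB><verdict>" lines on stdin and prints the share of
# false positives as a whole-number percentage, rounded down.
fp_rate() {
  awk -F'\t' '
    $2 == "fp" { fp++ }
    $2 == "fp" || $2 == "tp" { n++ }
    END { if (n > 0) printf "%d\n", (100 * fp) / n; else print "0" }
  '
}

# Example: fp_rate < flags-this-week.tsv
```

Comparing the number week over week shows whether each round of rule edits is helping.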

This has a quiet cost. But one accident with deletion or redirects can drain weeks into App Store review handling and search-ranking recovery. Weighing the cost of such an accident against the chore of review, I judge that continuing the review is cheaper. Rather than waiting for perfect automated judgment, I recommend tolerating false positives while you sharpen the rules.

Build the trail before you scale up

The lesson of the audit trail, in one line: put the trace-it mechanism in place before you add agents. Reverse the order, and you cannot go back once untraceable changes have piled up.

I am a little wary of the word "control tower." Being able to run many agents at once and being able to answer for everything they do are different abilities. Tools give you the former; only design gives you the latter. If you are running even one agent automatically today, start with the three trailer lines. Tomorrow's you will be able to trace last night's agent.
