Tracing What a Long Agent Run Actually Did: Review That Starts From In-Conversation Search

Have you ever scrolled an agent conversation that ran for hundreds of steps, top to bottom, trying to find where it made a decision? The output looks right, but you cannot trace which decision was made where, so you end up dragging the scrollbar back and forth. For me, that was the most wasteful part of a review.

Antigravity v2.1.4 added cmd/ctrl+F search inside the conversation view. It looks like a small feature, but for reviewing long agent runs it changes the starting point entirely. Instead of reading everything, you jump to the decision points by search and read only those closely. This article lays out a review workflow built around that search, from choosing search terms to reconciling with background-agent logs.

Stop reading the whole thing

Reading a long conversation from the top costs the same time regardless of how important each decision was. What you really want to see in an agent review is not the output itself but "where it set the direction." Once you frame search as the tool for jumping straight to those branch points, how to use it becomes clear.

I split a review into three stages.

Stage	What to inspect	Example search terms
Direction branches	Where the agent narrowed its options	"instead", "for the following reason", "rather than"
External effects	Where a write, run, or send happened	"git push", "rm ", "created"
Uncertainty	Where the agent hesitated or guessed	"probably", "I assume", "likely"

Of these three, inspect "external effects" first. What the agent rewrote and what it executed has already happened, and a review cannot take it back. Judging the quality of its reasoning can wait; first pin down the effects by search.

Search from the words that caused side effects

Concretely, search first for words that correspond to file writes and command execution. When an Antigravity agent uses execution tools, the conversation holds the commands it ran and the paths it created. Picking those up by search gives you a list of side effects in tens of seconds.

The first search terms I type are mostly fixed.

Execution: Running, Bash, Terminal, executed
Writes: Created, Edited, Wrote, updated
Destructive ops: rm , DROP, --force, deleted
Outbound: push, POST, deploy, published

I type the destructive-op terms partly to confirm there are zero hits. Being able to verify "it did nothing" by search is faster and more reliable than eyeballing. When there is a hit, I read only the surrounding lines and judge whether the operation was intended.
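The zero-hit check can also be sketched in shell, assuming you have the conversation or run output saved as a plain-text file. That saved file is an assumption of this sketch, and the function name destructive_hits is illustrative only; the terms are the destructive-op defaults listed above.

```shell
# Sketch: list destructive-op hits in a saved transcript, or state that there are none.
# Assumption: $1 is a plain-text transcript or log that you saved yourself.
destructive_hits() {
  local file="$1" status=0
  # -n: line numbers, -F: fixed strings, -i: ignore case.
  # Each term gets its own -e so that "--force" is not parsed as an option.
  # Matching is by substring, so a word like "confirm " also hits the "rm " term.
  grep -nFi -e 'rm ' -e 'DROP' -e '--force' -e 'deleted' -- "$file" || status=$?
  case "$status" in
    0) return 1 ;;                                    # hits were printed above
    1) echo "zero destructive-op hits in $file" ;;    # grep matched nothing
    *) echo "could not search $file" >&2; return 2 ;; # grep error, not a clean result
  esac
}
```

The function separates "no hits" from "the file could not be read", because only the first one is evidence that the agent did nothing destructive.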

Pick up the traces of hesitation

Once side effects are pinned, the next set is the uncertainty terms. When an agent is not confident, it leaves characteristic phrasing. Searching for "probably", "it seems", "I assume", "likely" surfaces the spots where it proceeded on a guess.

This is where the correctness of the output is ultimately decided. In my experience, a large share of decisions that caused problems later were near these hesitation words. If the agent wrote "the config file is probably here" and moved on, I verify against the real thing whether that guess was right. If it was, no problem; if it missed, everything downstream may be off.

Jump to these words by search and read only one or two steps around each. With this approach, even a conversation of hundreds of steps needs close reading of perhaps 5 to 10 spots. In practice, my review time dropped to a fraction of what reading everything took.
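On a saved plain-text transcript or log, the same "jump and read around it" habit can be approximated with grep context lines. This is a minimal sketch: the function names are illustrative, and two lines of context stand in for "one or two steps" only when each step takes roughly one line.

```shell
# Sketch: show each hesitation hit with two lines of context on either side.
# Assumption: $1 is a saved plain-text transcript or log.
hesitation_context() {
  # -n: line numbers, -i: ignore case, -E: alternation, -C 2: two lines before and after.
  # Matching lines are printed as "N:text", context lines as "N-text".
  grep -niE -C 2 'probably|it seems|i assume|likely' -- "$1"
}

# Sketch: count the lines that contain a hesitation word, as a rough number of spots to read.
hesitation_count() {
  # grep -c prints 0 and exits non-zero when nothing matches; "|| true" hides that status.
  grep -ciE 'probably|it seems|i assume|likely' -- "$1" || true
}
```

If steps span several lines in your logs, widen the -C value rather than reading further by hand.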

Reconcile background agents against logs

In-conversation search is powerful in the chat view, but for an agent that ran unattended in the background, it is faster to look at the logs before opening the conversation. Antigravity's background and scheduled runs leave execution logs, so I apply the same search terms to the log side, form a hypothesis, and then open the conversation.

# Extract only side effects and traces of hesitation from background-agent logs
LOG_DIR="$HOME/.antigravity/agent-runs"
 
# On recent run logs, confirm destructive ops and outbound sends first
grep -rniE "rm -rf|--force|drop table|git push|deploy" "$LOG_DIR" \
  | tail -40
 
# Surface the spots where it proceeded on a guess (a starting point to suspect drift)
grep -rniE "probably|it seems|i assume|maybe|likely" "$LOG_DIR" \
  | wc -l

Get the count and location from the logs, then open the matching conversation and jump to the same words with cmd/ctrl+F. With this two-step approach, even when several background agents run in parallel, you can narrow down which run and which spot to look at first. As an indie developer, when I run wallpaper-app asset generation in the background, I check in this order every time: logs first, then the conversation. If the logs confirm zero destructive ops, I can review the conversation calmly, focused on just the hesitation words.
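To decide which run to open first when several ran in parallel, per-file counts are more useful than one total. The following is a sketch under the assumption that each background run writes its own log file under the directory you pass in; the function name is illustrative.

```shell
# Sketch: rank log files by number of hesitation lines, highest first.
# Assumption: each background run has its own log file somewhere under $1.
rank_runs_by_hesitation() {
  local log_dir="$1"
  # -r: recurse, -H: always print the file name, -c: count matching lines per file.
  # grep prints "path:count"; awk drops zero-count files and emits "count<TAB>path",
  # taking the count from the last ":"-separated field so colons in paths are safe.
  grep -rHciE 'probably|it seems|i assume|maybe|likely' -- "$log_dir" \
    | awk -F: '$NF > 0 { print $NF "\t" substr($0, 1, length($0) - length($NF) - 1) }' \
    | sort -rn
}
```

Calling it with the log directory prints one line per run that contains at least one guess, so the file at the top is the conversation to open first.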

Make the search terms your own defaults

Finally, a note on habit. If you invent search terms on the spot for every review, you will miss things. I recommend deciding on about ten of your own default terms across the three categories: side effects, hesitation, and direction branches. Even when the project or language changes, this skeleton carries over.

The benefit of fixed defaults is that the review becomes reproducible. If anyone can jump to the same branch points with the same words, review quality stops depending on the person. Even as an indie developer working alone, being able to review with the same criteria as my past self is quietly valuable.
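One way to make the defaults literally reusable is to keep them in files, one term per line, and let grep read them. This is only a sketch: the directory layout and the file names side-effects.txt, hesitation.txt, and branches.txt are hypothetical, not something Antigravity defines.

```shell
# Sketch: report how many lines of a transcript or log each term category hits.
# Assumptions: $1 is a directory holding side-effects.txt, hesitation.txt and
# branches.txt (hypothetical names), each with one search term per line;
# $2 is a saved plain-text transcript or log.
review_summary() {
  local terms_dir="$1" target="$2" category count
  for category in side-effects hesitation branches; do
    # -F: fixed strings, -i: ignore case, -c: count matching lines,
    # -f: read the patterns from the category's term file.
    # Keep blank lines out of the term files: an empty pattern matches every line.
    count=$(grep -Fic -f "$terms_dir/$category.txt" -- "$target" || true)
    printf '%s\t%s\n' "$category" "$count"
  done
}
```

Because the terms live in files rather than in memory, the same check gives the same answer next month, which is the reproducibility the fixed defaults are meant to buy.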

Next time you let an agent run long, try searching for one destructive-op word instead of reading from the top. Just knowing there are zero hits settles how you approach the rest of the review. From there, add terms for side effects, hesitation, and branches, and you will no longer lose your place, even in a long conversation.