Tracing Parallel Agents After the Fact: Observability with Structured Logs and Spans
Running multiple agents in parallel on the Antigravity 2.0 desktop makes it impossible to tell which one is doing what. I share an observability design that drops tangled print debugging for run_ids and spans you can trace afterward, with a solo-operator implementation and numbers.
One morning I had six agents running in parallel, and one of them was rewriting the wrong file. The problem was that even reading the logs, I couldn't tell which agent, at which stage, had done it. Output from all six poured into the same console in time order, and I couldn't isolate the one broken process. Triage took 40 minutes, most of it spent just hunting for the log lines worth reading.
The Antigravity 2.0 desktop runs multiple agents in parallel and schedules them in the background. As an indie developer running several apps and sites alone, I get a large productivity boost from that parallelism. But the moment work goes parallel, logs stop being a linear story. Unless you design observability (the ability to trace a run after the fact) up front, parallelism turns directly into untraceability.
Why print debugging breaks under parallelism
While you run one agent at a time, you can follow the story by reading the log top to bottom. Go parallel, and several stories interleave on the same screen. If you can't tell which agent the line you're reading belongs to, the log is noise, not information.
Worse, a failure doesn't necessarily appear at the end of the log. When a dead agent's last line sits next to a healthy agent's line, you misread an unrelated line as the cause. I wasted time on wrong fixes this way more than once.
There is only one way out: every log line must carry, in machine-readable form, which run, which agent, and which stage it belongs to.
Tag every log with run_id and agent_id
Assign one run_id to the whole batch run and one agent_id to each agent, and include both in every log. Treat them not as human-readable text but as keys for mechanical filtering later.
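interface LogContext {
  run_id: string;   // one per batch run
  agent_id: string; // one per agent
  span?: string;    // current phase name
}

// Explicit return type: createLogger refers to itself through child(),
// so TypeScript cannot infer the return type under strict settings.
interface Logger {
  info: (m: string, e?: Record<string, unknown>) => void;
  error: (m: string, e?: Record<string, unknown>) => void;
  child: (agent_id: string) => Logger;
}

function createLogger(ctx: LogContext): Logger {
  const emit = (level: string, msg: string, extra: Record<string, unknown> = {}) => {
    // One JSON per line: filterable by machine, not just grep
    console.log(JSON.stringify({
      ts: new Date().toISOString(),
      level,
      ...ctx,
      msg,
      ...extra,
    }));
  };
  return {
    info: (m: string, e?: Record<string, unknown>) => emit("info", m, e),
    error: (m: string, e?: Record<string, unknown>) => emit("error", m, e),
    // Same run_id, new agent_id: the child stays traceable to its parent run
    child: (agent_id: string) => createLogger({ ...ctx, agent_id }),
  };
}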
With one JSON object per line, you can filter precisely by run_id and agent_id afterward. The six-tangled-streams problem dissolves: filter by agent_id and only that agent's lines remain. Swapping plain-text print for structured JSON logs is the single move that most changes how triage feels.
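To make the filtering step concrete, here is a minimal sketch of reading such a log back. It assumes the records carry the field names emitted by the logger above; `LogRecord`, `parseLogLines`, and `filterByAgent` are hypothetical names introduced for illustration, and the sample lines are made up.

```typescript
// Shape of one parsed log line; field names follow the logger above.
interface LogRecord {
  ts: string;
  level: string;
  run_id: string;
  agent_id: string;
  span?: string;
  msg: string;
  [extra: string]: unknown;
}

// Parse newline-delimited JSON, skipping anything that is not a log record
// (for example, stray plain-text output from a tool).
function parseLogLines(raw: string): LogRecord[] {
  const records: LogRecord[] = [];
  for (const line of raw.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed: unknown = JSON.parse(line);
      if (
        typeof parsed === "object" && parsed !== null &&
        "run_id" in parsed && "agent_id" in parsed
      ) {
        records.push(parsed as LogRecord);
      }
    } catch {
      // Not JSON: ignore the line.
    }
  }
  return records;
}

// Keep only one agent's lines within one run.
function filterByAgent(records: LogRecord[], runId: string, agentId: string): LogRecord[] {
  return records.filter((r) => r.run_id === runId && r.agent_id === agentId);
}

// Made-up sample: two agents interleaved, plus one line of plain-text noise.
const sample = [
  '{"ts":"2025-01-01T00:00:00.000Z","level":"info","run_id":"run-42","agent_id":"writer-1","msg":"start"}',
  '{"ts":"2025-01-01T00:00:00.100Z","level":"info","run_id":"run-42","agent_id":"writer-2","msg":"start"}',
  "plain text noise from a tool",
  '{"ts":"2025-01-01T00:00:01.000Z","level":"error","run_id":"run-42","agent_id":"writer-1","msg":"wrote wrong file"}',
].join("\n");

const writerLines = filterByAgent(parseLogLines(sample), "run-42", "writer-1");
console.log(writerLines.map((r) => r.msg));
```

Skipping unparseable lines rather than failing keeps the filter usable even when some tool still prints plain text into the same stream.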
An agent run splits roughly into three phases: plan, execute, and verify. To measure which phase took the time and where it failed, record the start and end of each phase as a span.
With spans, you see at a glance whether a failure happened in verify or in execute. In my own runs, searching for span_end records with ok: false yields the distribution of failing phases across all agents. If verify failures dominate, I know to revise the verification instructions.
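One way to record spans is a wrapper that emits a record when a phase starts and another when it ends. The sketch below is an assumption about how that could look, not the article's original code: `span_end` and `ok` come from the text, while `span_start`, `duration_ms`, `withSpan`, `failuresByPhase`, and the `Emit` callback are hypothetical names.

```typescript
type Fields = Record<string, unknown>;
type Emit = (record: Fields) => void;

interface SpanContext {
  run_id: string;
  agent_id: string;
}

// Run one phase and emit span_start before it and span_end after it.
// span_end carries ok (true/false) and the elapsed time in milliseconds.
async function withSpan<T>(
  emit: Emit,
  ctx: SpanContext,
  span: string,
  fn: () => Promise<T>,
): Promise<T> {
  const startedAt = Date.now();
  const base = (level: string, msg: string): Fields => ({
    ts: new Date().toISOString(),
    level,
    ...ctx,
    span,
    msg,
  });
  emit(base("info", "span_start"));
  try {
    const result = await fn();
    emit({ ...base("info", "span_end"), ok: true, duration_ms: Date.now() - startedAt });
    return result;
  } catch (err) {
    emit({
      ...base("error", "span_end"),
      ok: false,
      duration_ms: Date.now() - startedAt,
      err: String(err),
    });
    throw err; // the caller still sees the failure
  }
}

// Count failing span_end records per phase across all agents.
function failuresByPhase(records: Fields[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const r of records) {
    if (r.msg === "span_end" && r.ok === false && typeof r.span === "string") {
      counts[r.span] = (counts[r.span] ?? 0) + 1;
    }
  }
  return counts;
}

// Example wiring: print each record as one JSON line.
const printJson: Emit = (record) => console.log(JSON.stringify(record));
```

A call might look like `await withSpan(printJson, { run_id, agent_id }, "verify", () => runVerify())`, where `runVerify` stands in for whatever function performs that phase.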
Trace parent to child with a correlation ID
In Antigravity 2.0 a parent agent often spawns subagents to coordinate parallel work. Being able to trace a subagent's failure back to the parent context that caused it makes root-cause analysis far faster.
To do that, always carry the parent's run_id into the child and give the child its own agent_id. The child() method above does exactly this.
Now you can filter by run_id for a whole run, agent_id for an individual agent, and span for a phase — independently. "Within this run, show me only writer-1's execute" is answered instantly by combining three keys.
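The propagation rule can be sketched in a few lines. This is a simplified stand-in for the logger shown earlier: it writes to an in-memory array instead of the console so the result can be queried, and `createCollectingLogger`, `select`, and the agent names are hypothetical.

```typescript
interface Rec {
  run_id: string;
  agent_id: string;
  span?: string;
  msg: string;
}

interface CollectingLogger {
  info: (msg: string, span?: string) => void;
  child: (agent_id: string) => CollectingLogger;
}

// Simplified logger that appends records to a shared array.
function createCollectingLogger(
  ctx: { run_id: string; agent_id: string },
  sink: Rec[],
): CollectingLogger {
  return {
    info: (msg, span) => {
      const rec: Rec = { ...ctx, msg };
      if (span !== undefined) rec.span = span;
      sink.push(rec);
    },
    // The child inherits run_id and gets its own agent_id.
    child: (agent_id) => createCollectingLogger({ ...ctx, agent_id }, sink),
  };
}

// Filter on any combination of the three keys; omitted keys match everything.
function select(
  records: Rec[],
  keys: { run_id?: string; agent_id?: string; span?: string },
): Rec[] {
  return records.filter(
    (r) =>
      (keys.run_id === undefined || r.run_id === keys.run_id) &&
      (keys.agent_id === undefined || r.agent_id === keys.agent_id) &&
      (keys.span === undefined || r.span === keys.span),
  );
}

const sink: Rec[] = [];
const parent = createCollectingLogger({ run_id: "run-42", agent_id: "orchestrator" }, sink);
parent.info("spawning subagents", "plan");
for (const id of ["writer-1", "writer-2"]) {
  const log = parent.child(id);
  log.info("drafting", "execute");
  log.info("checking output", "verify");
}

// "Within this run, show me only writer-1's execute."
const hits = select(sink, { run_id: "run-42", agent_id: "writer-1", span: "execute" });
console.log(hits);
```

Because every key is optional in `select`, the same function answers whole-run, per-agent, and per-phase questions.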
Make failures reproducible with input snapshots
Even when logs tell you where a failure happened, you can't reproduce why without the input. Parallel agents depend on external state (files, API responses), so the same instruction yields different results under different inputs.
On failure I record the agent's input — task definition, a hash of referenced files, a digest of the final prompt — as one snapshot.
async function snapshotOnFailure(
  logger: ReturnType<typeof createLogger>,
  input: { task: string; files: string[]; promptDigest: string },
  fn: () => Promise<void>,
) {
  try {
    await fn();
  } catch (err) {
    logger.error("failure_snapshot", {
      task: input.task,
      file_count: input.files.length,
      prompt_digest: input.promptDigest, // a digest/hash, not the full text
      err: String(err),
    });
    throw err;
  }
}
The snapshot keeps a digest or hash, not the full prompt — full text invites secrets and bloat. Even so, knowing which input failed lets me reproduce and fix it locally. After adding this, failures that get shelved because they can't be reproduced nearly vanished.
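How the digests are computed is not shown above, so here is one possible sketch using SHA-256 from Node's built-in `node:crypto` module. The choice of SHA-256, the `buildSnapshot` helper, the `file_hashes` field, and passing file contents in as strings are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 of a string.
function sha256Hex(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

interface FailureSnapshot {
  task: string;
  file_hashes: Record<string, string>; // path -> content hash
  prompt_digest: string;
}

// Build a snapshot that identifies the input without storing its full text.
// `files` maps a path to content that the caller has already read.
function buildSnapshot(
  task: string,
  files: Record<string, string>,
  prompt: string,
): FailureSnapshot {
  const file_hashes: Record<string, string> = {};
  for (const [path, content] of Object.entries(files)) {
    file_hashes[path] = sha256Hex(content);
  }
  return { task, file_hashes, prompt_digest: sha256Hex(prompt) };
}

const snapshot = buildSnapshot(
  "rewrite landing page copy",
  { "site/index.md": "# Hello\n" },
  "You are a writer agent. Rewrite the page.",
);
console.log(JSON.stringify(snapshot));
```

To reproduce a failure locally, compare the recorded hashes against the current files: a mismatch means the input has since changed and the run cannot be replayed as it was.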
Dashboards and operating rules
Once structured logs exist, you only need to decide how to look. At the end of every run I emit a one-page summary aggregating span_end by agent.
The first rule is quiet on success, loud only on failure. Only runs with at least one ok: false get the failing phase and agent_id printed in bold at the top of the summary. Your eyes land on the one line worth reading without reading the whole log.
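A summary that follows this rule could look like the sketch below. It assumes span_end records that carry `agent_id`, `span`, `ok`, and `duration_ms`, and it uses ANSI escape codes for bold terminal output; `summarizeRun` and the exact line format are hypothetical.

```typescript
interface SpanEnd {
  agent_id: string;
  span: string;
  ok: boolean;
  duration_ms: number;
}

// ANSI bold for terminal output (an assumption about where the summary is read).
const bold = (text: string): string => `\x1b[1m${text}\x1b[0m`;

function summarizeRun(runId: string, spans: SpanEnd[]): string {
  const failures = spans.filter((s) => !s.ok);

  // Quiet on success: a single line, nothing to scan.
  if (failures.length === 0) {
    return `run ${runId}: all ${spans.length} spans ok`;
  }

  // Aggregate span_end records by agent.
  const perAgent = new Map<string, { spans: number; failed: number; total_ms: number }>();
  for (const s of spans) {
    const agg = perAgent.get(s.agent_id) ?? { spans: 0, failed: 0, total_ms: 0 };
    agg.spans += 1;
    if (!s.ok) agg.failed += 1;
    agg.total_ms += s.duration_ms;
    perAgent.set(s.agent_id, agg);
  }

  // Loud on failure: failing phase and agent_id first, in bold.
  const lines: string[] = failures.map((f) =>
    bold(`FAILED ${f.span} agent_id=${f.agent_id}`),
  );
  lines.push(`run ${runId}: ${failures.length} of ${spans.length} spans failed`);
  perAgent.forEach((agg, agentId) => {
    lines.push(`${agentId}: spans=${agg.spans} failed=${agg.failed} total_ms=${agg.total_ms}`);
  });
  return lines.join("\n");
}
```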
Second, record span durations (ms) across runs and flag runs that stray far from the usual median as anomalies. A run where execute suddenly runs 3x slower usually hides a model rate limit or a degraded external API. You catch latency anomalies early even without a failure.
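The median check can be sketched as follows. The factor of 3 mirrors the "3x slower" example above, while the minimum sample count, the function names, and the sample durations are assumptions for illustration.

```typescript
// Median of a non-empty list of numbers.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]!
    : (sorted[mid - 1]! + sorted[mid]!) / 2;
}

// Flag a span whose duration is at least `factor` times the median of
// earlier runs. With too little history, nothing is flagged.
function isLatencyAnomaly(
  durationMs: number,
  history: number[],
  factor = 3,
  minSamples = 5,
): boolean {
  if (history.length < minSamples) return false;
  return durationMs >= factor * median(history);
}

// Hypothetical execute durations (ms) from earlier runs.
const executeHistory = [1900, 2100, 2000, 2050, 1950];
console.log(isLatencyAnomaly(6200, executeHistory)); // median is 2000, threshold 6000
console.log(isLatencyAnomaly(2500, executeHistory));
```

The median is used rather than the mean so that a single earlier slow run does not raise the baseline and hide the next one.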
Third, stamp the run_id onto the final artifact (commit or post) too. Being able to reverse-look-up "which run produced this post" lets you go straight from a bad artifact to its execution log.
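For commits, one way to stamp the run_id is a trailer line at the end of the commit message. The sketch below assumes a trailer key of `Agent-Run-Id`, which is a made-up name, and the two helper functions are hypothetical.

```typescript
// Hypothetical trailer key; any stable, greppable name would do.
const TRAILER_KEY = "Agent-Run-Id";

// Append the run_id as a trailer, separated from the body by a blank line.
function stampRunId(message: string, runId: string): string {
  return `${message.trimEnd()}\n\n${TRAILER_KEY}: ${runId}\n`;
}

// Reverse lookup: recover the run_id from a stamped commit message.
function extractRunId(message: string): string | undefined {
  const pattern = new RegExp(`^${TRAILER_KEY}: (\\S+)$`, "m");
  return pattern.exec(message)?.[1];
}

const stamped = stampRunId("Fix typo in landing page", "run-42");
console.log(stamped);
console.log(extractRunId(stamped));
```

With the run_id recovered from a bad artifact, the same value filters the structured log down to the run that produced it.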
After these changes, average incident triage time dropped from 40 minutes to 5. It shrank not because I added a clever analysis tool, but because I started emitting logs in a traceable form from the start.
If you're going to delegate something in parallel, building in "traceable after the fact" before you delegate is, in my experience, ultimately the fastest path. Start by replacing print with one-JSON-per-line logs and tagging run_id and agent_id.