Replay-Driven Agent Design — Time-Travel Debugging for Production AI Agents
Reproduce one-off agent failures from production on your laptop. A practical three-layer replay design — event, state, and decision — built on top of Antigravity's Manager Surface, with TypeScript code you can drop into your own stack.
A few months ago, an agent I had embedded in one of our internal products did something strange in the middle of the night. It called a tool it should never have called, repeated that call several times, then returned an empty response and stopped. The logs only kept the result. Re-running the same input produced perfectly normal behavior. I was left with an empty string and a question I could not answer.
That night taught me something I keep coming back to: agent bugs are less like code bugs and more like phenomena. They happen when a particular input, a particular model sample, and a particular tool state happen to line up at the same moment. A stack trace will not save you. If you cannot reproduce it, you cannot fix it.
This article is the design I have slowly assembled across several production agents — a three-layer replay model and concrete TypeScript implementations you can run on top of Antigravity's Manager Surface. I am writing it the way I wish someone had written it for me when I started.
Why Agents Lose Reproducibility So Easily
Conventional web applications are mostly reproducible. Given the same input and the same database state, you get nearly the same output. There are non-deterministic edges — transaction isolation levels, time-dependent logic — but they are bounded. A non-reproducible bug is a rare event.
Agents do not work that way. The same prompt produces slightly different outputs every time, depending on probabilistic model sampling, tool call ordering, external API latency, and the surrounding world state at that moment. I have started thinking of agents as dynamic systems built on top of an absence of reproducibility.
You will hear the advice "set temperature: 0 and you get determinism." In practice, that is only the start. Hosted models do not always return byte-identical output at temperature zero, and even when they do, a tool that returns a different result shifts the next response. With multiple agents, scheduler timing introduces ordering variance. The right move is to give up on perfect determinism and instead make a deliberate design choice about how much reproducibility is enough. That is where replay design begins.
The Three-Layer Model
The mental model I have ended up with separates replay into three distinct layers. Once you separate them, the rest of the design falls into place naturally.
Event Layer: the raw stream of inputs, tool calls, and model responses the agent received and emitted. Think of it as the write-ahead log. With this alone, you can recompute most of the rest.
State Layer: a snapshot of the agent's internal state at a moment in time — memory, context window, tool connection state. It serves as a checkpoint you can validate against by replaying the event log forward.
Decision Layer: the reasoning context behind a specific model decision — full prompt, tool schemas, model parameters (temperature, seed), generated text, and ideally logprobs. The model itself is opaque, so you preserve the inputs and outputs as a paired record.
Separating these three layers gives you a precise debugging tool. When a bug appears, you replay one layer at a time and see which one triggers it. If the event layer alone reproduces the failure, it is deterministic. If you need the decision layer to reproduce it, the failure is rooted in the model's probabilistic behavior. The shape of the bug becomes visible.
Pattern 1: Deterministic Replay
The simplest pattern replays the event log directly. You record every tool call's input and result during the live session; on replay, you skip real tool execution and return the recorded results in order.
```typescript
// agent-replay/recorder.ts
import { writeFileSync, readFileSync } from "node:fs";

export type ReplayEvent =
  | { kind: "user_message"; ts: number; content: string }
  | { kind: "model_call"; ts: number; prompt: string; response: string; seed?: number }
  | { kind: "tool_call"; ts: number; name: string; args: unknown; result: unknown }
  | { kind: "agent_state"; ts: number; snapshot: Record<string, unknown> };

// Plain Omit<> collapses a union to its shared keys, so distribute it over each member.
type WithoutTs<T> = T extends unknown ? Omit<T, "ts"> : never;

export class SessionRecorder {
  private events: ReplayEvent[] = [];
  constructor(private sessionId: string) {}

  // The recorder stamps the timestamp; callers pass everything else.
  record(event: WithoutTs<ReplayEvent>) {
    this.events.push({ ...event, ts: Date.now() } as ReplayEvent);
  }

  flush(path: string) {
    writeFileSync(path, JSON.stringify({ sessionId: this.sessionId, events: this.events }, null, 2));
  }
}

export class ReplayPlayer {
  private events: ReplayEvent[];
  private cursor = 0;

  constructor(path: string) {
    const data = JSON.parse(readFileSync(path, "utf8"));
    this.events = data.events;
  }

  // Return the next recorded tool result for `name`. Real tools are not invoked.
  async replayTool(name: string, args: unknown): Promise<unknown> {
    while (this.cursor < this.events.length) {
      const ev = this.events[this.cursor++];
      if (ev.kind !== "tool_call") continue; // skip non-tool events
      if (ev.name !== name) {
        // Strict on sequence: a different tool at this position is a real divergence.
        throw new Error(`Replay out of sequence: recorded ${ev.name}, requested ${name}`);
      }
      if (JSON.stringify(ev.args) !== JSON.stringify(args)) {
        // Soft-warn: tolerate model-side input drift
        console.warn(`[replay] tool args drift on ${name}`);
      }
      return ev.result;
    }
    throw new Error(`Replay exhausted for tool ${name}`);
  }
}
```
The non-obvious detail is the soft warning on argument drift. My first version raised an error when arguments did not match exactly. In practice, the model nudges its tool inputs slightly between runs even with the same recorded prompts, and strict matching kept stopping replays before they reached the actual bug. Strict on the sequence of tools, lenient on the fine details of arguments — that balance is the difference between a replay layer that survives in production and one that gets ripped out within a week.
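One refinement that may be worth considering: comparing with `JSON.stringify` is sensitive to key order, so `{a, b}` and `{b, a}` would be reported as drift even though nothing changed. The sketch below shows one way to make the comparison key-order-insensitive. The helper names `stableStringify` and `argsDrifted` are hypothetical, and the code assumes tool arguments are JSON-compatible values.

```typescript
// Hypothetical helpers: key-order-insensitive comparison of tool arguments.
// Assumes JSON-compatible values (no Dates, Maps, or class instances).
export function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .filter((k) => obj[k] !== undefined) // mirror JSON.stringify, which drops undefined fields
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  // JSON.stringify returns undefined for `undefined` and functions; treat those as null.
  return JSON.stringify(value) ?? "null";
}

// True when the live arguments differ from the recorded ones in content, not just key order.
export function argsDrifted(recorded: unknown, live: unknown): boolean {
  return stableStringify(recorded) !== stableStringify(live);
}
```

With a helper like this, the soft warning fires only when the content of the arguments differs, which keeps the warning log readable.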
Pattern 2: Causal Replay
Deterministic replay gives you reproducibility, but it cannot explain why events happened in the order they did. As soon as you have parallel tool calls or multiple agents, sequencing itself becomes the bug. Causal replay records the dependency relationship between events explicitly.
Borrowing the happens-before idea behind Lamport timestamps, each event carries a list of "parent event IDs" it depends on. The result is a directed acyclic graph that can be replayed in topological order, regardless of wall-clock ordering.
```typescript
// agent-replay/causal.ts
import { randomUUID } from "node:crypto";

type CausalEvent = {
  id: string;
  causes: string[]; // parent events this one depends on
  agentId: string;
  payload: unknown;
};

export class CausalLog {
  private events: Map<string, CausalEvent> = new Map();

  append(ev: Omit<CausalEvent, "id">): string {
    const id = randomUUID();
    this.events.set(id, { ...ev, id });
    return id;
  }

  // Yield events in causal order. Observation jitter no longer matters.
  *replayCausal(): Generator<CausalEvent> {
    const visited = new Set<string>();
    const order: CausalEvent[] = [];
    // Depth-first post-order: every parent is emitted before its children.
    const visit = (id: string) => {
      if (visited.has(id)) return;
      visited.add(id);
      const ev = this.events.get(id);
      if (!ev) return;
      ev.causes.forEach(visit);
      order.push(ev);
    };
    [...this.events.keys()].forEach(visit);
    yield* order;
  }
}
```
I first reached for causal replay when I started running parallel sub-agents under Antigravity's Manager Surface. Three agents called tools concurrently, and one day the system began producing slightly different results every run. The logs interleaved events in wall-clock order, which told me nothing about which decisions had influenced which. Once I introduced causal edges, the same kind of failure became something I could see — "agent A's response shaped agent B's next choice" — almost like reading a diagram.
It can look like overengineering until you start running multiple agents in parallel. The moment you do, conversation about the bug becomes impossible without it.
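To make the "agent A shaped agent B" reading concrete, here is a small sketch that walks the `causes` edges backwards from one suspicious event and lists everything that fed into it. The `ancestorsOf` helper, the `CausalNode` type, and the sample graph are hypothetical illustrations, not part of the `CausalLog` class above.

```typescript
// Hypothetical helper: collect every event that (transitively) caused `startId`.
type CausalNode = { id: string; causes: string[]; agentId: string };

export function ancestorsOf(events: Map<string, CausalNode>, startId: string): CausalNode[] {
  const seen = new Set<string>();
  const result: CausalNode[] = [];
  const stack = [...(events.get(startId)?.causes ?? [])];
  while (stack.length > 0) {
    const id = stack.pop()!;
    if (seen.has(id)) continue;
    seen.add(id);
    const ev = events.get(id);
    if (!ev) continue; // parent was never recorded; skip rather than fail
    result.push(ev);
    stack.push(...ev.causes);
  }
  return result;
}

// Sample graph: A's response (a1) feeds B's tool call (b1), which feeds b2. C is unrelated.
const graph = new Map<string, CausalNode>([
  ["a1", { id: "a1", causes: [], agentId: "A" }],
  ["b1", { id: "b1", causes: ["a1"], agentId: "B" }],
  ["c1", { id: "c1", causes: [], agentId: "C" }],
  ["b2", { id: "b2", causes: ["b1"], agentId: "B" }],
]);

console.log(ancestorsOf(graph, "b2").map((e) => `${e.agentId}:${e.id}`));
```

Agent C's event never appears in the ancestry of `b2`, which is exactly the information a wall-clock log cannot give you.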
If you have not yet built a structured multi-agent pipeline, I'd suggest reading Multi-Agent Orchestration in Practice first. The orchestration choices you make there determine what the causal graph needs to capture.
Pattern 3: Audit Replay
The third pattern is for compliance and after-the-fact accountability. Audit replay preserves the decision layer at a granularity that can be presented to a third party. Concretely, you persist the full prompt, tool schemas, model parameters, the generated response, and optionally logprobs, in a tamper-evident form.
```typescript
// agent-replay/audit.ts
import { createHash } from "node:crypto";

export type AuditRecord = {
  sessionId: string;
  step: number;
  prompt: string;
  toolSchemas: unknown[];
  modelParams: { temperature: number; seed?: number; topP?: number };
  response: { text: string; toolCalls: unknown[] };
  prevHash: string;
  hash: string;
};

export class AuditChain {
  private records: AuditRecord[] = [];
  private prevHash = "GENESIS";

  append(record: Omit<AuditRecord, "prevHash" | "hash">) {
    const enriched = { ...record, prevHash: this.prevHash };
    const hash = createHash("sha256")
      .update(JSON.stringify(enriched))
      .digest("hex");
    const full: AuditRecord = { ...enriched, hash };
    this.records.push(full);
    this.prevHash = hash;
    return full;
  }

  // Verify integrity by recomputing each hash and following prevHash.
  verify(): boolean {
    let prev = "GENESIS";
    for (const r of this.records) {
      if (r.prevHash !== prev) return false;
      // Hash covers every field except `hash` itself; this relies on stable key order.
      const { hash, ...body } = r;
      const expected = createHash("sha256")
        .update(JSON.stringify(body))
        .digest("hex");
      if (expected !== hash) return false;
      prev = hash;
    }
    return true;
  }
}
```
A hash chain feels heavy at first. I ran with plain JSON logs for a while and assumed that was sufficient. Then a client engagement raised a question I could not technically answer: "Can you prove these decision logs were not edited after the fact?" Plain JSON cannot prove that. A hash chain can.
You do not need audit replay for every agent. But if you intend to operate in domains adjacent to healthcare, finance, or law, putting it in from day one is far easier than retrofitting it under regulatory pressure later.
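To see what "tamper-evident" buys you, here is a compact, self-contained sketch of the same idea applied to records that have been written to storage and read back. The `ChainedRecord` shape and the `chain` and `firstBrokenStep` helpers are simplified, hypothetical stand-ins for the `AuditChain` above; hashing a fixed field order is one possible way to avoid depending on object key order.

```typescript
import { createHash } from "node:crypto";

type ChainedRecord = { step: number; prompt: string; response: string; prevHash: string; hash: string };

// Hash a fixed field order so the digest does not depend on how the object was built.
function hashOf(body: Omit<ChainedRecord, "hash">): string {
  const canonical = JSON.stringify([body.step, body.prompt, body.response, body.prevHash]);
  return createHash("sha256").update(canonical).digest("hex");
}

// Build a chain from plain entries; each record commits to the one before it.
export function chain(entries: { prompt: string; response: string }[]): ChainedRecord[] {
  let prevHash = "GENESIS";
  return entries.map((e, step) => {
    const body = { step, ...e, prevHash };
    const record = { ...body, hash: hashOf(body) };
    prevHash = record.hash;
    return record;
  });
}

// Return the first step whose hash or linkage no longer checks out, or -1 if intact.
export function firstBrokenStep(records: ChainedRecord[]): number {
  let prev = "GENESIS";
  for (const r of records) {
    if (r.prevHash !== prev || hashOf(r) !== r.hash) return r.step;
    prev = r.hash;
  }
  return -1;
}
```

Editing any stored field changes that record's recomputed hash, so `firstBrokenStep` points at the edited step. One caveat: anyone who can rewrite the entire log can also recompute every hash, so the proof only carries weight if the latest hash is also kept somewhere the log's editors cannot reach.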
How Much of This Should You Actually Build?
You do not need all three patterns from day one. The rules of thumb I use, depending on the stage of the product, are:
Solo developer / MVP: deterministic replay only (the event layer). Write JSON logs in production, replay them on your laptop. In my experience this alone reproduces about 90% of bugs.
Team / multiple agents: deterministic replay + causal replay. The moment you ship parallel sub-agents, conversations about bugs cease to be productive without causal edges.
Production / enterprise: all three patterns. Audit replay especially is not something to bolt on after a contract or regulation lands; by then it is almost impossible.
The implementation cost increases roughly in that order, with audit replay (the decision layer) being the heaviest. Storage matters too: dumping full prompts can grow into tens of gigabytes within six months. Pair cheap object storage (S3, R2) with encryption-at-rest and a retention policy, and your future self will thank you.
If you have not designed agent observability yet, the foundation is covered from a different angle in Agent Trace Observability Design. Replay is re-running the past you've already observed; the two layers compose well.
Before / After: What Actually Changes on the Ground
This is from my own team, written down honestly.
Before (no replay layer): We received two or three reports a week that we could not reproduce. Engineers patched code based on their best guess, deployed fixes they were not confident in, and waited to see whether the bug returned. There was no fault to assign, but the psychological cost accumulated quickly.
After (deterministic replay in place): When a report comes in, we pull the recording by session ID and replay it on a developer machine. If it reproduces, we know the cause. If it does not, we can clearly say "this happened outside the recording boundary." The conversation moves from speculation to certainty.
Mean time to resolution dropped from three days to four hours. But the more important shift, for me, was cultural: "explainable fixes" became the default register for our discussions. Agents are a flashy domain. What actually creates calm on the ground is quiet, careful observation and reproducibility — exactly the unfashionable engineering disciplines.
A Minimum Setup You Can Build This Weekend
If everything above sounds heavy, the minimum viable version is genuinely small. You can have it running by Saturday afternoon.
Add a single SessionRecorder file. Record three event kinds: user messages, model responses, tool results.
Flush recordings to replays/{sessionId}.json. No S3, no KV, no observability platform.
When a bug report arrives, run ReplayPlayer against the file. Just see whether you can reproduce it.
Even this minimal layer changes how operations feel. I once tried to introduce a full distributed-tracing stack and stalled out. Coming back to this small recorder is what actually shipped. Do not build a perfect observability platform first; secure your ability to replay the past first.
This pairs naturally with proactive defenses, which I cover in Agent Safety Guardrails Guide. Guardrails protect you ahead of time; replay protects you afterwards. Together they give you a meaningful safety envelope.
Closing Thoughts
Building agents is, in some sense, the continuous practice of designing for things you cannot reproduce. Probabilistic decisions, external state, implicit ordering between tools — try to make all of it deterministic and you remove what makes the agent useful in the first place. Leave it untouched and you cannot meet your own bugs honestly.
Replay design is one of the ways to walk the line between those two extremes. You give up on perfect determinism. You decide carefully where to draw the line. You record everything up to that line with discipline. You accept the rest as probability. To me this looks less like a technical choice and more like a posture for working with AI systems.
If you take one action from this article, drop a SessionRecorder into your agent today. The first time a recording sits on disk, your relationship with agent bugs will change.
Thank you for reading this far. If you ever experiment with replay design and discover something I missed, I would love to hear about it — those notes from the field are how all of us, myself included, keep learning.
Recording Hooks: Where to Tap Into the Agent Loop
Knowing the three-layer model is one thing. Wiring it into a real agent loop without making the code unreadable is another. The placement of recording calls determines whether the layer stays maintainable or quietly rots over six months.
The pattern I have settled on is to wrap the agent's I/O boundaries — model calls and tool calls — rather than instrumenting every internal function. That gives you the highest signal-to-noise ratio. The internal control flow of the agent is rarely what causes bugs; the boundary between the agent and the world almost always is.
Wrap your tools and model calls once at construction time, and leave the rest of the agent code untouched. The agent itself does not know it is being recorded. When you decide to disable recording in a hot path, you swap the wrapper at that single seam.
A small detail worth flagging: I keep a "recording mode" flag on the recorder, not on the wrappers. That way live, record, and replay modes share the exact same call-site shape. If you have to reach into wrapper code to switch modes, the abstraction is leaking and replay drift will eventually creep in.
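A minimal sketch of that seam, under some assumptions: tools are async functions taking a single argument, and only tool calls are wrapped (model calls would follow the same shape). `BoundaryRecorder` and `wrapTools` are hypothetical names used for illustration and are separate from the `SessionRecorder` shown earlier.

```typescript
type Mode = "live" | "record" | "replay";
type ToolFn = (args: unknown) => Promise<unknown>;
type ToolEvent = { name: string; args: unknown; result: unknown };

// Hypothetical recorder that owns the mode flag, so call sites never branch on it.
export class BoundaryRecorder {
  readonly mode: Mode;
  readonly events: ToolEvent[];
  private cursor = 0;

  constructor(mode: Mode, events: ToolEvent[] = []) {
    this.mode = mode;
    this.events = events;
  }

  async around(name: string, args: unknown, run: () => Promise<unknown>): Promise<unknown> {
    if (this.mode === "replay") {
      const ev = this.events[this.cursor++];
      if (!ev || ev.name !== name) {
        throw new Error(`replay expected ${ev?.name ?? "end of log"}, got ${name}`);
      }
      return ev.result; // the real tool is never invoked
    }
    const result = await run();
    if (this.mode === "record") this.events.push({ name, args, result });
    return result;
  }
}

// Wrap every tool once at construction time; the agent only ever sees plain ToolFn values.
export function wrapTools(
  tools: Record<string, ToolFn>,
  recorder: BoundaryRecorder,
): Record<string, ToolFn> {
  const wrapped: Record<string, ToolFn> = {};
  for (const [name, fn] of Object.entries(tools)) {
    wrapped[name] = (args) => recorder.around(name, args, () => fn(args));
  }
  return wrapped;
}
```

Switching from recording to replay then means constructing a different `BoundaryRecorder`; neither the wrappers nor the agent code changes.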
Snapshotting State Without Stopping the World
The state layer is the trickiest of the three to get right in production. You want a consistent snapshot of memory, context, and tool state — but you cannot pause an in-flight agent to take it. The two strategies that have worked for me are checkpoint on idle and copy-on-write snapshots.
Checkpoint-on-idle is the simpler of the two. The agent finishes a turn, returns control to the loop, and right before the next turn begins, you serialize state to disk. This works for chat-like agents where idle moments are clearly bounded.
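As a rough illustration of checkpoint-on-idle, the sketch below runs turns one at a time and writes the state to disk after each turn returns. The function names, the file naming scheme, and the synchronous `turn` signature are all assumptions made to keep the example short; a real agent turn would be async.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical loop: run one turn, checkpoint, then accept the next input.
export function runWithCheckpoints<S>(
  initial: S,
  inputs: string[],
  turn: (state: S, input: string) => S,
  dir: string,
): S {
  mkdirSync(dir, { recursive: true });
  let state = initial;
  inputs.forEach((input, i) => {
    state = turn(state, input);
    // Idle point: the turn has returned and nothing is mutating `state`.
    writeFileSync(join(dir, `checkpoint-${i}.json`), JSON.stringify(state));
  });
  return state;
}

// Restore the state as it was after turn `i`.
export function loadCheckpoint<S>(dir: string, i: number): S {
  return JSON.parse(readFileSync(join(dir, `checkpoint-${i}.json`), "utf8")) as S;
}

// Usage: two turns of a toy agent whose state counts turns and remembers the last input.
const checkpointDir = mkdtempSync(join(tmpdir(), "agent-replay-"));
const finalState = runWithCheckpoints(
  { turns: 0, last: "" },
  ["hello", "goodbye"],
  (s, input) => ({ turns: s.turns + 1, last: input }),
  checkpointDir,
);
```

This assumes the state is JSON-serializable; anything holding live connections or handles would need to be rebuilt on restore.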
Copy-on-write is needed when the agent runs continuously, for example background sub-agents under Manager Surface that never naturally idle. You wrap the state container in a CoW data structure (Immer is one nice option in TypeScript) and snapshot the immutable reference at any point. State mutations happen on the new copy, so the snapshot stays valid.
```typescript
// agent-replay/state-snapshot.ts
import { produce, type Draft } from "immer";

export class AgentState<T> {
  constructor(private state: T) {}

  // Immutable update; returns the previous state reference, which the update leaves untouched.
  update(mutator: (draft: Draft<T>) => void): T {
    const previous = this.state;
    this.state = produce(this.state, mutator);
    return previous;
  }

  // Hand out the current reference; later updates never mutate it.
  snapshot(): T {
    return this.state;
  }
}
```
The principle to internalize: a snapshot is a value, not an event. You should be able to take it without coordinating with anything. The moment snapshotting requires "stopping the world," it stops being taken regularly, and a state layer that is taken inconsistently is worse than no state layer at all.
Replay Boundaries: What You Cannot Replay
Honesty about the limits of replay is, in my experience, the part that earns you the most trust from teammates. There are categories of behavior that no replay system can reproduce. Stating them up front saves arguments later.
External world state is the obvious one. If the agent called a search API at 2am and the search index has updated since then, replaying the same call returns different results. Your replay layer must stub the recorded result rather than re-querying the live API.
Time-dependent decisions are the second. If the agent's prompt referenced "today's date," replaying it tomorrow with the live clock will produce a different downstream result. Inject a clock abstraction and record the timestamp the agent observed.
Random sources outside the model are the third — UUIDs the agent generated, retry jitter, randomized scheduling. Wrap each random source and record the values it returned during the live session. On replay, return the recorded values.
The pattern across all three is the same: anything the agent observes from outside its own deterministic logic must be captured at the boundary, not recomputed during replay. Once you internalize this, the design generalizes naturally.
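The clock and random-source cases can share one small boundary object. The sketch below records every value the agent observed during the live run and hands the same values back on replay. `ObservationTape` and its method names are hypothetical; the only real API used is `randomUUID` from `node:crypto`.

```typescript
import { randomUUID } from "node:crypto";

type Observed = { source: "clock" | "uuid"; value: number | string };

// Hypothetical boundary for non-model nondeterminism: record when live, return recorded on replay.
export class ObservationTape {
  readonly tape: Observed[];
  private readonly replaying: boolean;
  private cursor = 0;

  constructor(replaying: boolean, tape: Observed[] = []) {
    this.replaying = replaying;
    this.tape = tape;
  }

  private observe<T extends number | string>(source: Observed["source"], live: () => T): T {
    if (this.replaying) {
      const entry = this.tape[this.cursor++];
      if (!entry || entry.source !== source) {
        throw new Error(`replay tape out of sync at ${source}`);
      }
      return entry.value as T;
    }
    const value = live();
    this.tape.push({ source, value });
    return value;
  }

  // The agent calls these instead of Date.now() and randomUUID() directly.
  now(): number {
    return this.observe("clock", () => Date.now());
  }

  uuid(): string {
    return this.observe("uuid", () => randomUUID());
  }
}
```

The agent takes an `ObservationTape` as a dependency. In production it is constructed with `replaying` set to false and the tape is saved alongside the recording; on a developer machine it is constructed from the saved tape.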
Storage Strategy: Hot vs. Cold Recordings
Recordings can grow surprisingly fast. A modest agent making twenty tool calls per session, with prompts of a few thousand tokens each, can generate a megabyte of recording per session. At a thousand sessions per day, that is a gigabyte daily and roughly a third of a terabyte annually. Untiered, this becomes expensive.
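The arithmetic is easy to redo with your own numbers. This is a back-of-envelope helper, not a measurement; `estimateStorage` is a hypothetical name, and the inputs below are simply the figures from the paragraph above.

```typescript
// Back-of-envelope estimate; replace every input with your own measurements.
export function estimateStorage(opts: {
  bytesPerSession: number;
  sessionsPerDay: number;
}): { perDayGB: number; perYearGB: number } {
  const perDayGB = (opts.bytesPerSession * opts.sessionsPerDay) / 1e9;
  return { perDayGB, perYearGB: perDayGB * 365 };
}

// One megabyte per session at a thousand sessions per day.
const estimate = estimateStorage({ bytesPerSession: 1_000_000, sessionsPerDay: 1_000 });
console.log(estimate); // { perDayGB: 1, perYearGB: 365 }
```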
The storage tier I use in production has three layers:
Hot tier (last 7 days): kept on a fast object store (Cloudflare R2, or S3 in its standard class rather than an infrequent-access class). This is what you reach for when a fresh bug report comes in.
Warm tier (7-90 days): moved to infrequent-access storage classes. Storage cost drops, often by close to half, in exchange for retrieval fees and, on some providers, minimum storage durations. Most replays still happen here.
Cold tier (90 days+): pushed to Glacier-class storage with retention rules. You touch this only for compliance investigations or post-incident analysis older than three months.
Move recordings between tiers using lifecycle rules rather than ad-hoc scripts. Lifecycle rules are declarative and survive personnel changes; ad-hoc scripts always rot. The audit chain's prevHash linkage continues to work across tiers because each record carries its own hash, independent of where the bytes physically live.
Privacy: Recordings Contain Everything Sensitive
A replay layer is, by definition, a database of every prompt your users have ever sent and every tool result your agent has ever seen. That includes whatever sensitive content happens to flow through. Treating this with the same care as a primary user database is non-negotiable.
The minimums I apply to any replay store:
Encryption at rest, using customer-managed keys when the platform supports it.
Access logging on the replay store itself. Every read of a recording leaves a trace.
Redaction at recording time for known sensitive patterns (emails, phone numbers, credit card numbers). Better to lose a little debugging fidelity than to retain liability.
Right-to-delete propagation. When a user deletes their data, recordings tied to their session ID must be deleted too. This is harder than it looks if recordings are scattered across tiers.
The first time you treat the replay store as "just a debug log," some part of it will end up in a screenshot, a Slack message, or a public bug report. This is not a hypothetical risk; I have seen all three happen. Build the privacy boundary first, and treat the debugging utility as a secondary feature.
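For redaction at recording time, a sketch along these lines can sit in front of the recorder. The regular expressions are illustrative assumptions and deliberately greedy; they are not a vetted PII detector, and `redactText` and `redactDeep` are hypothetical helper names.

```typescript
// Hypothetical patterns; tune and test them against your own data before relying on them.
// Order matters: card numbers are checked before the looser phone pattern.
const PATTERNS: Array<[string, RegExp]> = [
  ["EMAIL", /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g],
  ["CARD", /\b(?:\d[ -]?){13,16}\b/g],
  ["PHONE", /\+?\d[\d\s().-]{7,}\d/g],
];

// Replace every match in a single string with a labeled placeholder.
export function redactText(text: string): string {
  return PATTERNS.reduce(
    (acc, [label, regex]) => acc.replace(regex, `[REDACTED_${label}]`),
    text,
  );
}

// Walk any JSON-like value and redact every string it contains.
export function redactDeep(value: unknown): unknown {
  if (typeof value === "string") return redactText(value);
  if (Array.isArray(value)) return value.map(redactDeep);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [k, redactDeep(v)]),
    );
  }
  return value; // numbers, booleans, and null pass through unchanged
}
```

Calling `redactDeep` on each event before it is stored keeps the redaction in one place. Expect false positives: long digit runs inside strings, such as order IDs or timestamps, will be swallowed too, which is the fidelity trade-off mentioned above.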
Replaying Across Model Versions
Models change. The provider rolls out a new minor version, deprecates an old one, or silently swaps weights underneath your endpoint. A recording captured against version A may produce a wildly different response when replayed against version B, even with the same prompt and parameters.
The discipline I have started practicing: record the model version string and provider response headers as part of the decision layer, then refuse to "live-replay" against a different model version. The replay player should detect the mismatch and require an explicit --cross-version flag from the engineer, with a warning that drift is now possible.
For deterministic replay (event layer), this matters less because tool results are stubbed regardless of which model is asked. But for any analysis that involves re-running the model, version pinning is the only honest move.
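The version check itself can be very small. In this sketch, `DecisionMeta`, `assertSameModel`, and the `crossVersion` option are hypothetical names; the option is assumed to be set by whatever parses the `--cross-version` flag in your replay CLI.

```typescript
type DecisionMeta = { modelVersion: string };

export class ModelVersionMismatch extends Error {}

// Hypothetical guard: refuse live model replay across versions unless explicitly allowed.
export function assertSameModel(
  recorded: DecisionMeta,
  current: DecisionMeta,
  opts: { crossVersion?: boolean } = {},
): void {
  if (recorded.modelVersion === current.modelVersion) return;
  const detail = `recorded on ${recorded.modelVersion}, replaying on ${current.modelVersion}`;
  if (!opts.crossVersion) {
    throw new ModelVersionMismatch(`${detail}; pass --cross-version to continue`);
  }
  // The engineer opted in, so proceed but leave a visible trace.
  console.warn(`[replay] ${detail}; drift is now possible`);
}
```

Calling this once at the start of any replay that re-runs the model makes a cross-version comparison a deliberate choice rather than an accident.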
I learned this the slow way. We had a recording from before a model update, replayed it after the update to investigate something else, and concluded incorrectly that a tool was misbehaving — when in fact it was the new model version producing a different response. A morning I wish I had back.
When Replay Becomes a Test Suite
A surprising side effect of building a replay layer: your accumulated recordings start functioning as a regression test suite. You curate a small set of recordings that capture canonical behaviors and edge cases. Each time you change the agent's code, prompt, or tools, you replay that set and check for divergence.
The naive form is "compare final outputs." That breaks too easily: even minor model variation produces different surface text. The form that holds up is to compare the sequence of tool calls at the event layer. The text varies, but the structural decisions of the agent ("which tools, in which order, with which argument names") stay stable enough to be a meaningful diff.
```typescript
// agent-replay/regression.ts
import type { ReplayEvent } from "./recorder";

type ToolCallEvent = Extract<ReplayEvent, { kind: "tool_call" }>;

// Signature = tool name plus sorted argument keys; argument values are ignored on purpose.
export function toolCallTrace(events: ReplayEvent[]): string[] {
  return events
    .filter((e): e is ToolCallEvent => e.kind === "tool_call")
    .map((e) => {
      const keys = typeof e.args === "object" && e.args !== null ? Object.keys(e.args).sort() : [];
      return `${e.name}(${keys.join(",")})`;
    });
}

// Compare two recordings step by step and report the first divergence.
export function compareTraces(a: ReplayEvent[], b: ReplayEvent[]): { ok: boolean; diff: string } {
  const ta = toolCallTrace(a);
  const tb = toolCallTrace(b);
  if (ta.length !== tb.length) return { ok: false, diff: `length ${ta.length} vs ${tb.length}` };
  for (let i = 0; i < ta.length; i++) {
    if (ta[i] !== tb[i]) return { ok: false, diff: `step ${i}: ${ta[i]} vs ${tb[i]}` };
  }
  return { ok: true, diff: "" };
}
```
Compare the signatures of tool calls, not the literal arguments. This is forgiving enough to survive the model's natural variation, yet strict enough to catch a regression where the agent stops calling a critical tool. Run the comparison in CI and fail the build on divergence.
This is the moment when a replay layer stops being purely a debugging tool and starts paying compounding dividends. You are no longer just recovering from past bugs; you are actively preventing future ones.
Closing the Loop With Failure-History Learning
The last layer worth mentioning is what to do with the failed recordings. Once you have a stockpile of bugs that have been reproduced and fixed, that stockpile is not just historical; it is test data for your evaluation pipeline.
Tag each fixed recording with a short description of what went wrong and what the correct behavior would have been. When you change a prompt, a tool, or the agent's structure, run those tagged recordings through the new agent and check whether it would have produced the correct behavior this time. The replay layer becomes the foundation for a learning loop that flags regressions before they reach production.
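One way to sketch that loop: store each fixed bug with its tag and the tool sequence the corrected agent should produce, then re-run the set whenever something changes. Everything here is hypothetical, including the `TaggedRecording` shape, the synchronous `RunAgent` signature, and the sample tool names.

```typescript
type TaggedRecording = {
  id: string;
  input: string;
  whatWentWrong: string;
  expectedTools: string[]; // tool names the corrected agent should call, in order
};

// Hypothetical agent entry point: takes the recorded input, returns the tool names it called.
type RunAgent = (input: string) => string[];

// Re-run every fixed bug and return the ones where the agent no longer behaves correctly.
export function findRegressions(recordings: TaggedRecording[], runAgent: RunAgent) {
  return recordings
    .map((rec) => ({ rec, got: runAgent(rec.input) }))
    .filter(({ rec, got }) => JSON.stringify(got) !== JSON.stringify(rec.expectedTools))
    .map(({ rec, got }) => ({ id: rec.id, whatWentWrong: rec.whatWentWrong, got }));
}

// Sample stockpile with a single fixed bug.
const fixedBugs: TaggedRecording[] = [
  {
    id: "sess-001",
    input: "refund order 42",
    whatWentWrong: "called delete_order repeatedly instead of issuing a refund",
    expectedTools: ["lookup_order", "issue_refund"],
  },
];
```

An empty result means every previously fixed bug still behaves as tagged; a non-empty result tells you which old failure the latest change has reopened.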
I treat this as the natural endpoint of replay design — the place where reproducing the past becomes preventing the future. Most teams never reach it because they never built the recording layer in the first place. If you build it well, the loop closes itself.