ANTIGRAVITY LAB
Agents & Manager / 2026-05-07 / Advanced

Replay-Driven Agent Design — Time-Travel Debugging for Production AI Agents

Reproduce one-off agent failures from production on your laptop. A practical three-layer replay design — event, state, and decision — built on top of Antigravity's Manager Surface, with TypeScript code you can drop into your own stack.



A few months ago, an agent I had embedded in one of our internal products did something strange in the middle of the night. It called a tool it should never have called, repeated that call several times, then returned an empty response and stopped. The logs only kept the result. Re-running the same input produced perfectly normal behavior. I was left with an empty string and a question I could not answer.

That night taught me something I keep coming back to: agent bugs are less like code bugs and more like phenomena. They happen when a particular input, a particular model sample, and a particular tool state happen to line up at the same moment. A stack trace will not save you. If you cannot reproduce it, you cannot fix it.

This article is the design I have slowly assembled across several production agents — a three-layer replay model and concrete TypeScript implementations you can run on top of Antigravity's Manager Surface. I am writing it the way I wish someone had written it for me when I started.

Why Agents Lose Reproducibility So Easily

Conventional web applications are mostly reproducible. Given the same input and the same database state, you get nearly the same output. There are non-deterministic edges — transaction isolation levels, time-dependent logic — but they are bounded. A non-reproducible bug is a rare event.

Agents do not work that way. The same prompt can produce a slightly different output on every run, depending on probabilistic model sampling, tool call ordering, external API latency, and the state of the surrounding world at that moment. I have started thinking of agents as dynamic systems in which non-reproducibility is the default.

You will hear the advice "set temperature: 0 and you get determinism." In practice, that is only the start. Even with temperature zero, if a tool returns a different result, the next response shifts. With multiple agents, scheduler timing introduces ordering variance. The right move is to give up on perfect determinism and instead make a deliberate design choice about how much reproducibility is enough. That is where replay design begins.
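One common way to bound the tool-side variance described above is to record every tool result during the live run and feed the recorded result back during replay. The following is a minimal sketch of that idea, not the implementation from this article: `Tool`, `ToolTape`, and `withTape` are hypothetical names, and tools are modeled as synchronous functions for brevity, whereas real tools are usually asynchronous.

```typescript
// Hypothetical names throughout; this is an illustrative sketch only.
type Tool = (args: Record<string, unknown>) => unknown;
type Mode = "record" | "replay";

interface TapeEntry {
  tool: string;
  args: string; // JSON-serialized arguments, used to detect divergence on replay
  result: unknown;
}

class ToolTape {
  readonly entries: TapeEntry[];
  private cursor = 0;

  constructor(entries: TapeEntry[] = []) {
    this.entries = entries;
  }

  append(entry: TapeEntry): void {
    this.entries.push(entry);
  }

  // Return the next recorded entry, failing loudly if the replayed run diverges.
  next(tool: string, args: string): TapeEntry {
    const entry = this.entries[this.cursor];
    if (!entry) {
      throw new Error(`tape exhausted at call ${this.cursor} (${tool})`);
    }
    if (entry.tool !== tool || entry.args !== args) {
      throw new Error(
        `divergence at call ${this.cursor}: recorded ${entry.tool}(${entry.args}), got ${tool}(${args})`
      );
    }
    this.cursor += 1;
    return entry;
  }
}

// Wrap a tool so it either records its real result or returns the recorded one.
function withTape(name: string, tool: Tool, tape: ToolTape, mode: Mode): Tool {
  return (args) => {
    const key = JSON.stringify(args); // assumes stable key order in args
    if (mode === "replay") return tape.next(name, key).result;
    const result = tool(args);
    tape.append({ tool: name, args: key, result });
    return result;
  };
}

// Usage: the world changes between the live run and the replay.
let stock = 3;
const checkStock: Tool = () => stock;

const tape = new ToolTape();
const recorded = withTape("checkStock", checkStock, tape, "record")({ sku: "A-1" });

stock = 0; // the real tool would now answer differently
const replayTape = new ToolTape(tape.entries);
const replayed = withTape("checkStock", checkStock, replayTape, "replay")({ sku: "A-1" });
```

In this sketch the replayed call returns the recorded value even though the underlying tool would now return something else. A call that does not match the tape throws instead of silently continuing, because in a replay a divergence is itself the information you are looking for.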

The Three-Layer Model

The mental model I have ended up with separates replay into three distinct layers. Once you separate them, the rest of the design falls into place naturally.

  • Event Layer: the raw stream of inputs, tool calls, and model responses the agent received and emitted. Think of it as the write-ahead log. With this alone, you can recompute most of the rest.
  • State Layer: a snapshot of the agent's internal state at a moment in time, covering memory, context window, and tool connection state. It serves as a checkpoint: replay the event log forward to that point and validate the recomputed state against the snapshot.
  • Decision Layer: the reasoning context behind a specific model decision — full prompt, tool schemas, model parameters (temperature, seed), generated text, and ideally logprobs. The model itself is opaque, so you preserve the inputs and outputs as a paired record.
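The three layers can be sketched as plain TypeScript types, together with a fold that recomputes state from the event log and checks it against a snapshot. Every type and field name below is an illustrative assumption (this is not an Antigravity API), and the state shape is deliberately simplified to a memory list and a tool call counter.

```typescript
// Event layer: the append-only log of what the agent received and emitted.
type AgentEvent =
  | { seq: number; kind: "input"; text: string }
  | { seq: number; kind: "tool_call"; tool: string; args: unknown }
  | { seq: number; kind: "tool_result"; tool: string; result: unknown }
  | { seq: number; kind: "model_response"; text: string };

// State layer: a checkpoint of (simplified) internal state.
interface StateSnapshot {
  atSeq: number; // seq of the last event folded into this snapshot
  memory: string[]; // stand-in for memory / context window contents
  toolCallCount: number;
}

// Decision layer: paired inputs and outputs of one model call.
interface DecisionRecord {
  atSeq: number;
  prompt: string;
  toolSchemas: unknown[];
  params: { temperature: number; seed?: number };
  output: string;
  logprobs?: number[];
}

const emptyState: StateSnapshot = { atSeq: 0, memory: [], toolCallCount: 0 };

// Pure reducer: the same events always produce the same state.
function applyEvent(state: StateSnapshot, ev: AgentEvent): StateSnapshot {
  switch (ev.kind) {
    case "input":
    case "model_response":
      return { ...state, atSeq: ev.seq, memory: [...state.memory, ev.text] };
    case "tool_call":
      return { ...state, atSeq: ev.seq, toolCallCount: state.toolCallCount + 1 };
    case "tool_result":
      return { ...state, atSeq: ev.seq };
  }
}

// Recompute state by replaying the log forward up to and including `seq`.
function replayTo(events: AgentEvent[], seq: number): StateSnapshot {
  return events.filter((e) => e.seq <= seq).reduce(applyEvent, emptyState);
}

// Validate a stored snapshot against the state recomputed from the log.
function matchesSnapshot(events: AgentEvent[], snap: StateSnapshot): boolean {
  const replayedState = replayTo(events, snap.atSeq);
  return (
    replayedState.atSeq === snap.atSeq &&
    replayedState.toolCallCount === snap.toolCallCount &&
    replayedState.memory.length === snap.memory.length &&
    replayedState.memory.every((m, i) => m === snap.memory[i])
  );
}
```

A snapshot that fails `matchesSnapshot` suggests that something changed the state without going through the event log, which is usually worth investigating before any model-level debugging.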

Separating these three layers gives you a precise debugging tool. When a bug appears, you replay one layer at a time and see which one triggers it. If the event layer alone reproduces the failure, the failure is deterministic. If you need the decision layer to reproduce it, the failure is rooted in the model's probabilistic behavior. The shape of the bug becomes visible.
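The layer-by-layer procedure can be sketched as a small triage function. This is a sketch under assumptions: `Layer`, `ReplayAttempt`, and `triage` are hypothetical names, each attempt is assumed to replay the session with that layer added on top of the previous ones, and the diagnosis for the state layer is my own reading rather than something stated above.

```typescript
type Layer = "event" | "state" | "decision";

// Each attempt replays the session with that layer (and the ones before it)
// pinned to the recording, and returns true if the failure reproduced.
type ReplayAttempt = () => boolean;

interface TriageResult {
  reproducedAt: Layer | null;
  diagnosis: string;
}

const layerOrder: Layer[] = ["event", "state", "decision"];

const diagnoses: Record<Layer, string> = {
  event: "deterministic: the recorded events alone reproduce the failure",
  // Assumption: not stated in the article, added to complete the sketch.
  state: "state-dependent: the failure needs the restored snapshot as well",
  decision: "model-dependent: the failure needs the recorded model outputs pinned",
};

// Try the cheapest layer first and stop at the first one that reproduces.
function triage(attempts: Record<Layer, ReplayAttempt>): TriageResult {
  for (const layer of layerOrder) {
    if (attempts[layer]()) {
      return { reproducedAt: layer, diagnosis: diagnoses[layer] };
    }
  }
  return {
    reproducedAt: null,
    diagnosis: "not reproduced: something relevant was probably never recorded",
  };
}

// Usage with stubbed attempts: only pinning the model outputs reproduces the bug.
const result = triage({
  event: () => false,
  state: () => false,
  decision: () => true,
});
```

Stopping at the first layer that reproduces keeps the diagnosis as narrow as possible: a failure that already reproduces from events alone does not need the model's recorded outputs to explain it.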

