An Observability Blueprint for Antigravity Agents in Production

The first wall you hit running AI agents in production is "I have no idea what's actually happening." With traditional web apps, request latency and error rate told you most of the story. AI agents are different. The HTTP response is 200, but the answer is wrong, the agent called the wrong tool, or the API bill is silently exploding.

I hit this wall the moment I started running Antigravity agents in production. This article presents the observability framework I built from that experience, in a form you can translate directly to your own stack.

Why AI Agent Observability Is a Different Beast

Traditional observability rests on Three Pillars: metrics, logs, traces. Combine all three and you can reconstruct system state. That's been the gospel for over a decade.

For AI agents, the three pillars aren't enough. There are three things specific to agents you must observe.

First, thought: the reasoning the agent did before acting, which shows up as the LLM's chain-of-thought output. Second, action: which tool was called with which arguments; this is the only layer that produces side effects. Third, outcome: whether the final result matched the user's intent, which is often measured via human feedback or A/B tests.

Conflating these into a single log stream paints you into a corner during root-cause analysis. In Antigravity, where multiple agents coordinate, getting the granularity right from day one is critical.

Pillar One: Structured Traces to Reconstruct Thought

To trace agent behavior after the fact, you need structured execution history. Borrow OpenTelemetry's span model and decompose agent runs like this:

// Trace structure example
{
  spanName: "agent.run",
  attributes: { agent_id: "support-bot-v3", session_id: "..." },
  children: [
    {
      spanName: "agent.thought",
      attributes: { reasoning: "User is requesting a refund..." },
      children: [
        {
          spanName: "tool.call",
          attributes: { tool: "lookup_order", args: { orderId: "..." } },
          duration_ms: 245,
        },
        {
          spanName: "tool.call",
          attributes: { tool: "issue_refund", args: { amount: 1500 } },
          duration_ms: 1820,
        },
      ],
    },
    {
      spanName: "agent.response",
      attributes: { tokens: 312, model: "claude-sonnet-4-6" },
    },
  ],
}

This hierarchical structure lets you answer "why did this agent issue a refund?" after the fact. The crucial part is separating thought (reasoning) from action (tool.call) into distinct spans. Now you can isolate whether the reasoning was wrong or whether the tool invocation was wrong.
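
To make the shape concrete, here is a minimal sketch of how such a span tree could be recorded in plain Python. The Tracer class, its span() context manager, and find_spans() are hypothetical stand-ins written for illustration; they are not Antigravity's or OpenTelemetry's API.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class Span:
    """One node in the trace tree (a simplified stand-in for a real span)."""
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)
    duration_ms: float = 0.0


class Tracer:
    """Hypothetical in-memory tracer, for illustration only."""

    def __init__(self) -> None:
        self.roots: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, **attributes: Any) -> Iterator[Span]:
        node = Span(name, dict(attributes))
        # Attach to the currently open span, or start a new root.
        parent_list = self._stack[-1].children if self._stack else self.roots
        parent_list.append(node)
        self._stack.append(node)
        start = time.perf_counter()
        try:
            yield node
        finally:
            node.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()


def find_spans(span: Span, name: str) -> list[Span]:
    """Collect every span with the given name, depth-first."""
    found = [span] if span.name == name else []
    for child in span.children:
        found.extend(find_spans(child, name))
    return found


tracer = Tracer()
with tracer.span("agent.run", agent_id="support-bot-v3"):
    with tracer.span("agent.thought", reasoning="User is requesting a refund..."):
        with tracer.span("tool.call", tool="lookup_order"):
            pass  # the real tool invocation would go here
        with tracer.span("tool.call", tool="issue_refund", amount=1500):
            pass
    with tracer.span("agent.response", tokens=312):
        pass

run = tracer.roots[0]
tools_called = [s.attributes["tool"] for s in find_spans(run, "tool.call")]
```

Because reasoning and tool calls live in separate spans, a query like find_spans(run, "tool.call") pulls out the actions without touching the reasoning text, and the reverse works the same way.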

In Antigravity, decorators like @trace.span() instrument agents with minimal code changes. For the first two weeks in production, sample at 100% so you capture the typical behavior patterns.

Pillar Two: Action Metrics for Early Anomaly Detection

Traces excel at detail; metrics excel at trends. You need both.

Five agent-specific metrics I treat as non-negotiable:

agent_invocations_total:
  type: counter
  labels: [agent_id, outcome]   # success, partial, error
 
agent_tool_calls_total:
  type: counter
  labels: [agent_id, tool_name, status]
 
agent_tokens_consumed:
  type: histogram
  labels: [agent_id, model, direction]  # input or output
 
agent_response_latency_seconds:
  type: histogram
  labels: [agent_id, model]
 
agent_cost_usd:
  type: counter
  labels: [agent_id, model]

agent_cost_usd is the lifeline. When agents start firing off unexpected tool calls, costs spike fast. Antigravity's execution history makes real-time cost aggregation possible.
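
As a rough sketch of what feeds that counter, the snippet below turns token counts into a running cost per (agent_id, model) pair. The model name and the per-million-token prices are placeholders I made up for the example; substitute your provider's actual rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; replace with real rates.
PRICE_PER_MILLION = {
    ("example-model", "input"): 3.00,
    ("example-model", "output"): 15.00,
}


def add_usage(totals, agent_id, model, direction, tokens):
    """Add the cost of one LLM call to the running (agent_id, model) total."""
    rate = PRICE_PER_MILLION[(model, direction)]
    totals[(agent_id, model)] += tokens * rate / 1_000_000
    return totals


cost_usd = defaultdict(float)
add_usage(cost_usd, "support-bot-v3", "example-model", "input", 2_000)
add_usage(cost_usd, "support-bot-v3", "example-model", "output", 312)
```

Keeping the key aligned with the metric's labels (agent_id, model) means the same totals can be exported as agent_cost_usd without reshaping.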

Visualize these in Prometheus + Grafana (or equivalent), and wire up at minimum these alerts:

Per-agent error rate doubling week-over-week
Tokens-per-session exceeding 3x the median
Hourly cost crossing budget, ideally with auto-shutdown of misbehaving agents
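
The three alert conditions can be sketched as plain predicates before you commit them to an alerting system. The function names, sample numbers, and the choice to skip the error-rate check when last week's rate is zero are assumptions made for this example.

```python
from statistics import median


def error_rate_doubled(this_week: float, last_week: float) -> bool:
    """True when the error rate is at least twice last week's.

    Returns False when last week's rate was zero, since there is no baseline.
    """
    return last_week > 0 and this_week >= 2 * last_week


def token_outliers(tokens_by_session: dict[str, int], factor: float = 3.0) -> list[str]:
    """Session IDs whose token usage exceeds `factor` times the median."""
    if not tokens_by_session:
        return []
    threshold = factor * median(tokens_by_session.values())
    return sorted(sid for sid, used in tokens_by_session.items() if used > threshold)


def over_budget(hourly_cost_usd: float, hourly_budget_usd: float) -> bool:
    """True when spend in the current hour has crossed the budget."""
    return hourly_cost_usd > hourly_budget_usd


sessions = {"s1": 900, "s2": 1_000, "s3": 1_100, "s4": 5_000}
alerts = {
    "error_rate": error_rate_doubled(this_week=0.08, last_week=0.03),
    "token_outliers": token_outliers(sessions),
    "budget": over_budget(hourly_cost_usd=12.5, hourly_budget_usd=10.0),
}
```

Using the median rather than the mean as the token baseline keeps a single runaway session from raising the threshold that is supposed to catch it.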

Pillar Three: Outcome Logs for Continuous Quality Improvement

The third pillar is the outcome log. Design it as a separate concept from traditional application logs.

After a complete agent interaction, log the outcome:

Session ID (links back to the trace)
The user's original request (first message)
The agent's final response
User feedback (explicit ratings, or inferred from subsequent actions)
Auto-evaluation scores (LLM-as-judge, rubric-based scoring)

Use these logs as a dataset for quality improvement. Mining low-scoring outcomes surfaces the agent's weak points — input for prompt revisions, new tools, fine-tuning datasets.
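
A minimal sketch of such a record and of the mining step might look like this. The field names, the 0.0 to 1.0 score range, the 1 to 5 rating scale, and the 0.5 threshold are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutcomeRecord:
    session_id: str               # links back to the trace
    user_request: str             # the user's first message
    final_response: str           # the agent's final response
    user_feedback: Optional[int]  # explicit rating (assumed 1-5), None if absent
    auto_eval_score: float        # LLM-as-judge or rubric score, assumed 0.0-1.0


def low_scoring(records, threshold=0.5):
    """Return records scoring below the threshold, worst first."""
    weak = [r for r in records if r.auto_eval_score < threshold]
    return sorted(weak, key=lambda r: r.auto_eval_score)


records = [
    OutcomeRecord("s1", "Refund my order", "Refund issued.", 5, 0.92),
    OutcomeRecord("s2", "Where is my parcel?", "I cannot help with that.", 1, 0.20),
    OutcomeRecord("s3", "Change my address", "Done, but to the old address.", None, 0.45),
]
review_queue = [r.session_id for r in low_scoring(records)]
```

The session IDs in review_queue are the ones to open in the trace view first, since the outcome log tells you that something went wrong and the trace tells you where.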

Designing the Unified Dashboard

Aggregate everything above into a dashboard operators can read at a glance. Here's the layout I run in production on Antigravity:

Top row: three to four scorecards showing health metrics, such as sessions in the last hour, error rate, average latency, and accumulated cost. This gives you system state in one glance.

Middle: time-series charts. Per-agent invocation count, per-tool success rate, per-model token consumption. Stacking them vertically lets you correlate anomalies across dimensions.

Bottom: a list of "recent anomalous sessions." Failed sessions, sessions with unusual latency, expensive sessions — with links that drill down to full traces.

The key is enabling top-down investigation: check overall health, find the time window of the anomaly in the time series, then drill into individual sessions to inspect their traces. A dashboard that naturally supports this flow cuts operational load dramatically.
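
The "recent anomalous sessions" list at the bottom of the dashboard could be fed by a filter like the one sketched here. The SessionSummary fields and both thresholds are hypothetical; in practice you would derive the limits from your own latency and cost distributions.

```python
from dataclasses import dataclass


@dataclass
class SessionSummary:
    session_id: str
    failed: bool
    latency_s: float
    cost_usd: float


def anomalous_sessions(sessions, latency_limit_s=30.0, cost_limit_usd=1.0):
    """Return (session_id, reasons) pairs for sessions worth drilling into."""
    flagged = []
    for s in sessions:
        reasons = []
        if s.failed:
            reasons.append("failed")
        if s.latency_s > latency_limit_s:
            reasons.append("slow")
        if s.cost_usd > cost_limit_usd:
            reasons.append("expensive")
        if reasons:
            flagged.append((s.session_id, reasons))
    return flagged


recent = [
    SessionSummary("s1", failed=False, latency_s=4.2, cost_usd=0.03),
    SessionSummary("s2", failed=True, latency_s=45.0, cost_usd=0.10),
    SessionSummary("s3", failed=False, latency_s=6.0, cost_usd=2.40),
]
panel = anomalous_sessions(recent)
```

Returning the reasons alongside each session ID lets the dashboard show why a session was flagged next to the link to its trace.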

Sampling Strategy: 100% Isn't Realistic

So far I've recommended sampling at 100%, but at production scale that isn't sustainable. Trace storage alone can run to thousands of dollars monthly. Here is a practical sampling strategy:

For the first two weeks in production, record everything. With sparse data, gaps make root-cause analysis impossible.

Once you reach steady state, thin the data this way: error sessions get 100% (rare but high-value); the top 5% of sessions by cost get 100% (directly relevant to cost optimization); everything else gets 10–20%. Combine this with a rolling strategy: after every new release, snap back to 100% for 48 hours.
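
These rules can be sketched as a single decision function that runs when a session finishes. The function name and parameters are hypothetical, and cost_p95_usd is assumed to be a 95th-percentile cost that you compute separately from recent sessions. Hashing the session ID instead of calling a random number generator is my own choice here, so that the same session always gets the same decision.

```python
import hashlib
from datetime import datetime, timedelta


def keep_trace(session_id, is_error, cost_usd, cost_p95_usd,
               now, last_release, baseline_rate=0.10):
    """Decide whether to store the full trace for a finished session."""
    if is_error:
        return True   # error sessions: always keep
    if cost_usd >= cost_p95_usd:
        return True   # top 5% by cost: always keep
    if now - last_release < timedelta(hours=48):
        return True   # within 48 hours of a release: back to 100%
    # Deterministic sampling: map the session ID to [0, 1) and compare.
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_rate


release = datetime(2026, 1, 1, 9, 0)
steady = release + timedelta(days=10)

kept_error = keep_trace("s-err", True, 0.01, 0.50, steady, release)
kept_costly = keep_trace("s-big", False, 0.75, 0.50, steady, release)
kept_fresh = keep_trace("s-new", False, 0.01, 0.50, release + timedelta(hours=3), release)
```

Note that the decision needs the error flag and the final cost, so it has to be made after the session ends, which means traces are buffered first and dropped afterwards.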

A Phased Implementation Roadmap

A realistic order for implementation:

Week one: focus only on structured traces. Operational visibility improves dramatically with this alone.
Week two: collect basic metrics (invocations, errors, latency, cost).
Month two: build the outcome-log auto-evaluation pipeline. This takes the longest but pays the highest dividends.
Finally: wire up unified dashboards and alerting.

Each phase produces visible improvement, making it easy to maintain organizational momentum.

Your Next Step

After reading this, the immediate move is to instrument one of your production agents with structured tracing. Just one is fine. Watch the traces accumulate for a week.

You'll spot things like "this agent is calling the same tool repeatedly," "this code path almost always fails," "this user session is anomalously long." Improvement ideas emerge only when the data is in front of you. Observability is an investment, but the returns are real.
