Have you ever spent a week wiring up a slick Antigravity AI agent, only to pause the night before launch and wonder, "wait, how many LLM calls is this thing actually making per request?" I have. The agent looked great in demos, but the moment I put real users in front of it, I discovered it was firing the model thirty times per request. I only caught it because the bill started climbing, not because I had a dashboard.
Observability for AI agents is the kind of work that never wins a demo. It is also the thing you regret not doing, every single time. In this guide I will walk through how I set up Langfuse on top of an Antigravity agent to get enough visibility into traces, token usage, and cost to stay sane in production, without overengineering the stack.
Why AI Agent Observability Keeps Getting Deprioritized
General-purpose observability stacks like OpenTelemetry, covered in the OpenTelemetry AI observability pipeline guide, are great at infrastructure-level questions: request rates, CPU load, latency percentiles. But AI agents introduce a second layer of questions that those stacks simply were not built for.
What you really want to know as an AI developer is: which prompt consumed how many tokens, which tool call took how long, what was the final output, and how good was it? Retrofitting that information after the fact usually means rebuilding your log format from scratch. Langfuse is an open-source tracing backend purpose-built for exactly this kind of question. You can run it self-hosted or use the cloud version, and the API is identical, which is why I have made it my default for new Antigravity projects.
Another reason observability gets skipped: most of us optimize for "the agent works" first, then move on to the next feature. By the time the agent has grown into a small web of tool calls, Retrieval-Augmented Generation, and conditional branches, peeking inside is genuinely hard. Retrofitting instrumentation at that stage feels expensive, which is why it keeps getting pushed to the next sprint, and then the sprint after that. Starting lightweight from day one, even with a single @observe decorator, removes that barrier almost entirely.
An Honest Case for Choosing Langfuse
The AI observability space is crowded right now. LangSmith, Helicone, Arize, and Phoenix all have their own strengths. Here is the short version of why I keep reaching for Langfuse in my own projects.
- Self-hostable with the same API as the cloud version. I can prototype on the cloud plan and ship client work on a private instance without touching the code.
- Clean Python and TypeScript SDKs. When I ask the Antigravity agent to generate Langfuse integration code, the output almost always runs as-is.
- Three concepts to understand: traces, generations, and scores. The low learning curve matters a lot when I need to bring a small team on board.
- Evaluation features tie back to traces. I can promote a production trace into an evaluation dataset without leaving the tool, which keeps the "observe → evaluate → improve" loop tight.
- Pricing that plays well with solo developers. The free tier is generous enough to ship a small product on, and the paid tiers scale with trace volume rather than seats. That matters when you are running five side projects at once.
If you also need SLO management and full infrastructure monitoring, pair Langfuse with the Prometheus + Grafana monitoring stack. I treat Langfuse as "what is happening inside the AI" and Prometheus as "is the application healthy overall," and the two live happily side by side.
Getting Langfuse Running on Antigravity in Fifteen Minutes
Langfuse itself requires one package install (langfuse) and three environment variables. The example below also needs the Google Gen AI SDK (the google-genai package) and a GOOGLE_API_KEY. When I ask the Antigravity agent for a minimal Google Gen AI SDK + Langfuse example, it returns roughly this code, which is the exact shape I use as a starting point.
```python
# app/observed_agent.py: minimal example that traces Gen AI SDK calls with Langfuse
import os

from google import genai
# This import path matches the v2 Python SDK; later SDK versions may expose observe elsewhere.
from langfuse.decorators import observe, langfuse_context

# Langfuse reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment
client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])


@observe(as_type="generation")
def answer_question(question: str) -> str:
    """Answer a user question with Gemini. @observe auto-captures the trace."""
    model = "gemini-2.5-flash"

    # Tell Langfuse which model we're invoking
    langfuse_context.update_current_observation(
        model=model,
        input={"question": question},
    )

    response = client.models.generate_content(model=model, contents=question)
    output_text = response.text

    # Hand off token usage + output to Langfuse
    langfuse_context.update_current_observation(
        output={"answer": output_text},
        usage={
            "input": response.usage_metadata.prompt_token_count,
            "output": response.usage_metadata.candidates_token_count,
            "unit": "TOKENS",
        },
    )
    return output_text


if __name__ == "__main__":
    print(answer_question("Summarize what makes Antigravity distinct, in three lines."))
    # Short-lived script: flush queued events before the process exits
    langfuse_context.flush()
```

Run this and the Langfuse dashboard fills up with one row per call: which prompt went in, how many tokens it burned, how long it took. Grab the project's public and secret keys from the dashboard, drop them into your .env, and you are done.
One thing worth noting: Langfuse batches events in the background, so you will sometimes see a slight delay before new traces appear. On short-lived scripts, call langfuse_context.flush() at the end to make sure nothing is dropped on exit. I lost about an hour once debugging "missing traces" that turned out to be a script finishing before the background thread had flushed.
To take the snippet above from demo to staging, layer on the retry patterns from the guide on keeping your Antigravity Python API up under real load. The two designs compose cleanly: retries keep your agent alive during transient failures, and Langfuse records the retry behavior so you can tell healthy retries apart from runaway loops.
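To make the "healthy retries versus runaway loops" distinction concrete, here is a minimal sketch of a retry wrapper that records every attempt. It is not the pattern from the linked guide, just one plausible shape. RetryLog, call_with_retries, and the backoff parameters are hypothetical names chosen for this illustration, and the sketch uses only the standard library so it runs without Langfuse installed.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional, TypeVar

T = TypeVar("T")


@dataclass
class RetryLog:
    """Per-request record of every attempt (hypothetical helper for this sketch)."""
    attempts: list[dict[str, Any]] = field(default_factory=list)


def _entry(attempt: int, started: float, exc: Optional[BaseException]) -> dict[str, Any]:
    """Build one log row: attempt number, outcome, error type, and duration."""
    return {
        "attempt": attempt,
        "ok": exc is None,
        "error": type(exc).__name__ if exc else None,
        "seconds": time.perf_counter() - started,
    }


def call_with_retries(
    fn: Callable[[], T],
    log: RetryLog,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff plus jitter, logging each attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        started = time.perf_counter()
        try:
            result = fn()
        except Exception as exc:  # in real code, narrow this to transient errors
            log.attempts.append(_entry(attempt, started, exc))
            if attempt == max_attempts:
                raise  # the hard cap is what prevents a runaway loop
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))
        else:
            log.attempts.append(_entry(attempt, started, None))
            return result
    raise AssertionError("unreachable")
```

Inside an observed function, the attempt list could then be attached as metadata on the current observation (with the decorator API used earlier, something like langfuse_context.update_current_observation(metadata={"retries": log.attempts})), so a trace with one failed attempt looks visibly different from one that hit the cap.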
Three Rules of Thumb for Useful Trace Design
Instrumentation starts the moment you install the SDK, but making those traces useful a week later takes a little discipline. These are the three rules I never break on Antigravity projects.
- One user request equals one trace. Every tool call and LLM call should nest under the same trace ID. When a user reports a weird output, you want to see the whole story on a single screen.
- Break out tool calls as spans. Database hits, external APIs, and LLM invocations each get their own span. Langfuse's Gantt-style view really pays off here — bottlenecks become obvious at a glance.
- Always tag user ID and session ID. Future-you will want to answer "what does this cost for premium users?" or "can you reproduce that specific session?" Without tags, the answer is always "no." Hashed IDs are fine; you do not need to expose raw PII.
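The first two rules can be sketched with nested decorators. With the decorator API, the outermost decorated call becomes the trace and decorated functions called inside it show up as child observations. The function names and bodies below are hypothetical stand-ins, and the no-op fallback is an addition for this sketch so the code still runs, untraced, when the langfuse.decorators module is not available.

```python
# Import path as used in the example earlier in this guide (v2-style decorators).
try:
    from langfuse.decorators import observe
except ImportError:
    def observe(*_args, **_kwargs):
        """No-op stand-in: returns the function unchanged when Langfuse is absent."""
        def wrap(fn):
            return fn
        return wrap


@observe()
def search_docs(query: str) -> list[str]:
    """Hypothetical retrieval step; a real one would hit a database or vector store."""
    return [f"doc about {query}"]


@observe()
def draft_answer(question: str, docs: list[str]) -> str:
    """Stand-in for the LLM call; swap in a real generation here."""
    return f"{question} -> answered from {len(docs)} doc(s)"


@observe()
def handle_request(question: str) -> str:
    """Entry point: one call here corresponds to one user request, hence one trace."""
    docs = search_docs(question)
    return draft_answer(question, docs)
```

Calling handle_request("pricing") once should produce a single trace with the retrieval and drafting steps nested underneath, which is the "whole story on a single screen" view described above.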
Internalize these three early and your dashboards will scale as your agent catalog grows instead of turning into noise. I have a small helper module that wraps Langfuse's session start with project-wide tags — model version, feature flag state, experiment group — and it has paid for itself many times over during post-incident reviews.
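A helper along those lines might look like the sketch below. It is an illustration, not the author's actual module: hash_id, TraceContext, build_trace_context, and the tag format are all hypothetical. It uses a keyed hash (HMAC-SHA256) so the same user always maps to the same ID without the raw value ever reaching the tracing backend.

```python
import hashlib
import hmac
from dataclasses import dataclass


def hash_id(raw_id: str, secret: bytes) -> str:
    """Keyed hash: stable per user, but not recoverable without the secret."""
    digest = hmac.new(secret, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability in dashboards


@dataclass
class TraceContext:
    """The values to attach to every trace (hypothetical container)."""
    user_id: str
    session_id: str
    tags: list[str]


def build_trace_context(
    raw_user_id: str,
    session_id: str,
    model_version: str,
    flags: dict[str, bool],
    experiment_group: str,
    secret: bytes,
) -> TraceContext:
    """Collect project-wide tags: model version, enabled flags, experiment group."""
    tags = [f"model:{model_version}", f"exp:{experiment_group}"]
    # Only enabled flags become tags; sorting keeps the order deterministic.
    tags += [f"flag:{name}" for name, enabled in sorted(flags.items()) if enabled]
    return TraceContext(
        user_id=hash_id(raw_user_id, secret),
        session_id=session_id,
        tags=tags,
    )
```

Inside an observed function, the resulting fields could be passed to langfuse_context.update_current_trace(user_id=..., session_id=..., tags=...), assuming the same decorator API as the earlier example. Keep the secret out of source control, since anyone holding it can recompute the mapping.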
What I Actually Read on the Dashboard Every Day
Langfuse's dashboard is dense, but on my own projects I really only watch three numbers day to day.
- Average tokens per trace. If this drifts upward over a week, something in the prompt or retrieval layer is getting fatter. I pair this view with the agent cost optimization guide on halving tokens to attack the worst offender first.
- p95 latency. A healthy median hides pain. Perceived quality is almost always set by the p95, not the p50.
- Error rate at the observation level. You will never get to zero, so pick your acceptable ceiling up front. That way you do not numb yourself with alert fatigue when something that does not matter goes red.
There is one subtle trap in these three numbers that took me months to spot: they all move together when you change models. Switching from Gemini 2.5 Pro to 2.5 Flash will usually drop cost and latency, but it can also nudge the error rate up if the smaller model struggles on edge cases. When you change a model, reset your mental baseline for all three metrics rather than celebrating a cost drop that actually came with a quality dip.
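If you export trace data and want to check these numbers outside the dashboard, the arithmetic is small. The sketch below assumes a hypothetical TraceRecord shape (not a Langfuse export format) and computes the three metrics per model, so that a model switch gets its own baseline instead of being blended into the old one. The p95 uses the nearest-rank method.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TraceRecord:
    """One trace, reduced to the fields needed here (hypothetical shape)."""
    model: str
    total_tokens: int
    latency_s: float
    observations: int
    failed_observations: int


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile; integer math avoids float rounding in the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(records: list[TraceRecord]) -> dict[str, float]:
    """Average tokens per trace, p95 latency, and observation-level error rate."""
    if not records:
        raise ValueError("need at least one trace record")
    total_obs = sum(r.observations for r in records)
    failed = sum(r.failed_observations for r in records)
    return {
        "avg_tokens_per_trace": sum(r.total_tokens for r in records) / len(records),
        "p95_latency_s": p95([r.latency_s for r in records]),
        "error_rate": failed / total_obs if total_obs else 0.0,
    }


def summarize_by_model(records: list[TraceRecord]) -> dict[str, dict[str, float]]:
    """Group by model first so each model is compared against its own baseline."""
    groups: dict[str, list[TraceRecord]] = defaultdict(list)
    for record in records:
        groups[record.model].append(record)
    return {model: summarize(group) for model, group in groups.items()}
```

Comparing the per-model summaries side by side makes it harder to read a cost drop as a pure win when the error rate moved in the other direction.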
Trying to look at everything at once is how dashboards get abandoned. Pick three numbers, glance at them every morning for a minute, and you will learn your own system. After a few weeks you start to notice the shape of a typical day — the morning latency spike when Gemini is warming up, the steady token cost during the afternoon, the late-night burst from international users. That intuition becomes a debugging superpower.
One Step You Can Take Today
Instrumenting everything at once almost always stalls. Instead, pick the single Antigravity agent script you use most often, wrap one function with @observe, and spend ten minutes staring at the resulting dashboard. The first time the numbers surprise you — "this agent is quieter than I thought" or "this agent is way louder than I assumed" — is the moment observability earns a permanent spot in your workflow.
From there you can grow the instrumentation alongside the project. A one-minute morning ritual beats a perfect instrumentation plan every time, and once the habit sticks you will wonder how you ever shipped without it.