AI Tools / 2026-06-17 / Advanced

Your Antigravity LLM App Drifts on Cost and Quality While the Dashboard Stays Green — Instrumentation Field Notes

Watching only total cost and latency hides the slow drifts that hurt. These are field notes on attributing telemetry by feature, tenant, and prompt version so you catch quality regressions and cost spikes early.

For the first few weeks after I put an Antigravity-built agent into production, my Grafana dashboard stayed green the whole time. Total cost was smooth, P95 latency sat inside the threshold, and the error rate was essentially zero. And yet the month-end bill came in at 1.6× my estimate, and one feature's answers had visibly gotten sloppier.

Totals dissolve problems into their average. When you're a solo or indie developer running several features through a single API key, that averaging bites especially hard. Here I want to write down, in the order I actually rebuilt them, the instrumentation pieces that let me catch the two drifts that move underneath the totals: the billing drift and the quality drift.

A total tells you how much, not who is eating it

The first monitoring I built just summed token counts and cost by model. That answers "how much did we spend," but not "where did it grow." In production, cost skews hard by feature and by tenant. The summarize feature might eat 60% of the total; one particular customer might burn 20× the average. The aggregate graph flattens all of that out.

So I changed the unit of instrumentation from "model" to "feature × tenant × prompt version." Every OpenTelemetry span and metric carries those three as attributes. As long as the attributes are present, you can pivot to any slice later in PromQL.

# llm_telemetry.py
import time
import anthropic
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
 
_tp = TracerProvider()
_tp.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317")))
trace.set_tracer_provider(_tp)
metrics.set_meter_provider(MeterProvider(metric_readers=[
    PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint="http://otel-collector:4317"),
        export_interval_millis=30_000,
    )
]))
tracer = trace.get_tracer("llm-app")
meter = metrics.get_meter("llm-app")
 
tokens = meter.create_counter("llm.tokens", description="tokens by direction")
cost = meter.create_counter("llm.cost_usd", unit="USD", description="API cost in USD")
latency = meter.create_histogram("llm.latency_ms", unit="ms", description="end-to-end latency")
errors = meter.create_counter("llm.errors", description="error count by type")
 
# Prices are $/1M tokens. Revisit on every model change (see below).
MODEL_PRICES = {
    "claude-sonnet-4-6": {"in": 3.0, "out": 15.0},
    # Haiku 4.5 list price is $1 in / $5 out; 0.25 / 1.25 was Claude 3 Haiku.
    "claude-haiku-4-5-20251001": {"in": 1.0, "out": 5.0},
    "claude-opus-4-8": {"in": 5.0, "out": 25.0},
}
 
class TelemetryClient:
    """Wrapper that attributes telemetry by feature, tenant, and prompt version."""
 
    def __init__(self):
        self._client = anthropic.Anthropic()
 
    def call(self, *, model, messages, feature, tenant, prompt_version,
             max_tokens=2048, system=None, role="product"):
        # role="product" is a live response; role="eval" is quality scoring
        # (the key that keeps the two budgets separate).
        attrs = {
            "feature": feature, "tenant": tenant,
            "prompt_version": prompt_version, "model": model, "role": role,
        }
        with tracer.start_as_current_span("llm.call") as span:
            for k, v in attrs.items():
                span.set_attribute(k, v)
            t0 = time.monotonic()
            try:
                kwargs = {"model": model, "max_tokens": max_tokens, "messages": messages}
                if system:
                    kwargs["system"] = system
                res = self._client.messages.create(**kwargs)
                ms = (time.monotonic() - t0) * 1000
                tin, tout = res.usage.input_tokens, res.usage.output_tokens
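                # Unknown models fall back to Sonnet-tier prices; flag the span
                # so a missing MODEL_PRICES entry is findable instead of silent.
                span.set_attribute("llm.price_known", model in MODEL_PRICES)
                price = MODEL_PRICES.get(model, {"in": 3.0, "out": 15.0})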
                usd = tin / 1e6 * price["in"] + tout / 1e6 * price["out"]
 
                tokens.add(tin, {**attrs, "direction": "in"})
                tokens.add(tout, {**attrs, "direction": "out"})
                cost.add(usd, attrs)
                latency.record(ms, attrs)
                span.set_attribute("llm.cost_usd", usd)
                span.set_attribute("llm.latency_ms", ms)
                return {"text": res.content[0].text, "cost_usd": usd,
                        "latency_ms": ms, "in": tin, "out": tout}
            except Exception as e:
                errors.add(1, {**attrs, "error_type": type(e).__name__})
                span.record_exception(e)
                span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
                raise

I include prompt_version as an attribute because it pays off in the quality tracking later. I want to be able to cut cost and quality at the version boundary — to see how both moved the moment I changed a prompt. The version string can be a hash of the prompt template or a short hand-assigned identifier like chat-v7. What matters is that "when did I change what" survives as an axis on the metrics.
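
If you go the hash route, one possible shape is below. This is a minimal sketch, not the code I run in production: the function name `prompt_version_from`, the `p-` prefix, and the 8-character length are all assumptions chosen for illustration. It normalizes line endings and outer whitespace first, so a purely cosmetic edit doesn't mint a new version.

```python
import hashlib

def prompt_version_from(template: str, prefix: str = "p") -> str:
    """Derive a short, stable version id from the prompt template text.

    Hypothetical helper: the name, prefix, and digest length are illustrative.
    """
    # Normalize line endings and outer whitespace so cosmetic edits
    # (CRLF vs LF, a trailing newline) do not change the version.
    normalized = template.replace("\r\n", "\n").strip()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:8]}"

v_a = prompt_version_from("Summarize the text below.\n{text}")
v_b = prompt_version_from("Summarize the text below.\r\n{text}\n")
v_c = prompt_version_from("Summarize the text below in prose.\n{text}")
```

Here `v_a` and `v_b` come out identical, while `v_c` differs because the wording changed. A hand-assigned label like chat-v7 is easier to read on a graph; a hash has the advantage that you can't forget to bump it.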

With attributes in place, the queries that expand cost by feature or tenant are straightforward to write.

# Cost per feature over the last hour (which feature is eating the budget)
sum(increase(llm_cost_usd_total{role="product"}[1h])) by (feature)
 
# Top tenant consumers (spot a single customer running away)
topk(5, sum(increase(llm_cost_usd_total[24h])) by (tenant))

Eval-model spend sneaks into production cost

To score quality automatically with an LLM-as-a-Judge, the evaluation itself calls the API. At first I didn't separate this, so the eval spend rode on top of my production cost graph, and I couldn't tell whether "production is expensive" or "I'm running too many evaluations." That's why I split it out with role="eval". Evaluation runs on a cheap model, and in PromQL I look at production cost alone with role="product".

Scoring every request inflates the eval budget on its own, so I made it a rolling sample. Instead of grading everything every minute, I pull a fixed fraction per feature and evaluate that. This keeps eval spend at a few percent of production cost while still capturing trend changes.

# quality_sampler.py
import json, random
from dataclasses import dataclass

from opentelemetry import metrics

@dataclass
class Score:
    relevance: float
    grounded: float        # is it grounded in the reference (for RAG)
    overall: float
    note: str

class QualitySampler:
    """Rolling-sample scoring, aggregated by prompt version."""

    def __init__(self, telemetry, sample_rate=0.05, eval_model="claude-haiku-4-5-20251001"):
        self.t = telemetry
        self.rate = sample_rate
        self.eval_model = eval_model
        # Create the instrument once, not on every call. It is a histogram,
        # so per-version averages come from its sum and count series.
        self.quality = metrics.get_meter("llm-app").create_histogram(
            "llm.quality_overall", description="judge overall score, 0.0-1.0")

    def maybe_score(self, *, question, answer, context, feature, tenant, prompt_version):
        # random() is in [0, 1), so >= samples with probability exactly self.rate.
        if random.random() >= self.rate:
            return None  # not sampled this time
        ctx = f"\n\nReference:\n{context}" if context else ""
        prompt = (
            "Grade the following answer as JSON only. "
            "Each field is 0.0-1.0.\n"
            f"Question: {question}{ctx}\n\nAnswer: {answer}\n\n"
            '{"relevance": fit, "grounded": faithfulness to reference, '
            '"overall": overall, "note": "one-sentence remark"}'
        )
        res = self.t.call(
            model=self.eval_model,
            messages=[{"role": "user", "content": prompt}],
            feature=feature, tenant=tenant, prompt_version=prompt_version,
            role="eval", max_tokens=300,
        )
        try:
            data = json.loads(res["text"])
            score = Score(relevance=float(data["relevance"]),
                          grounded=float(data["grounded"]),
                          overall=float(data["overall"]),
                          note=str(data.get("note", "")))
        except (ValueError, KeyError, TypeError):
            return None  # judge did not return the requested JSON; skip this sample
        # Push quality into metrics too, so it can be tracked per version.
        self.quality.record(score.overall, {"feature": feature, "prompt_version": prompt_version})
        return score
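
The "few percent" figure can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only: the token counts and per-million-token prices are assumed values, not measurements from this app, and `call_cost` and `eval_overhead` are hypothetical helper names.

```python
def call_cost(tokens_in, tokens_out, price_in, price_out):
    """USD cost of one call; prices are $ per 1M tokens."""
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

def eval_overhead(sample_rate, product_cost, eval_cost):
    """Expected eval spend as a fraction of production spend."""
    return sample_rate * eval_cost / product_cost

# Assumed, illustrative numbers (not measurements):
# a production call with 2000 tokens in and 500 out at $3 / $15,
product = call_cost(2000, 500, price_in=3.0, price_out=15.0)
# and a judge call that re-reads question, context, and answer (2800 in)
# but emits only a short JSON verdict (100 out) on a cheaper $1 / $5 model.
judge = call_cost(2800, 100, price_in=1.0, price_out=5.0)

overhead = eval_overhead(0.05, product, judge)
print(f"{overhead:.1%}")  # prints 1.2%
```

Under these assumptions a 5% sample adds roughly 1% to the bill. The ratio rises quickly if the judge runs on the same model as production or if the sample rate creeps up, which is why the role="eval" split matters.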

Aggregating scores by prompt_version makes the start of a quality regression appear right at the version boundary. That becomes the foundation for catching the nastiest drift of all.

The silent quality regression in an era of auto-upgraded default models

Since the start of 2026, platforms have increasingly raised their default models quietly. In the Gemini API, the default model for search and agents got swapped for a newer version; in Antigravity, the model used under the hood shifts between releases. Convenient — but it means that even though you changed nothing in your prompts, the output's behavior can start shifting one day. Usually it's an improvement, but sometimes the way your fine-grained instructions land changes, and one specific feature loses accuracy.

This regression doesn't throw errors, so the error-rate alert won't catch it. Latency barely moves. The only thing that catches it is the quality metric. So I track a moving average of scored quality per version and per feature, and raise a warning when it drops more than a threshold against the preceding window.

# quality_drift.py
from collections import deque, defaultdict
import statistics
 
class QualityDrift:
    """Hold a moving window of quality scores per feature and detect regressions."""
 
    def __init__(self, window=200, drop_threshold=0.08):
        self.window = window
        self.drop = drop_threshold
        self.recent = defaultdict(lambda: deque(maxlen=window))
 
    def observe(self, feature, overall):
        self.recent[feature].append(overall)
 
    def check(self, feature):
        buf = self.recent[feature]
        if len(buf) < self.window:
            return None  # not enough to judge yet
        half = self.window // 2
        prev = statistics.mean(list(buf)[:half])
        now = statistics.mean(list(buf)[half:])
        if prev - now >= self.drop:
            return {
                "type": "QUALITY_DRIFT",
                "feature": feature,
                "from": round(prev, 3),
                "to": round(now, 3),
                "message": f"{feature} quality dropped {prev:.2f}->{now:.2f}; "
                           "suspect default-model change, prompt edit, or input drift",
            }
        return None

When a warning fires, I check three things first. Did the default model change (the platform's release notes)? Did I touch the prompt (the prompt_version boundary)? Did the input distribution shift (a new tenant using it in an unexpected way)? For isolating the cause, overlaying the quality-drop window with the prompt_version metric is the fastest move, in my experience. If the score fell while the version stayed the same, you can read it as "something moved outside my control."

What I hit in production was exactly this shape. The prompt version stayed fixed, yet the summarize feature's score crept down over two weeks. The cause was a default-model swap underneath, which weakened how my "don't use bullet points" instruction landed. Once I pinned the version and re-specified the model explicitly, the score returned to normal. If I hadn't been cutting quality metrics by version, I wouldn't have noticed until the month-end bill and the users' experience had both gone sour.
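
The version-boundary check described above can be sketched as a small function. This is a simplified illustration under my own assumptions: `triage_drop` and its return labels are hypothetical names, and it only separates "the prompt changed inside the window" from "something else moved". Telling a default-model swap apart from input drift still takes release notes and a look at the traffic.

```python
from statistics import mean

def triage_drop(samples, drop_threshold=0.08):
    """Classify a quality drop across a window of scored samples.

    samples: list of (prompt_version, overall_score) in arrival order.
    Hypothetical helper; labels are illustrative.
    """
    if len(samples) < 2:
        return "insufficient_data"
    half = len(samples) // 2
    prev, now = samples[:half], samples[half:]
    drop = mean([s for _, s in prev]) - mean([s for _, s in now])
    if drop < drop_threshold:
        return "no_drift"
    # A version seen only in the newer half means the prompt was edited in-window.
    new_versions = {v for v, _ in now} - {v for v, _ in prev}
    return "prompt_changed" if new_versions else "external_change"

same_version = [("chat-v7", 0.9)] * 4 + [("chat-v7", 0.7)] * 4
after_edit = [("chat-v7", 0.9)] * 4 + [("chat-v8", 0.7)] * 4
```

`triage_drop(same_version)` returns "external_change", the case from my summarize feature, while `triage_drop(after_edit)` returns "prompt_changed" and points back at my own edit.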

Stop per-tenant spikes with a cost ledger

An aggregate cost alert is deaf to per-tenant runaway. One customer can be burning dozens of times the average while the whole stays inside the threshold. So I record each tenant's consumption in a ledger and warn on each tenant's deviation from its own normal range. The threshold isn't the global average — it's that tenant's own past distribution.

# cost_ledger.py
from collections import defaultdict
from datetime import datetime, timedelta, timezone
import statistics

class CostLedger:
    """Record per-tenant consumption and detect deviation from its own normal range."""

    def __init__(self, window_hours=24, sigma=3.0):
        self.window = timedelta(hours=window_hours)
        self.sigma = sigma
        self.hourly = defaultdict(lambda: defaultdict(float))  # tenant -> hour -> usd

    def record(self, tenant, usd):
        # Bucket by UTC hour so daylight-saving shifts never merge or split buckets.
        now = datetime.now(timezone.utc)
        self.hourly[tenant][now.replace(minute=0, second=0, microsecond=0)] += usd

    def check(self, tenant):
        # Drop buckets older than the window so the baseline stays recent
        # and the ledger does not grow without bound.
        cutoff = datetime.now(timezone.utc) - self.window
        buckets = self.hourly[tenant]
        for h in [h for h in buckets if h < cutoff]:
            del buckets[h]
        hours = sorted(buckets.items())
        if len(hours) < 12:
            return None  # history too short to build a baseline
        past = [v for _, v in hours[:-1]]
        current = hours[-1][1]
        mean, sd = statistics.mean(past), statistics.pstdev(past)
        if sd > 0 and current > mean + self.sigma * sd:
            return {
                "type": "TENANT_COST_SPIKE",
                "tenant": tenant,
                "current_usd": round(current, 4),
                "baseline_usd": round(mean, 4),
                "message": f"{tenant} hourly spend left its normal range ({current:.3f} vs mean {mean:.3f})",
            }
        return None
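
One edge case in the check above: a tenant whose past hours are all identical has a standard deviation of zero, so the `sd > 0` guard means it never alerts, even on a large jump. A possible guard is a floor on the deviation. The sketch below is a standalone illustration, not part of the ledger as I run it; `is_spike` and the `min_sd` value of one cent are assumptions you would tune to your own spend levels.

```python
import statistics

def is_spike(past, current, sigma=3.0, min_sd=0.01):
    """Return True if current hourly spend exceeds mean + sigma * sd.

    min_sd (USD) is an assumed floor so a perfectly flat baseline
    can still trigger an alert.
    """
    mean = statistics.mean(past)
    sd = max(statistics.pstdev(past), min_sd)
    return current > mean + sigma * sd

flat_baseline = [0.10] * 12                     # a scheduled agent with constant spend
jump = is_spike(flat_baseline, 0.50)            # threshold is 0.10 + 3 * 0.01 = 0.13
normal = is_spike(flat_baseline, 0.12)
```

With the floor in place, the jump to 0.50 is flagged and 0.12 is not. Scheduled agents with very regular spend are the tenants most likely to have a flat baseline, so this case is less exotic than it looks.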

Once you can see things per tenant, you have more options. If it's a bug looping, stop it; if it's legitimate heavy use, turn it into a usage-billing conversation. Either way, you can move far earlier than discovering it on the month-end bill. In setups like Antigravity's Managed Agents, where agents run on scheduled triggers, runaway is likeliest during unattended hours — so this ledger earns its keep there in particular.

The order to add instrumentation

You don't need all of it at once. Get the order right and a little effort widens what you can see. I'd add attributed instrumentation first (feature, tenant, prompt version) so the skew under the totals becomes visible. That alone makes the cost conversation a notch more concrete. Next, separate eval from production and add rolling-sample scoring so quality can be tracked by version. Last, add the tenant ledger to stop individual spikes.

The value of observability isn't the number of graphs — it's whether a human can trace a problem to its cause when one happens. With the attributes in place, even the quiet drift moving behind a green dashboard is within reach of the numbers. I hope this helps anyone else running several features through a single key.
