AI Tools · 2026-07-01 · Advanced

Measure Before You Trim: A Context Ledger for Antigravity CLI Token and Latency Costs

Prompted by the ~70% token reduction reported for the Android CLI agent, I built a thin wrapper and a weekly review to measure my own agent runs. Here is how I replaced whole-file context with line ranges and cut wait times.

A recent update reported that an agent running on the Android command line completed tasks about three times faster and used roughly 70% fewer tokens than standard tooling. The speed is genuinely welcome, but staring at that number made me realize something uncomfortable: I had no idea how much context my own agent runs were actually sending, or how long I was waiting. A story about reduction only means something once you can measure where you stand today.

As an indie developer I run several apps and several blogs in parallel, and I throw small fix-it tasks at the Antigravity CLI many times a day. I had a vague feeling that "this task is oddly slow" or "my quota is draining fast today," but I couldn't put a number on any of it. So I worked in a deliberate order: lay a thin measurement layer over the CLI first, then trim. These are my notes from doing exactly that.

What Do You Measure to See the Waste?

The most reliable source of exact token usage is the CLI's quota screen or usage report. The trouble is that those numbers tend to be a session-wide total rather than per-task, so they can't tell you which task is heavy.

So I settled on two proxy metrics I can fully control.

Wall-clock seconds

This maps directly to how slow a task feels. Simply recording the real time until the agent returns already makes the weight difference between tasks obvious.

Bytes of explicitly passed context

I can't fully control what semantic search picks up behind the scenes, but I can count exactly the bytes of the files and ranges I pass in myself. Bytes aren't tokens, but they're more than enough to spot when I'm shipping a needlessly large input.

I append these two values, along with a tag for the task, one line at a time. That tiny ledger became my starting point.

A Thin Wrapper That Records One Task

Instead of calling the CLI directly, I route calls through a wrapper that times the call and records the context size. Adjust the command and flags to match your version. What matters is that one line gets recorded automatically on every call, without my having to remember to do it.

#!/usr/bin/env bash
# agy-run.sh — a thin wrapper that records wait time and context size around the Antigravity CLI
set -euo pipefail
 
LEDGER="${HOME}/.agy/ledger.csv"
mkdir -p "$(dirname "$LEDGER")"
[ -f "$LEDGER" ] || echo "ts,tag,context_bytes,context_files,seconds,status" > "$LEDGER"
 
TAG="${1:?usage: agy-run <tag> <prompt-file> [context files or ranges...]}"
PROMPT_FILE="${2:?prompt file required}"
shift 2
CTX=("$@")
 
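# Total bytes of explicitly passed context (a proxy I can control).
# For a range like "path:120-164", strip the ":range" suffix and count the
# whole file, so ranged calls are overstated.
ctx_bytes=0
# The ${CTX[@]+...} form keeps an empty array from tripping set -u on older bash.
for item in ${CTX[@]+"${CTX[@]}"}; do
  f="${item%%:*}"
  if [ -f "$f" ]; then
    ctx_bytes=$(( ctx_bytes + $(wc -c < "$f") ))
  fi
done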
 
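start=$(date +%s)
# The real call. Only what you pass via --context is used for grounding.
# "|| status=$?" captures a failure; under set -e a bare failing call would
# abort the script before the ledger line is written.
status=0
agy run --prompt "$PROMPT_FILE" ${CTX[@]:+--context "${CTX[@]}"} || status=$?
end=$(date +%s)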
 
printf '%s,%s,%d,%d,%d,%d\n' \
  "$(date -u +%FT%TZ)" "$TAG" "$ctx_bytes" "${#CTX[@]}" "$(( end - start ))" "$status" \
  >> "$LEDGER"
 
exit "$status"

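Three points matter here. First, set -euo pipefail stops the wrapper on unset variables and failed setup steps instead of silently carrying on. Second, the exit code is recorded so I can later filter out tasks that were "fast but actually failed"; for that to work under set -e, the agy call has to capture its status (for example with || status=$?) rather than abort the script before the ledger line is written. Third, when I pass a range like path:120-164, the wrapper counts the whole file's bytes, so ranged calls are overstated in the ledger. If you want exact range counts, pipe sed -n output into wc -c, but I chose a rough form I could actually keep up with.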
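
For anyone who does want exact range counts, here is a minimal sketch in Python, the same language as the weekly report. The helper name context_bytes is hypothetical, and it assumes the "path:START-END" form used in this article with 1-indexed, inclusive line numbers; it is not part of the Antigravity CLI.

```python
import re
from pathlib import Path

# Matches "path:START-END"; anything else is treated as a plain file path.
RANGE_RE = re.compile(r"(.+):(\d+)-(\d+)")


def context_bytes(item: str) -> int:
    """Return the byte size of one context item (hypothetical helper).

    "path" counts the whole file; "path:START-END" counts only the lines
    in the 1-indexed, inclusive range. Missing files count as 0 bytes.
    """
    m = RANGE_RE.fullmatch(item)
    if m is None:
        p = Path(item)
        return p.stat().st_size if p.is_file() else 0

    path, start, end = Path(m.group(1)), int(m.group(2)), int(m.group(3))
    if not path.is_file():
        return 0
    with path.open("rb") as fh:
        # Read as bytes so the count matches what wc -c would report.
        return sum(len(line) for n, line in enumerate(fh, 1) if start <= n <= end)
```

Summing context_bytes over the arguments gives a total that shrinks when a whole file is swapped for a range, which the file-size approximation cannot show.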

Auditing the Ledger Weekly

A CSV that accumulates one line at a time only becomes useful once you aggregate it by tag. Once a week I sort tasks by weight with the following script and take a look.

#!/usr/bin/env python3
"""ledger_report.py — aggregate the ledger.csv written by agy-run.sh, grouped by tag."""
import csv
import statistics
import sys
from collections import defaultdict
from pathlib import Path
 
path = Path(sys.argv[1]) if len(sys.argv) > 1 else Path.home() / ".agy/ledger.csv"
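# newline="" is what the csv module expects; the with-block closes the file.
with path.open(encoding="utf-8", newline="") as fh:
    rows = list(csv.DictReader(fh))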
 
by_tag = defaultdict(lambda: {"sec": [], "kb": [], "fail": 0})
for r in rows:
    by_tag[r["tag"]]["sec"].append(int(r["seconds"]))
    by_tag[r["tag"]]["kb"].append(int(r["context_bytes"]) / 1024)
    if r.get("status", "0") != "0":
        by_tag[r["tag"]]["fail"] += 1
 
print(f"{'tag':30}{'runs':>5}{'median_s':>10}{'ctxKB':>9}{'fail':>6}")
for tag, d in sorted(by_tag.items(), key=lambda x: -statistics.median(x[1]["sec"])):
    print(f"{tag:30}{len(d['sec']):>5}"
          f"{statistics.median(d['sec']):>10.0f}"
          f"{statistics.median(d['kb']):>9.1f}"
          f"{d['fail']:>6}")

I sort by the median wait time because I want to fix tasks that are always heavy before chasing the occasional outlier. If a tag with an outsized context KB sits near the top, that's where to trim. In my case, the top of the list was entirely "passed the whole file" tasks.

Pass a Range, Not the Whole File

Looking inside the heavy tasks, the cause was simple. I wanted to change one spot, yet I was loading the entire file into context every time. Out of a 1,200-line page component, only a few dozen lines mattered.

# Bad: pass the whole file (1,200 unrelated lines go along for the ride)
agy-run fix-paywall prompt.md src/app/article/page.tsx
 
# Good: pass only the relevant range and let semantic search handle the rest
agy-run fix-paywall prompt.md "src/app/article/page.tsx:120-164"

On my machine, a task that passed the page component in full (about 38KB, around 1,200 lines) sat at roughly 70 seconds. After narrowing the passed range to lines 120-164 (about 1.5KB), it dropped to around 25 seconds. When the felt sense of a 2-3x speedup lined up cleanly with the ledger numbers, I was glad I had bothered to measure.

I worried at first that ranges would make responses sloppy, but the opposite happened. With no unrelated code mixed in, the agent stopped second-guessing where to make the change, and off-target diffs went down. In my runs, semantic search pulled in the definitions it needed from outside the range.

The Pitfall: Bytes Are Not Tokens

One caution. What the wrapper counts is "the bytes I explicitly passed," not tokens, and not the total that semantic search read behind the scenes. Treat this proxy as an absolute token count and you'll draw the wrong conclusions.

What actually tripped me up was a task whose wait time barely moved even after I narrowed the context. When I dug in, semantic search was pulling in a large generated file (a build artifact) as "relevant." Shrinking the explicit input does nothing if the search side is bloated. As a fix, I added the directory holding the generated output to an exclusion config, which kept it out of grounding. Reconciling the proxy against the real numbers on the quota screen once a month is enough to catch this kind of drift early.
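
One way to spot this pattern from the ledger alone is to look for tags whose explicit context is small but whose wait is still long. The sketch below is only a heuristic under my own assumptions: the function name and both thresholds (5 KB, 45 seconds) are illustrative, and a hit means "go check what search is pulling in," not proof of hidden context.

```python
import statistics
from collections import defaultdict


def hidden_context_suspects(rows, max_kb=5.0, min_seconds=45):
    """Return tags with small explicit context but long median wait.

    rows: dicts with the ledger columns "tag", "context_bytes", "seconds"
    and optionally "status". Thresholds are illustrative assumptions.
    """
    secs, kbs = defaultdict(list), defaultdict(list)
    for r in rows:
        if r.get("status", "0") != "0":
            continue  # failed runs would skew the timing
        secs[r["tag"]].append(int(r["seconds"]))
        kbs[r["tag"]].append(int(r["context_bytes"]) / 1024)
    return sorted(
        tag
        for tag in secs
        if statistics.median(kbs[tag]) <= max_kb
        and statistics.median(secs[tag]) >= min_seconds
    )
```

The rows can come straight from csv.DictReader over ledger.csv, as in ledger_report.py.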

In other words, this ledger isn't "the truth" — it's a map for making good guesses. Used as a map, and only as a map, it earns its keep many times over.

Where It Pays Off Most: Unattended Runs

What I appreciate most about this setup is its effect on tasks that run automatically while no one is watching — revenue rollups for an AdMob-connected app, or the periodic article-integrity checks for my blogs. These unattended runs are small individually but frequent, and wasted context quietly stacks up and drains the quota.

If your automation scripts also call through agy-run.sh, the ledger captures them at the same granularity as hands-on tasks. When the weekly report showed that the context for a nightly task had doubled since the previous month, I traced it to a reference file that had ballooned and fixed it before it disrupted production. For tasks where a failure ripples into later stages, such as pre-submission checks for the App Store or Google Play, recording the exit code is well worth it.
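
The month-over-month comparison described above can be sketched as follows. It assumes the ts column holds the ISO 8601 UTC timestamps the wrapper writes (so the first seven characters are the YYYY-MM month); the function name and the 2x factor are my own illustrative choices.

```python
import statistics
from collections import defaultdict


def context_growth(rows, prev_month, this_month, factor=2.0):
    """Return {tag: (median_before, median_after)} for tags whose median
    context_bytes grew by at least `factor` between two months.

    rows: dicts with the ledger columns "ts", "tag", "context_bytes".
    Months are "YYYY-MM" strings, matched against the start of "ts".
    """
    by_tag = defaultdict(lambda: defaultdict(list))
    for r in rows:
        by_tag[r["tag"]][r["ts"][:7]].append(int(r["context_bytes"]))

    grown = {}
    for tag, months in by_tag.items():
        if prev_month not in months or this_month not in months:
            continue  # need data in both months to compare
        before = statistics.median(months[prev_month])
        after = statistics.median(months[this_month])
        if before > 0 and after >= factor * before:
            grown[tag] = (before, after)
    return grown
```

Comparing medians rather than totals keeps a month with more runs from looking like growth in context size.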

As a rule of thumb, I keep a loose line: "if a tag's median exceeds 30 seconds and its context exceeds 10KB, consider switching to a range." More than the specific numbers, I'd recommend picking one threshold and measuring against it consistently. Move the line to fit your own workflow.
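
That rule of thumb is easy to turn into a check. A minimal sketch, assuming the ledger columns written by agy-run.sh; the function name tags_to_trim is hypothetical and the defaults simply mirror the 30-second and 10KB line above.

```python
import statistics
from collections import defaultdict


def tags_to_trim(rows, max_median_s=30, max_median_kb=10.0):
    """Return tags whose median wait AND median context both exceed the line.

    rows: dicts with the ledger columns "tag", "seconds", "context_bytes".
    """
    secs, kbs = defaultdict(list), defaultdict(list)
    for r in rows:
        secs[r["tag"]].append(int(r["seconds"]))
        kbs[r["tag"]].append(int(r["context_bytes"]) / 1024)
    return sorted(
        tag
        for tag in secs
        if statistics.median(secs[tag]) > max_median_s
        and statistics.median(kbs[tag]) > max_median_kb
    )
```

Both conditions must hold, so a slow task with little explicit context is not flagged here; that case points at the search side rather than at what you pass in.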

The Next Step

Drop in a single agy-run.sh and route your usual tasks through it for just one day. Once a day's worth of ledger accumulates, which of your tasks are heavy becomes visible as numbers rather than a hunch. Trimming can wait until that map exists.

Since I started treating measuring and trimming as separate steps, the way I write instructions for the agent has felt a little calmer. I hope it helps anyone juggling several projects the same way.
