Turning 'the Antigravity CLI feels faster' into a number with hyperfine
The Go-based Antigravity CLI feels snappier to start. Here is how to turn that impression into a reproducible number with hyperfine: warm vs cold runs, cumulative cost in automation, and a CI gate that catches regressions.
When Gemini CLI and the Code Assist extension wind down on June 18, I moved my whole scheduled stack off gemini and onto the Antigravity CLI. The first thing I noticed was vague: it felt quicker to hand control back to the terminal. I half-expected that, since the tool is now a single Go binary. But an impression is just an impression. As an indie developer automating updates across four sites (Dolice Labs), the CLI gets launched dozens of times a day. If startup shifts by 0.4 seconds, that is not noise at my volume. I wanted to convert the feeling into a number anyone can reproduce.
Why time antigravity --version doesn't really measure anything
The first instinct is usually this naive measurement:
time antigravity --version# antigravity 2.0.3# antigravity --version 0.18s user 0.04s system 88% cpu 0.249 total
That 0.249 s blends several unrelated things: the shell resolving PATH and spawning the process, the OS reading the binary off disk (slow on the very first run), and the trivial work of printing a version string. Worse, it is a single sample. Run it again and the OS file cache is warm, so it returns in 0.08 s. Which one is the "real" startup time?
The honest answer is that you need both, depending on intent. For scheduled jobs that hammer the same binary back to back, the warm median is closest to reality. For a tool you rarely launch, or the very first invocation after a deploy, the cold path is what bites. Naive time makes neither distinction and takes one sample, so it isn't a measurement at all.
Capturing the warm distribution with hyperfine
That is what hyperfine is for. It is a Rust benchmarking tool that handles warmup runs, repeated sampling, statistics, and outlier warnings for you.
--warmup 5 runs the command five times before measuring to warm the cache, and --runs 50 takes fifty samples. On my machine (M2 / macOS 15) the result looked like this. These are environment-dependent reference numbers — read the ratio, not the absolute values.
Benchmark 1: antigravity --version
Time (mean ± σ): 28.4 ms ± 2.1 ms
Range (min … max): 25.9 ms … 34.7 ms 50 runs
Benchmark 2: gemini --version
Time (mean ± σ): 186.3 ms ± 9.4 ms
Range (min … max): 171.2 ms … 208.5 ms 50 runs
Summary
'antigravity --version' ran 6.56 ± 0.61 times faster than 'gemini --version'
Now "feels faster" is roughly 6.5x. That is a believable gap. A Node-based CLI boots a Node runtime and loads its dependency modules on every launch; a single Go binary carries almost none of that overhead. Because we measured the trivial --version path, what we are seeing here is pure process-start cost.
✦
Thank you for reading this far.
Continue Reading
What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.
WHAT YOU'LL LEARN
✦How to measure CLI startup latency in warm vs cold conditions and separate it from model response time
✦How to estimate the real cumulative cost of startup time when automation invokes the CLI dozens of times a day
✦How to turn a vague 'it feels faster' into a reproducible benchmark and a p95-threshold CI gate
Secure payment via Stripe · Cancel anytime
✦
Unlock This Article
Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.
Warm runs model back-to-back execution, but the first launch after a deploy or reboot is a different animal. Truly flushing the file cache on macOS is hard (purge is not complete), so I take the other route: make the comparison fair by giving both commands the same --prepare step, inserting an identical "pseudo-cooldown" right before each measurement.
--prepare runs before each measured run and its time is excluded from the result. That imposes the same "pause first" condition on both tools equally. It is not a true cold start, but at least it removes the unfairness of one binary being warm while the other is not. One gotcha: on the first launch, macOS Gatekeeper may verify the binary and steal a few hundred milliseconds. That is the OS, not the tool, so strip the quarantine attribute with xattr -d com.apple.quarantine $(which antigravity) before measuring and the variance settles down.
Don't confuse startup cost with the cost of actually running an agent
This is the biggest trap. Even if antigravity --version is fast, a real run like antigravity run "refactor this" is dominated by the network round trip to the model, and the fast startup barely shows. Startup latency only pays off for workloads that launch the CLI often but do light work each time.
To keep them separate, I split what I measure into three layers.
Layer 1: pure startup cost
Just --version, no network. This is where the Go binary's advantage shows most directly.
Layer 2: light local-only work
Reading config or listing slash commands — work that finishes on disk and CPU alone.
Layer 3: with a model round trip (treat as reference)
Real agent runs are network-bound, high-variance, and the startup difference drowns in noise.
# 1) pure startup cost (no network)hyperfine --warmup 5 'antigravity --version'# 2) light local-only work (config load, slash command list, etc.)hyperfine --warmup 3 'antigravity --help'# 3) with model round trip (reference; high variance, startup delta is buried)hyperfine --runs 5 'antigravity run --quiet "say ok"'
Layer 3 has enormous variance and must never be used as evidence that "the CLI is fast or slow." Talking about startup while actually measuring model congestion is a mix-up that happens constantly. If you want to talk about startup, base it on layers 1 and 2 only.
Estimating the cumulative cost in automation
In personal-dev automation, a per-run difference that looks small adds up by frequency. My scheduled jobs launch the CLI roughly 60 times a day across four sites once you count pre/post-build checks and log collection. If the warm-median difference is, say, 0.16 s:
0.16 s x 60 runs/day = 9.6 s/day
9.6 s x 30 days ≈ 4.8 min/month
That is just under five minutes a month. Whether you read that as "rounding error" or "accumulating waste" depends on your scale. Pipelines that call the CLI many times per CI step, or hooks that launch it on every commit, push that multiplier up by an order of magnitude, and startup latency becomes a cost you can't ignore. Conversely, if you only invoke it a few times a day, the same arithmetic says the difference is practically meaningless — in that case I recommend leaving startup optimization alone and spending the time where it actually moves the needle. The practical rule here is to reason about count x delta, not about how it feels.
Keep results as JSON and stop regressions in CI
Measuring once and moving on is not enough; you want to keep watching that an upgrade doesn't make it slower. hyperfine emits machine-readable results with --export-json.
With that you can write a small gate that fails CI when p95 crosses a threshold. Watch the tail, not the average — what wrecks the experience is not the mean but the occasional slow run.
#!/usr/bin/env bash# bench-gate.sh — fail (exit 1) if startup latency exceeds the budgetset -euo pipefailTHRESHOLD_MS=80 # treat anything above this as a regressionhyperfine --warmup 5 --runs 50 \ --export-json bench.json \ 'antigravity --version' >/dev/null# compute p95 from the times array (watch the tail, not the mean)P95_MS=$(python3 - <<'PY'import jsontimes = sorted(json.load(open("bench.json"))["results"][0]["times"])p95 = times[int(len(times) * 0.95) - 1] * 1000print(round(p95, 1))PY)echo "p95 startup: ${P95_MS} ms (threshold ${THRESHOLD_MS} ms)"if awk "BEGIN { exit !(${P95_MS} > ${THRESHOLD_MS}) }"; then echo "🛑 startup latency regressed" exit 1fiecho "✅ startup latency within budget"
Drop this gate into whatever workflow updates the CLI and you will automatically catch the day startup quietly got heavier. Start the threshold a bit above your environment's warm median (I use about 2.5x the median) and tighten it as you go.
Run hyperfine --warmup 5 'antigravity --version' once on your own machine first and get your own number. Once a feeling becomes a figure, it gets surprisingly easy to see clearly where speed is worth chasing — and whether it is worth chasing at all.
Share
Thank You for Reading
Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.