Tips & Best Practices · 2026-06-14 · Intermediate

Turning 'the Antigravity CLI feels faster' into a number with hyperfine

The Go-based Antigravity CLI feels snappier to start. Here is how to turn that impression into a reproducible number with hyperfine: warm vs cold runs, cumulative cost in automation, and a CI gate that catches regressions.

With Gemini CLI and the Code Assist extension winding down on June 18, I moved my whole scheduled stack off gemini and onto the Antigravity CLI. The first thing I noticed was vague: it felt quicker to hand control back to the terminal. I half-expected that, since the tool is now a single Go binary. But an impression is just an impression. As an indie developer automating updates across four sites (Dolice Labs), I launch the CLI dozens of times a day. If startup shifts by 0.4 seconds, that is not noise at my volume. I wanted to convert the feeling into a number anyone can reproduce.

Why `time antigravity --version` doesn't really measure anything

The first instinct is usually this naive measurement:

time antigravity --version
# antigravity 2.0.3
# antigravity --version  0.18s user 0.04s system 88% cpu 0.249 total

That 0.249 s blends several unrelated things: the shell resolving PATH and spawning the process, the OS reading the binary off disk (slow on the very first run), and the trivial work of printing a version string. Worse, it is a single sample. Run it again and the OS file cache is warm, so it returns in 0.08 s. Which one is the "real" startup time?

The honest answer is that you need both, depending on intent. For scheduled jobs that hammer the same binary back to back, the warm median is closest to reality. For a tool you rarely launch, or the very first invocation after a deploy, the cold path is what bites. A bare time call draws neither distinction and takes a single sample, so it is an anecdote, not a measurement.
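To make the single-sample problem concrete, here is a minimal Python sketch that times a command several times and reports the first run next to the median. The helper names sample_command and summarize and the run count are illustrative assumptions, and the placeholder command should be swapped for the CLI you care about. It is a rough stand-in for what a real benchmarking tool does, not a replacement for one.

```python
import statistics
import subprocess
import sys
import time


def sample_command(argv, runs=10):
    """Run argv `runs` times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard output so terminal rendering does not leak into the timing.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=True)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (first, median, slowest) so a first-run penalty is visible."""
    return durations[0], statistics.median(durations), max(durations)


if __name__ == "__main__":
    # Assumption: placeholder command; replace with ["antigravity", "--version"].
    argv = [sys.executable, "-c", "pass"]
    first, median, slowest = summarize(sample_command(argv))
    print(f"first {first * 1000:.1f} ms, median {median * 1000:.1f} ms, "
          f"max {slowest * 1000:.1f} ms")
```

If the first value sits well above the median, a one-off time call would have reported the cache-cold outlier as if it were the typical case.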

Capturing the warm distribution with hyperfine

That is what hyperfine is for. It is a Rust benchmarking tool that handles warmup runs, repeated sampling, statistics, and outlier warnings for you.

# macOS
brew install hyperfine
# Debian/Ubuntu
sudo apt install hyperfine

Start with warm startup latency:

hyperfine --warmup 5 --runs 50 \
  'antigravity --version' \
  'gemini --version'

--warmup 5 runs the command five times before measuring to warm the cache, and --runs 50 takes fifty samples. On my machine (M2 / macOS 15) the result looked like this. These are environment-dependent reference numbers — read the ratio, not the absolute values.

Benchmark 1: antigravity --version
  Time (mean ± σ):      28.4 ms ±   2.1 ms
  Range (min … max):    25.9 ms …  34.7 ms    50 runs

Benchmark 2: gemini --version
  Time (mean ± σ):     186.3 ms ±   9.4 ms
  Range (min … max):   171.2 ms … 208.5 ms    50 runs

Summary
  'antigravity --version' ran 6.56 ± 0.61 times faster than 'gemini --version'

Now "feels faster" is roughly 6.5x. That is a believable gap. A Node-based CLI boots a Node runtime and loads its dependency modules on every launch; a single Go binary carries almost none of that overhead. Because we measured the trivial --version path, what we are seeing here is pure process-start cost.

Measure cold start by cooling things down first

Warm runs model back-to-back execution, but the first launch after a deploy or reboot is a different animal. Truly flushing the file cache on macOS is hard (purge is not complete), so I take the other route: make the comparison fair by giving both commands the same --prepare step, inserting an identical "pseudo-cooldown" right before each measurement.

hyperfine --runs 20 \
  --prepare 'sync; sleep 0.3' \
  'antigravity --version' \
  'gemini --version'

--prepare runs before each measured run, and its time is excluded from the result, so both tools get the same "pause first" condition. It is not a true cold start, but it removes the unfairness of one binary being warm while the other is not. One gotcha: on the first launch, macOS Gatekeeper may verify the binary and add a few hundred milliseconds. That is the OS, not the tool, so strip the quarantine attribute with xattr -d com.apple.quarantine "$(which antigravity)" before measuring and the variance settles down.

Don't confuse startup cost with the cost of actually running an agent

This is the biggest trap. Even if antigravity --version is fast, a real run like antigravity run "refactor this" is dominated by the network round trip to the model, and the fast startup barely shows. Startup latency only pays off for workloads that launch the CLI often but do light work each time.

To keep them separate, I split what I measure into three layers.

Layer 1: pure startup cost

Just --version, no network. This is where the Go binary's advantage shows most directly.

Layer 2: light local-only work

Reading config or listing slash commands — work that finishes on disk and CPU alone.

Layer 3: with a model round trip (treat as reference)

Real agent runs are network-bound, high-variance, and the startup difference drowns in noise.

# 1) pure startup cost (no network)
hyperfine --warmup 5 'antigravity --version'
 
# 2) light local-only work (config load, slash command list, etc.)
hyperfine --warmup 3 'antigravity --help'
 
# 3) with model round trip (reference; high variance, startup delta is buried)
hyperfine --runs 5 'antigravity run --quiet "say ok"'

Layer 3 has enormous variance and must never be used as evidence that "the CLI is fast or slow." Talking about startup while actually measuring model congestion is a mix-up that happens constantly. If you want to talk about startup, base it on layers 1 and 2 only.
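One way to make "too noisy to use" checkable is the coefficient of variation (stddev divided by mean), which can be computed from the mean and stddev fields in hyperfine's JSON export. The sketch below is a hedged illustration: the 0.25 cutoff and the function names are my assumptions, and the layer 3 numbers are made up to stand in for a noisy model call.

```python
def coefficient_of_variation(result):
    """Relative spread (stddev / mean) of one hyperfine result entry."""
    return result["stddev"] / result["mean"]


def usable_for_startup_claims(result, max_cv=0.25):
    """True if run-to-run spread is small enough to compare startup times.

    max_cv is an illustrative cutoff, not a hyperfine default.
    """
    return coefficient_of_variation(result) <= max_cv


if __name__ == "__main__":
    # Layer 1 uses the warm antigravity numbers from the article, in seconds.
    # The layer 3 entry is hypothetical data for a model round trip.
    layers = {
        "layer 1: --version": {"mean": 0.0284, "stddev": 0.0021},
        "layer 3: model round trip": {"mean": 2.4, "stddev": 1.1},
    }
    for name, result in layers.items():
        cv = coefficient_of_variation(result)
        verdict = "ok" if usable_for_startup_claims(result) else "reference only"
        print(f"{name}: CV {cv:.2f} -> {verdict}")
```

A benchmark whose spread is a large fraction of its mean cannot resolve a startup difference of a few tens of milliseconds, which is the practical reason to keep layer 3 out of startup comparisons.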

Estimating the cumulative cost in automation

In personal-dev automation, a per-run difference that looks small adds up with frequency. My scheduled jobs launch the CLI roughly 60 times a day across four sites once you count pre/post-build checks and log collection. If the warm difference is, say, 0.16 s (about the gap between the two warm means measured earlier, 186.3 ms versus 28.4 ms):

0.16 s x 60 runs/day = 9.6 s/day
9.6 s/day x 30 days = 288 s = 4.8 min/month

That is just under five minutes a month. Whether you read that as "rounding error" or "accumulating waste" depends on your scale. Pipelines that call the CLI many times per CI step, or hooks that launch it on every commit, push that multiplier up by an order of magnitude, and startup latency becomes a cost you can't ignore. Conversely, if you only invoke it a few times a day, the same arithmetic says the difference is practically meaningless — in that case I recommend leaving startup optimization alone and spending the time where it actually moves the needle. The practical rule here is to reason about count x delta, not about how it feels.
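The count x delta rule is easy to script so you can plug in your own volume. This is a minimal sketch: the function name is hypothetical, 0.16 s and 60 runs/day come from the arithmetic above, and the 5 and 600 runs/day rows are invented volumes for comparison.

```python
def monthly_startup_cost_minutes(delta_s, runs_per_day, days=30):
    """Cumulative startup overhead per month, in minutes (count x delta)."""
    return delta_s * runs_per_day * days / 60


if __name__ == "__main__":
    # 0.16 s and 60 runs/day are the article's figures; 5 and 600 are
    # hypothetical volumes showing how the total scales with frequency.
    for runs_per_day in (5, 60, 600):
        minutes = monthly_startup_cost_minutes(0.16, runs_per_day)
        print(f"{runs_per_day:>4} runs/day -> {minutes:.1f} min/month")
```

At 60 runs a day this reproduces the 4.8 minutes above. The same delta at ten times the volume is 48 minutes a month, and at a handful of runs a day it is well under a minute.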

Keep results as JSON and stop regressions in CI

Measuring once and moving on is not enough; you want to keep watching that an upgrade doesn't make it slower. hyperfine emits machine-readable results with --export-json.

hyperfine --warmup 5 --runs 50 \
  --export-json bench.json \
  'antigravity --version'

With that you can write a small gate that fails CI when p95 crosses a threshold. Watch the tail, not the average — what wrecks the experience is not the mean but the occasional slow run.

#!/usr/bin/env bash
# bench-gate.sh — fail (exit 1) if startup latency exceeds the budget
set -euo pipefail
 
THRESHOLD_MS=80   # treat anything above this as a regression
 
hyperfine --warmup 5 --runs 50 \
  --export-json bench.json \
  'antigravity --version' >/dev/null
 
# compute p95 from the times array (watch the tail, not the mean)
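P95_MS=$(python3 - <<'PY'
import json
import math

with open("bench.json") as f:
    times = sorted(json.load(f)["results"][0]["times"])
# nearest-rank p95: the ceil(0.95 * n)-th smallest sample (1-based), in ms
p95 = times[math.ceil(len(times) * 0.95) - 1] * 1000
print(round(p95, 1))
PY
)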
 
echo "p95 startup: ${P95_MS} ms (threshold ${THRESHOLD_MS} ms)"
if awk "BEGIN { exit !(${P95_MS} > ${THRESHOLD_MS}) }"; then
  echo "🛑 startup latency regressed"
  exit 1
fi
echo "✅ startup latency within budget"

Drop this gate into whatever workflow updates the CLI and you will automatically catch the day startup quietly got heavier. Start the threshold comfortably above your environment's warm median (I use about 2.5x the median) and tighten it as you go.
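As a rough sketch of how that starting point could be derived from the times array hyperfine exports, the helpers below compute a nearest-rank p95 and a budget of 2.5x the warm median. The function names and the synthetic sample are assumptions for illustration; substitute the times list from your own bench.json.

```python
import math
import statistics


def p95_ms(times_s):
    """Nearest-rank 95th percentile of a list of seconds, in milliseconds."""
    ordered = sorted(times_s)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1] * 1000


def starting_threshold_ms(times_s, factor=2.5):
    """Initial budget: factor x the warm median, in milliseconds."""
    return statistics.median(times_s) * 1000 * factor


if __name__ == "__main__":
    # Synthetic stand-in for the "times" list of results[0] in bench.json.
    times = [0.026 + 0.0002 * i for i in range(50)]
    print(f"p95 {p95_ms(times):.1f} ms, "
          f"suggested threshold {starting_threshold_ms(times):.1f} ms")
```

If the p95 on a healthy run already sits close to the suggested threshold, the gate will flap, so widen the factor first and tighten it only once you have a few weeks of stable results.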

Run hyperfine --warmup 5 'antigravity --version' once on your own machine first and get your own number. Once a feeling becomes a figure, it gets surprisingly easy to see clearly where speed is worth chasing — and whether it is worth chasing at all.
