When the Android CLI Got 3x Faster and Cut Tokens by ~70%, the Right Move Was More Verification Per Change — Not More Parallelism
Reading that the Android CLI agent runs ~3x faster while using ~70% fewer tokens, my first instinct was to ask how many runs to parallelize. But a faster agent doesn't change how much work ships — it changes where the queue forms. This walks through why, sizes the new bottleneck (review and verification gates) with Little's Law, enforces a WIP cap with a working Python admission controller, and reinvests the freed budget into depth per change — with measured results.
When I read that the Android CLI agent completes tasks about 3x faster while using roughly 70% fewer tokens, my first instinct — as an indie developer juggling several apps and sites — was to ask how many runs I should fire in parallel. If each run is faster, surely I can get through more work.
A little hands-on time proved that instinct wrong. What actually changes when the agent gets faster is not "how much work I can produce," but "where the queue forms." When the model was the slow step, adding capacity moved things forward. The moment the model gets fast, the head of the line moves to review and the verification gates. Raise parallelism without looking there, and you don't ship more — you just miss more.
This post frames why you should not convert speed into parallelism, sizes the new bottleneck with queueing math, enforces the limit as an admission controller, and reinvests the freed token budget into verification per change — all with code and measured numbers.
What a 3x speedup and 70% token cut really change: where it clogs
How an agent setup feels is decided by where the rate-limiting step sits. When generation was slow, most of the waiting was on the model, so "faster model" and "more parallelism" paid off directly. Cut generation time to a third and the time you spend waiting on the model drops to a third as well.
The catch is that the path from a change to production is not just generation. The agent's diff has to pass verification gates (type checks, tests, lint, evals), then a human review, before it merges. Speeding up generation does not widen those later stages at the same rate. Test runtime is unchanged, and the hours a person can spend reviewing are roughly fixed per day.
So a 3x/70% improvement doesn't speed up the whole pipeline — it moves the rate-limiting step. The head of the line shifts from the model to "gates plus review." Ignore that and you accumulate changes that are generated but neither verified nor reviewed. Only the entrance got faster; the exit is the same width.
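A minimal sketch of that shift, assuming the stages run in series so that sustained throughput equals the slowest stage's rate. The stage names and the changes-per-day figures are hypothetical illustrations, not measurements from this setup.

```python
# Hypothetical stage capacities in changes/day (illustrative, not measured).
def pipeline_throughput(stage_rates: dict[str, float]) -> tuple[str, float]:
    """Return (bottleneck stage, sustained throughput) for stages run in series."""
    name = min(stage_rates, key=stage_rates.get)
    return name, stage_rates[name]


before = {"generate": 4.0, "gates": 10.0, "review": 6.0}
after = {**before, "generate": before["generate"] * 3}  # 3x faster generation

print(pipeline_throughput(before))  # ('generate', 4.0)
print(pipeline_throughput(after))   # ('review', 6.0)
```

With these assumed numbers, tripling generation lifts throughput from 4 to 6 changes/day rather than to 12, and the bottleneck label moves from "generate" to "review".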
What happens when you convert speed into parallelism (Little's Law)
To reason in numbers instead of vibes, use the most basic relationship in queueing theory — Little's Law.
L = λ × W
L … items in the system at once (here: changes in progress = WIP)
λ … throughput (items passing through the system per unit time)
W … average time an item spends in the system (lead time)
Treat the system as "generate -> verify -> review -> merge." Then λ (changes you can truly merge per week) is set by the later stages. No matter how fast generation gets, the λ that verification and review can absorb does not rise.
Increase only parallelism (WIP = L) and the math forces W (lead time) to grow. With λ fixed and L up, W = L / λ must increase. That is exactly the state of "lots in flight, but each item takes longer and longer to come out." I reproduced it myself: right after switching to a faster model I doubled parallelism, and the days-to-merge for a given change actually got longer.
The conclusion is simple. Unless you raise the later-stage throughput λ, more WIP does not increase the number of changes delivered. The speedup should go somewhere other than parallelism.
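The relationship can be checked with a few lines of arithmetic. This is a sketch under one stated assumption: a later-stage throughput fixed at 6 changes/day (`LAMBDA_PER_DAY` is an illustrative constant, and `lead_time_days` is a helper name introduced here).

```python
def lead_time_days(wip: int, throughput_per_day: float) -> float:
    """Little's Law rearranged: W = L / lambda."""
    if throughput_per_day <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput_per_day


LAMBDA_PER_DAY = 6.0  # assumed: later-stage throughput stays fixed

for wip in (6, 12, 18):
    days = lead_time_days(wip, LAMBDA_PER_DAY)
    print(f"WIP={wip:>2} -> lead time {days:.1f} days")
# WIP= 6 -> lead time 1.0 days
# WIP=12 -> lead time 2.0 days
# WIP=18 -> lead time 3.0 days
```

Doubling or tripling WIP at fixed λ doubles or triples the lead time; nothing extra comes out the other end.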
Set the parallelism cap from verification throughput, not from API rate limits
So how do you set the cap? Not from what the tooling allows, such as an external API rate limit, but from the narrowest later stage, which in most solo setups is human review time.
Use Little's Law directly. Pick a target lead time W_target, measure the later-stage throughput λ, and the admissible WIP cap falls out.
WIP_max = λ_review × W_target
Example: if review plus gates reliably clear 6 changes/day (λ_review = 6/day)
and you target a lead time of 1 day (W_target = 1 day):
WIP_max = 6 × 1 = 6
→ No matter how fast generation is, allow at most 6 changes "in progress" at once.
It matters that λ_review is measured, not wished. I counted merged changes daily for about two weeks and took the median. Use the real value — including interrupted and off days — not the "I could do more if I tried" number. Set it optimistically and the queue overflows and you are back where you started.
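One way to turn that daily count into a cap, as a sketch. The sample counts in `DAILY_MERGED` are made up for illustration, `wip_cap` is a hypothetical helper, and rounding down (rather than `round`) is a deliberately conservative choice on my part so the cap never overshoots what the later stages absorb.

```python
import math
from statistics import median


def wip_cap(daily_merged: list[int], target_lead_time_days: float) -> int:
    """WIP_max = lambda_review x W_target, with lambda_review as the median daily merges."""
    lam = median(daily_merged)  # zero-merge days stay in the sample on purpose
    return max(1, math.floor(lam * target_lead_time_days))  # never cap below 1


# Hypothetical two weeks of merged-change counts, off days included.
DAILY_MERGED = [7, 6, 0, 5, 8, 6, 4, 6, 7, 0, 6, 9, 5, 6]

print(median(DAILY_MERGED))         # 6.0
print(wip_cap(DAILY_MERGED, 1.0))   # 6
```

Using the median instead of the mean keeps one unusually productive day from inflating the cap.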
Enforce the WIP cap with an admission controller (Python)
Once the cap is set, enforce it mechanically. Put a single admission controller just before the scheduler dispatches a task: if in-progress work has hit WIP_max, hold the new task. No clever control is needed — one gate that increments and decrements a count atomically is enough.
```python
# admission_controller.py
# Keep in-progress (WIP) changes under WIP_MAX.
# State lives in one file, updated atomically with flock (safe under concurrent starts).
import fcntl
import json
import os
import time
from contextlib import contextmanager
from dataclasses import dataclass

STATE_PATH = os.environ.get("WIP_STATE", "/var/agent/wip_state.json")


@dataclass(frozen=True)
class WipConfig:
    wip_max: int          # = round(lambda_review_per_day * W_target_days)
    stale_after_sec: int  # if still in-flight past this, reclaim as stuck


@contextmanager
def _locked_state(path: str):
    """Read/write the state file under an exclusive lock."""
    # "or '.'" keeps makedirs from failing when the path has no directory part.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        raw = os.read(fd, 1_000_000).decode() or "{}"
        state = json.loads(raw)
        state.setdefault("inflight", {})  # change_id -> started_at(epoch)
        yield state
        # Reached on normal exit from the with-block (including an early return).
        os.lseek(fd, 0, os.SEEK_SET)
        os.ftruncate(fd, 0)
        os.write(fd, json.dumps(state).encode())
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def _reap_stale(state: dict, stale_after_sec: int) -> None:
    """Drop changes stuck with no verify/review progress (keeps the count honest)."""
    now = time.time()
    for cid, started in list(state["inflight"].items()):
        if now - started > stale_after_sec:
            del state["inflight"][cid]


def try_admit(change_id: str, cfg: WipConfig) -> bool:
    """Register in-flight and return True if there is room; False if full."""
    with _locked_state(STATE_PATH) as state:
        _reap_stale(state, cfg.stale_after_sec)
        if change_id in state["inflight"]:
            return True  # idempotent: a retry never takes a second slot
        if len(state["inflight"]) >= cfg.wip_max:
            return False
        state["inflight"][change_id] = time.time()
        return True


def release(change_id: str) -> None:
    """Once merged or rejected, free the slot."""
    with _locked_state(STATE_PATH) as state:
        state["inflight"].pop(change_id, None)


if __name__ == "__main__":
    # λ_review = 6/day, W_target = 1 day → WIP_MAX = 6
    cfg = WipConfig(wip_max=6, stale_after_sec=36 * 3600)
    cid = os.environ["CHANGE_ID"]
    if try_admit(cid, cfg):
        print(f"admit {cid}")  # only now let the agent start
    else:
        print(f"defer {cid}")  # hold until a later stage frees up
        raise SystemExit(75)   # EX_TEMPFAIL: tell the scheduler to retry later
```
Three things matter. First, flock makes the update atomic, so concurrent scheduled runs never skew the count. Second, try_admit is idempotent per change_id: a retry for a change that is already in flight returns True without taking a second slot, so retries never inflate the count. Third, _reap_stale evicts changes stuck without progress after a set time; without it, a failed change holds its slot forever and healthy changes can never start.
The agent launcher runs only what try_admit allows and calls release when done. However fast generation gets, changes in progress never exceed WIP_MAX.
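A sketch of what that launcher wiring could look like. The names `launch`, `on_status`, and `TERMINAL` are hypothetical, and the `admit`, `release`, and `start_agent` callables are injected so the example runs on its own; in the real setup they would wrap `try_admit` and `release` from the controller.

```python
from typing import Callable

TERMINAL = {"merged", "rejected"}  # assumed status names for "done"


def launch(change_id: str,
           admit: Callable[[str], bool],
           start_agent: Callable[[str], None],
           release: Callable[[str], None]) -> bool:
    """Start the agent only if admission passes; return False when deferred."""
    if not admit(change_id):
        return False  # deferred: the scheduler should retry later
    try:
        start_agent(change_id)
    except Exception:
        release(change_id)  # a failed start should not hold a slot
        raise
    return True


def on_status(change_id: str, status: str,
              release: Callable[[str], None]) -> bool:
    """Free the slot only when the change reaches a terminal state."""
    if status in TERMINAL:
        release(change_id)
        return True
    return False  # still in gates or review: the slot stays occupied
```

The point of the split is that the slot is held through verification and review, not just through generation. Wiring it to the controller would look like `launch(cid, lambda c: try_admit(c, cfg), start_agent, release)`.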
Spend the freed token budget on depth, not breadth
Capping parallelism leaves you with spare speed and tokens. I chose to put the 3x/70% surplus into depth (more verification per change), not breadth (more changes). Since the later stages set how much gets delivered, breadth wouldn't move it anyway.
Concretely, per change I added:
1. A second self-review pass. Have the agent review its own diff under a separate prompt, listing mismatches against the spec and unhandled edge cases. If issues surface, rewrite before a human sees it.
2. Targeted extra evals. Beyond tests for the touched area, run one extra set of golden outputs for places that broke on similar past fixes.
3. A richer review summary. Attach the intent, the risk, and the points to check, structured so a human reads it in 30 seconds. This one actually raises λ_review.
The third is different in kind. Items 1 and 2 invest in per-change quality; item 3 invests in the later-stage throughput λ itself. Faster, more accurate review lets you raise WIP_max — one of the few moves that helps both depth and breadth.
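The self-review pass and the review summary can be sketched as follows. Everything here is hypothetical scaffolding: `self_review_loop`, `ReviewSummary`, and the `find_issues` and `rewrite` callables stand in for whatever prompts and agent calls you actually use, and the cap on rewrite rounds (`max_passes`) is an assumption of mine to keep token spend bounded.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ReviewSummary:
    intent: str
    risk: str
    checkpoints: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Plain-text summary a reviewer can scan quickly."""
        lines = [f"Intent: {self.intent}", f"Risk: {self.risk}", "Check:"]
        lines += [f"  - {c}" for c in self.checkpoints]
        return "\n".join(lines)


def self_review_loop(diff: str,
                     find_issues: Callable[[str], list[str]],
                     rewrite: Callable[[str, list[str]], str],
                     max_passes: int = 2) -> tuple[str, list[str]]:
    """Review the diff, rewrite while issues remain, stop after max_passes rewrites."""
    issues = find_issues(diff)
    passes = 0
    while issues and passes < max_passes:
        diff = rewrite(diff, issues)
        issues = find_issues(diff)
        passes += 1
    return diff, issues  # leftover issues become checkpoints for the human
```

Any issues still open after the last pass can be passed straight into `ReviewSummary.checkpoints`, so the human reviewer starts from what the agent could not resolve.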
How to measure it: delivered/week and escape rate
After changing the design, confirm with numbers, not impressions. I track just two.
delivered/week: changes actually merged. If parallelism rises but the later stages are unchanged, this should not move — test whether that prediction holds.
escape rate: the share of merged changes where a defect that review should have caught turns up afterward. Raise parallelism and thin out review, and this always worsens.
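Both metrics fall out of a simple log of merged changes. This is a sketch with a hypothetical record shape (`MergedChange`) and made-up sample data; how you decide that a defect "escaped" is up to your own triage.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MergedChange:
    change_id: str
    merged_on: date
    escaped: bool  # a defect review should have caught surfaced after merge


def delivered_per_week(changes: list[MergedChange], start: date, end: date) -> float:
    """Merged changes in [start, end], normalized to a 7-day week."""
    days = (end - start).days + 1
    merged = [c for c in changes if start <= c.merged_on <= end]
    return len(merged) * 7 / days


def escape_rate(changes: list[MergedChange]) -> float:
    """Share of merged changes with a post-merge defect; 0.0 when nothing merged."""
    if not changes:
        return 0.0
    return sum(c.escaped for c in changes) / len(changes)


# Hypothetical sample: four merges over a 14-day window, one escape.
LOG = [
    MergedChange("c1", date(2025, 1, 6), False),
    MergedChange("c2", date(2025, 1, 8), True),
    MergedChange("c3", date(2025, 1, 13), False),
    MergedChange("c4", date(2025, 1, 17), False),
]

print(delivered_per_week(LOG, date(2025, 1, 6), date(2025, 1, 19)))  # 2.0
print(escape_rate(LOG))                                              # 0.25
```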
My two-week, two-condition comparison — same later-stage capacity, one run parallelism-first and one WIP-capped with deeper verification — came out roughly like this:
| Aspect | Parallelism-first (no WIP cap) | WIP cap + depth of verification |
|---|---|---|
| Changes in progress (WIP) | fluctuates 12–18 | steady at 6 |
| delivered/week | ~28 | ~30 |
| Lead time per change (median) | ~3.2 days | ~1.1 days |
| Escape rate (post-merge rework) | ~14% | ~5% |
| Tokens per change | baseline | ~0.5 of baseline even with the second pass |
delivered/week barely moved, as predicted, because the later stages are the constraint. Meanwhile lead time fell to about a third (~3.2 to ~1.1 days) and escape rate to roughly a third (~14% to ~5%). Tokens stayed at about half the baseline even after adding the second self-review. In short, routing the speed-and-token surplus into depth kept output steady while improving both speed (lead time) and certainty (escape rate) at once.
These are one environment's numbers and shift with your later-stage throughput and change difficulty. What matters is less the absolute values than whether the structure — "parallelism-first doesn't grow delivered but does grow risk" — reproduces in your own data.
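If you rerun the comparison on your own data, the before/after ratios are the part worth computing rather than eyeballing. A small sketch using the approximate figures reported above (they carry a "~" in the comparison, so treat the ratios as rough); `ratio` and `RESULTS` are names introduced here.

```python
def ratio(before: float, after: float) -> float:
    """After/before ratio; below 1.0 means the metric went down."""
    return after / before


# (parallelism-first, WIP cap + depth), approximate values from the comparison.
RESULTS = {
    "lead_time_days": (3.2, 1.1),
    "escape_rate": (0.14, 0.05),
    "delivered_per_week": (28, 30),
}

for name, (before, after) in RESULTS.items():
    print(f"{name}: x{ratio(before, after):.2f}")
# lead_time_days: x0.34
# escape_rate: x0.36
# delivered_per_week: x1.07
```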
The order to roll this out (your next move)
If you start tomorrow, begin with measurement. For two weeks, count the changes you actually merge each day, take the median for λ_review, and compute WIP_max = λ_review × W_target. Put it in the admission controller to cap parallelism mechanically, and route the freed speed and tokens into "a second self-review pass per change." Reverse the order by raising parallelism first, and you'll be scrambling to add a cap only after the queue has already stretched.
When a faster tool lands, I've come to believe the first question is not "how many can I run at once" but "where is the narrowest stage right now." Speed the entrance without widening the narrow part and the line only grows. Speed belongs not in parallelism but in making each change certain — and that judgment is what quietly holds up unattended operations.
Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.