Stop Hard-Coding Your Agent Concurrency: Let It Tune Itself From What It Observes
When you run several Antigravity 2.0 agents in parallel, a single fixed concurrency number is wrong twice: it stalls at 429s during the day and idles capacity at night. Here is an adaptive design borrowed from TCP congestion control — additive increase, multiplicative decrease — that moves your concurrency from observed signals, with working Python and field notes.
The parallel agents you run quietly overnight start spitting 429 Too Many Requests the moment your own morning work overlaps with them. If you picked one fixed number for concurrency, this is the path you almost always walk.
A fixed number is wrong in both directions. Set it low and your agents queue up and idle at 3am when nobody else is around. Set it high and they fight you for quota at midday while you are in a conversation on the same account, and the downstream API jams. Whichever way you lean, you are wrong during the other half of the day.
This article is about a design that does not pin a concurrency limit but keeps moving it: observe, then adjust. The setting is Antigravity 2.0 parallel agents and the Managed Agents API, but the idea ports to any agent execution layer unchanged.
A Fixed Concurrency Number Is Usually Wrong
Let me put the failure structure down concretely. Suppose you pin concurrency at 6 for some workload.
| Time | Downstream headroom | Result at concurrency 6 |
| --- | --- | --- |
| Late night (you are away) | Large | Wastes capacity. 10 or 12 would have cleared fine |
| Evening (you are also active) | Small | Shares quota with you and cascades into 429s |
| Right after a major update | Unknown | Latency climbs; even 6 can jam |
Whichever row you optimize the constant for, it misses on the others. And the "correct" concurrency shifts by the minute, driven by things you do not control: whether you are at your desk, how much quota remains, how busy the model side is. Trying to pin a constant onto a moving target is the design error itself.
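To make the mismatch concrete, here is a small sketch of the arithmetic behind the table. The headroom figures are hypothetical values picked only for illustration, not measurements, and `mismatch` is a helper name invented for this example.

```python
# Hypothetical headroom per period, chosen only to illustrate the table above.
HEADROOM = {"late night": 12, "evening": 3, "after update": 5}


def mismatch(fixed: int, headroom: int) -> tuple[int, int]:
    """Return (idle_slots, excess_requests) for one period at a fixed concurrency."""
    idle = max(0, headroom - fixed)    # capacity left unused
    excess = max(0, fixed - headroom)  # requests pushed past what downstream can take
    return idle, excess


for period, room in HEADROOM.items():
    idle, excess = mismatch(6, room)
    print(f"{period:>12}: idle={idle} excess={excess}")
```

Whatever constant you pass as `fixed`, at least one period ends up with a nonzero `idle` or `excess` as long as the headroom values differ.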
Why the Right Concurrency Never Holds Still
Antigravity 2.0 runs planning, code generation, and live browser testing across several agents, centered on Gemini 3.5 Flash. The faster Flash gets, the more requests each agent fires downstream per unit time. In other words, the smarter and faster your agents become, the fewer of them it takes to saturate the same quota, so the safe concurrency ceiling gets lower.
The Managed Agents API makes this sharper. A single call spins up an agent in an isolated environment. Because launching is so cheap, it is easy to raise your parallelism without noticing and bump your head on a shared quota ceiling.
Here is where an idea from networking, in use for decades, earns its keep. TCP does not know the link bandwidth in advance. It nudges its send rate up little by little, and the instant it sees a "congestion signal" — a lost packet — it cuts back hard. Repeat, and it tracks the real bandwidth continuously. Agent concurrency can be treated the same way: as a quantity you follow by observation, not one you declare.
Borrowing From TCP: Additive Increase, Multiplicative Decrease
The core borrowing is AIMD — Additive Increase, Multiplicative Decrease.
Additive increase: while successes keep coming, add to the concurrency slowly, in a fixed step.
Multiplicative decrease: the moment you see a congestion signal, cut it by a factor (halve it, say).
Cautious going up, decisive coming down. That asymmetry is the whole point. Congestion means you are starting to break something downstream, so easing down linearly arrives too late. Going up, by contrast, is only a probe for headroom — there is no need to hurry.
```python
import time
from dataclasses import dataclass, field


@dataclass
class AIMDController:
    """Raises or lowers concurrency in response to observations."""

    min_limit: int = 1
    max_limit: int = 12
    limit: float = 2.0            # current allowed concurrency (kept as float, truncated to int on use)
    increase_step: float = 0.5    # added per success (additive increase)
    decrease_factor: float = 0.5  # multiplied on congestion (multiplicative decrease)
    cooldown_s: float = 8.0       # suppress re-increase right after a cut
    # -inf means "never cut yet", so the cooldown cannot block growth at startup
    # regardless of where the monotonic clock happens to start.
    _last_decrease: float = field(default=float("-inf"), repr=False)

    def on_success(self) -> None:
        # Do not grow for a while after a cut, to avoid oscillation.
        if time.monotonic() - self._last_decrease < self.cooldown_s:
            return
        self.limit = min(self.max_limit, self.limit + self.increase_step)

    def on_throttle(self) -> None:
        self.limit = max(self.min_limit, self.limit * self.decrease_factor)
        self._last_decrease = time.monotonic()

    @property
    def concurrency(self) -> int:
        return max(self.min_limit, int(self.limit))
```
Keeping limit as a float, not an int, lets increase_step be smaller than 1. You can grow by one slot only every two successes — a slow climb that calms the thrashing you otherwise get near the ceiling.
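A short trace shows the sawtooth this produces. The sketch below restates the update rule as a standalone function (`aimd_step` is a name invented here) and leaves out the cooldown for brevity, so it illustrates the arithmetic only, not the full controller.

```python
def aimd_step(limit: float, throttled: bool, step: float = 0.5,
              factor: float = 0.5, lo: int = 1, hi: int = 12) -> float:
    """One AIMD update: add `step` on success, multiply by `factor` on congestion."""
    if throttled:
        return max(lo, limit * factor)
    return min(hi, limit + step)


limit = 2.0
trace = []
# Eight successes, one 429, then four more successes.
for throttled in [False] * 8 + [True] + [False] * 4:
    limit = aimd_step(limit, throttled)
    trace.append((limit, int(limit)))  # (float limit, usable integer concurrency)

print(trace)
```

With these defaults the usable concurrency climbs 2, 3, 3, 4, 4, 5, 5, 6, falls to 3 on the 429, then resumes the same slow climb. Each slot takes two successes to earn and one congestion signal to lose half of them.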
What Counts as a Congestion Signal
What you feed on_throttle() decides the precision of the whole design. Here are the signals, strongest first.
| Signal | Meaning | Reaction |
| --- | --- | --- |
| HTTP 429 / quota-exceeded error | Downstream explicitly says "no more" | Multiplicative decrease at once |
| Quota-remaining warning header | Approaching the wall | Stop increasing (do not cut) |
| Sustained latency rise | Congestion is forming | Ease down, or freeze the increase |
| Timeouts / dropped connections | Congestion turned into real harm | Cut, and spend retry budget |
Watching only the top row — the 429 — already works as a first step. Reach for latency only after you add a moving average so noise does not whip you around. Halving concurrency on one slow response tips you back into wasting capacity.
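If you do bring latency in, one way to smooth it is an exponentially weighted moving average. The sketch below is an assumption-laden illustration: `LatencyMonitor`, the `alpha` weight, and the 1.5x `rise_ratio` threshold are all invented for this example and would need tuning against your own traffic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LatencyMonitor:
    """Smooths latency with an exponentially weighted moving average (EWMA)."""

    alpha: float = 0.2       # weight of the newest sample (assumed value)
    rise_ratio: float = 1.5  # "elevated" when the EWMA exceeds baseline by this factor (assumed)
    baseline: Optional[float] = None  # lowest smoothed latency seen so far
    ewma: Optional[float] = None

    def observe(self, latency_s: float) -> bool:
        """Record one sample; return True when smoothed latency looks elevated."""
        if self.ewma is None:
            self.ewma = self.baseline = latency_s
            return False
        self.ewma = self.alpha * latency_s + (1 - self.alpha) * self.ewma
        self.baseline = min(self.baseline, self.ewma)
        return self.ewma > self.rise_ratio * self.baseline


monitor = LatencyMonitor()
for sample in [1.0, 1.0, 1.0, 1.0, 1.0]:
    monitor.observe(sample)
single_spike = monitor.observe(3.0)                          # one slow response
sustained = [monitor.observe(3.0) for _ in range(3)][-1]     # slowness continues
```

A single 3-second response after a run of 1-second responses leaves `single_spike` False, while continued slowness flips `sustained` to True. Whether that result feeds a gentle decrease or merely freezes the increase is a separate policy choice.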
Put a classifier just outside the call to sort each response into success or congestion.
```python
class ThrottleError(Exception):
    """Raised when downstream clearly signals overload."""


def classify(status_code: int, headers: dict) -> str:
    if status_code in (429, 503):
        return "throttled"
    # A remaining-quota warning means "stop growing", not "shrink".
    remaining = headers.get("x-quota-remaining")
    if remaining is not None and int(remaining) < 5:
        return "near_limit"
    return "ok"
```
Wiring It Into an Adaptive Scheduler
The controller's concurrency changes from moment to moment, so a fixed-size semaphore cannot express it. You need a variable gate that re-checks the current limit every time.
```python
import asyncio


class AdaptiveScheduler:
    def __init__(self, controller: AIMDController):
        self.ctrl = controller
        self._inflight = 0
        self._cond = asyncio.Condition()

    async def _acquire(self) -> None:
        async with self._cond:
            # Re-read the allowed count each pass, so a shrink takes effect at once.
            while self._inflight >= self.ctrl.concurrency:
                await self._cond.wait()
            self._inflight += 1

    async def _release(self) -> None:
        async with self._cond:
            self._inflight -= 1
            # Wake every waiter so each re-evaluates against the current limit,
            # which may have moved while this call was in flight.
            self._cond.notify_all()

    async def run(self, call):
        """call() is an async function returning (status_code, headers, body)."""
        await self._acquire()
        try:
            status, headers, body = await call()
            outcome = classify(status, headers)
            if outcome == "throttled":
                self.ctrl.on_throttle()
            elif outcome == "ok":
                self.ctrl.on_success()
            # on near_limit, neither grow nor shrink
            return status, headers, body
        except (ThrottleError, asyncio.TimeoutError):
            self.ctrl.on_throttle()
            raise
        finally:
            # _release() already notifies waiters, so no second notify is needed.
            await self._release()
```
The point is that _acquire() re-reads self.ctrl.concurrency inside the loop. The instant on_throttle() drops the ceiling from 6 to 3, any task that wants a fresh slot waits right there. Work already running is never killed. By squeezing only "how many to start next," you lower the flow without breaking what is in flight.
Lay this over the Managed Agents API launch call and the number of isolated-environment agents you spin up at once adjusts itself from observation. The same holds for local parallel agent calls.
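The "squeeze only new starts" behavior can be checked in isolation. The following sketch is a stripped-down gate with a plain mutable `limit` instead of a controller. `VariableGate` and `fake_launch` are hypothetical names, and `fake_launch` merely sleeps where a real agent launch call would go.

```python
import asyncio


class VariableGate:
    """Admits tasks while the in-flight count is below a limit that may change at runtime."""

    def __init__(self, limit: int):
        self.limit = limit
        self.inflight = 0
        self.peak = 0
        self._cond = asyncio.Condition()

    async def __aenter__(self):
        async with self._cond:
            while self.inflight >= self.limit:  # re-read the limit on every wakeup
                await self._cond.wait()
            self.inflight += 1
            self.peak = max(self.peak, self.inflight)

    async def __aexit__(self, *exc):
        async with self._cond:
            self.inflight -= 1
            self._cond.notify_all()


async def fake_launch(gate: VariableGate, i: int, log: list) -> None:
    async with gate:
        log.append((i, gate.inflight))  # in-flight count at the moment task i starts
        await asyncio.sleep(0.01)       # stand-in for the real agent call


async def demo():
    gate = VariableGate(limit=4)
    log: list = []
    tasks = [asyncio.create_task(fake_launch(gate, i, log)) for i in range(12)]
    await asyncio.sleep(0.005)  # the first four tasks are now running
    gate.limit = 2              # simulate a multiplicative cut from 4 to 2
    await asyncio.gather(*tasks)
    return gate.peak, log
```

Running `asyncio.run(demo())` should show the first four tasks finishing normally after the cut, while every task admitted later starts with at most two in flight.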
How This Differs From Backpressure, and How to Layer Them
"Limit the concurrency" sits right next to a backpressure design built from a bounded queue and a semaphore. The two do not compete; they play different roles.
| Aspect | Fixed-ceiling backpressure | Adaptive control (AIMD) |
| --- | --- | --- |
| How the limit is set | A constant a human chose in advance | Followed automatically from observation |
| What it protects | A ceiling that surely won't break downstream | Approach to the moment's optimum |
| Weak at | Tracking changing conditions | Being an absolute safety valve under runaway |
In practice, layering both is the sound move. Let AIMD hunt for the daily optimum, and put an "under no circumstances exceed this" absolute cap on top as a fixed semaphore. max_limit already plays that part, but if you want to rule out cost accidents, keep one more hard human-chosen ceiling outside the controller. Adaptation is for optimization; the fixed cap is for accident prevention. Separating those two purposes keeps the design clear.
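One way to sketch the outer cap is a plain `asyncio.Semaphore` wrapped around whatever the adaptive layer does. `run_capped` and `fake_job` are hypothetical names; in a real setup each job would be a call that goes through the adaptive scheduler.

```python
import asyncio


async def run_capped(jobs, hard_cap: int):
    """Run all jobs, never more than `hard_cap` at once, whatever the adaptive layer allows."""
    sem = asyncio.Semaphore(hard_cap)
    running = 0
    peak = 0

    async def guarded(job):
        nonlocal running, peak
        async with sem:  # fixed outer cap: accident prevention
            running += 1
            peak = max(peak, running)
            try:
                return await job()  # the adaptive scheduler would sit inside this call
            finally:
                running -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_job():
    await asyncio.sleep(0.001)  # stand-in for real work
    return "done"
```

Because the semaphore knows nothing about the controller, a bug or a bad parameter in the adaptive layer cannot push the number of running jobs past `hard_cap`.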
Tuning Against Over- and Under-Throttling
A few instincts help when picking the numbers.
| Parameter | Too large | Too small | Starting point |
| --- | --- | --- | --- |
| increase_step | Oscillates near the ceiling | Slow to recover | around 0.5 |
| decrease_factor | Cut too weak, 429s persist | Over-cuts and idles | 0.5 (halve) |
| cooldown_s | Sluggish to track change | Thrashes on rebound | 5–10x median latency |
Start with a decisive decrease and a modest increase. If you clear congestion quickly and surely, you can afford to idle a little and still protect downstream. As you settle in, raise increase_step to reclaim the idle slots.
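The cooldown starting point from the table can be derived from latencies you have already logged. This is a minimal sketch: `suggest_cooldown` is a helper name invented here, and the default multiple of 8 is simply a value inside the 5–10x range, not a recommendation from any library.

```python
import statistics


def suggest_cooldown(latencies_s: list[float], multiple: float = 8.0) -> float:
    """Return a cooldown_s starting point as a multiple of the median observed latency."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    return multiple * statistics.median(latencies_s)


# Example: recent call latencies in seconds (illustrative numbers).
print(suggest_cooldown([0.8, 1.0, 1.2]))
```

Using the median rather than the mean keeps a few slow outliers from stretching the cooldown.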
One trap: if you combine retries and adaptive control naively, a task that took a 429 retries immediately, takes another 429, and calls on_throttle() again, so a single congestion event cuts you twice. Leave retries to exponential backoff, and report congestion to the controller only on a task's first 429. That keeps the two from crossing wires.
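A retry wrapper along these lines keeps the two concerns apart. It is a sketch under assumptions: `call_with_retry` is a name invented here, `ThrottleError` is redefined so the block stands alone, and the backoff uses full jitter, which is one common choice rather than the only one.

```python
import asyncio
import random


class ThrottleError(Exception):
    """Raised by a call when downstream answers with a 429 (redefined here for a standalone sketch)."""


async def call_with_retry(call, on_throttle, max_attempts: int = 4,
                          base_delay_s: float = 0.5, sleep=asyncio.sleep):
    """Retry `call` with exponential backoff, reporting congestion to the controller only once."""
    reported = False
    for attempt in range(max_attempts):
        try:
            return await call()
        except ThrottleError:
            if not reported:
                on_throttle()  # a single cut per task, however many retries follow
                reported = True
            if attempt == max_attempts - 1:
                raise  # retry budget spent
            # Exponential backoff with full jitter: wait between 0 and base * 2^attempt.
            await sleep(random.uniform(0, base_delay_s * 2 ** attempt))
```

Here `on_throttle` would be the controller's method, and `call` must raise `ThrottleError` itself rather than route through a layer that already reports to the controller, or the double cut comes back.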
Field Notes
There was a stretch when, as an indie developer, I ran a pipeline that updates several sites automatically alongside the background processing for a handful of mobile apps, all in the same hours. At first concurrency was fixed, and I repeated both failures daily: a conservative value left obvious headroom unused overnight, then 429s cascaded the moment I started touching things during the day.
Once I dropped the constant and switched to a plain AIMD with the 429 as its only signal, the cascading evening failures went quiet first. The controller saw congestion and cut on its own, so the chore of manually lowering the value whenever I sat down disappeared. At night the opposite happened: with nobody competing, it climbed slowly to the ceiling and used the slots up without my attention. I did not add any clever prediction. I just handed the machine one thing: look at the result, then decide whether to grow or shrink.
You could say it is nothing more than re-applying an idea long settled in control systems to a new target, agent parallelism. The newer the thing, the better the old, well-tested principles seem to fit. If you have been worn down by a fixed value, start with a small controller that watches only the 429.
Thank you for reading.
Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.