Agents & Manager / 2026-06-22 / Advanced

Stop Hard-Coding Your Agent Concurrency: Let It Tune Itself From What It Observes

When you run several Antigravity 2.0 agents in parallel, a single fixed concurrency number is wrong twice: it runs into 429s during the day and leaves capacity idle at night. Here is an adaptive design borrowed from TCP congestion control (additive increase, multiplicative decrease) that adjusts your concurrency from observed signals, with working Python and field notes.

The parallel agents you run quietly overnight start spitting 429 Too Many Requests the moment your own morning work overlaps with them. If you picked one fixed number for concurrency, this is the path you almost always walk.

A fixed number is wrong in both directions. Set it low and your agents queue up and idle at 3am when nobody else is around. Set it high and they fight you for quota at midday while you are in a conversation on the same account, and the downstream API jams. Whichever way you lean, you are wrong during the other half of the day.

This article is about a design that does not hold a concurrency limit but moves it — observing, then adjusting. The setting is Antigravity 2.0 parallel agents and the Managed Agents API, but the idea ports to any agent execution layer unchanged.

A Fixed Concurrency Number Is Usually Wrong

Let me put the failure structure down concretely. Suppose you pin concurrency at 6 for some workload.

Time	Downstream headroom	Result at concurrency 6
Late night (you are away)	Large	Wastes capacity. 10 or 12 would have cleared fine
Daytime (you are also active)	Small	Shares quota with you and cascades into 429s
Right after a major update	Unknown	Latency climbs; even 6 can jam

Whichever row you optimize the constant for, it misses on the others. And the "correct" concurrency shifts by the minute, driven by things you do not control: whether you are at your desk, how much quota remains, how busy the model side is. Trying to pin a constant onto a moving target is the design error itself.

Why the Right Concurrency Never Holds Still

Antigravity 2.0 runs planning, code generation, and live browser testing across several agents, centered on Gemini 3.5 Flash. The faster Flash gets, the more requests each agent fires downstream per unit time. In other words, the smarter and faster your agents become, the lower your concurrency ceiling has to be to stay inside the same quota.
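To make that relationship concrete, here is a back-of-the-envelope sketch. With N calls in flight and an average latency of W seconds, the steady-state request rate is roughly N / W (Little's law). The quota and latencies below are made-up numbers for illustration, not measured Antigravity or Gemini figures, and the function names are hypothetical.

```python
def requests_per_minute(concurrency: int, avg_latency_s: float) -> float:
    """Steady-state request rate for a fixed number of in-flight calls."""
    return concurrency / avg_latency_s * 60


def max_concurrency(quota_per_minute: float, avg_latency_s: float) -> int:
    """Largest concurrency whose steady-state rate stays within the quota."""
    return int(quota_per_minute * avg_latency_s / 60)


# Hypothetical: a 60 requests/minute quota, with calls taking 6 s vs 3 s.
print(requests_per_minute(6, 6.0))  # 60.0  (exactly at quota)
print(requests_per_minute(6, 3.0))  # 120.0 (same concurrency, double the rate)
print(max_concurrency(60, 6.0))     # 6
print(max_concurrency(60, 3.0))     # 3
```

Halving the latency halves the concurrency the same quota can carry, which is why a constant tuned for a slower model stops fitting once the model speeds up.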

The Managed Agents API makes this sharper. A single call spins up an agent in an isolated environment. Because launching is so cheap, it is easy to raise your parallelism without noticing and bump your head on a shared quota ceiling.

Here is where an idea from networking, in use for decades, earns its keep. TCP does not know the link bandwidth in advance. It nudges its send rate up little by little, and the instant it sees a "congestion signal" — a lost packet — it cuts back hard. Repeat, and it tracks the real bandwidth continuously. Agent concurrency can be treated the same way: as a quantity you follow by observation, not one you declare.

Borrowing From TCP: Additive Increase, Multiplicative Decrease

The core borrowing is AIMD — Additive Increase, Multiplicative Decrease.

Additive increase: while successes keep coming, add to the concurrency slowly, in a fixed step.
Multiplicative decrease: the moment you see a congestion signal, cut it by a factor (halve it, say).

Cautious going up, decisive coming down. That asymmetry is the whole point. Congestion means you are starting to break something downstream, so easing down linearly arrives too late. Going up, by contrast, is only a probe for headroom — there is no need to hurry.

import time
from dataclasses import dataclass, field
 
 
@dataclass
class AIMDController:
    """Raises or lowers concurrency in response to observations."""
 
    min_limit: int = 1
    max_limit: int = 12
    limit: float = 2.0            # current allowed concurrency (kept as float, floored on use)
    increase_step: float = 0.5   # added per success (additive increase)
    decrease_factor: float = 0.5 # multiplied on congestion (multiplicative decrease)
    cooldown_s: float = 8.0      # suppress re-increase right after a cut
    # -inf means "no cut yet", so early successes are never held back on
    # hosts where time.monotonic() starts near zero.
    _last_decrease: float = field(default=float("-inf"), init=False, repr=False)
 
    def on_success(self) -> None:
        # Do not grow for a while after a cut, to avoid oscillation.
        if time.monotonic() - self._last_decrease < self.cooldown_s:
            return
        self.limit = min(self.max_limit, self.limit + self.increase_step)
 
    def on_throttle(self) -> None:
        self.limit = max(self.min_limit, self.limit * self.decrease_factor)
        self._last_decrease = time.monotonic()
 
    @property
    def concurrency(self) -> int:
        return max(self.min_limit, int(self.limit))

Keeping limit as a float, not an int, lets increase_step be smaller than 1. You can grow by one slot only every two successes — a slow climb that calms the thrashing you otherwise get near the ceiling.
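To see the sawtooth this rule produces, the sketch below replays a sequence of outcomes through a standalone version of the same update. It is a simplification: it uses the default numbers from the controller above and ignores the cooldown, and `aimd_step` and `trace` are hypothetical helper names, not part of the controller.

```python
def aimd_step(limit: float, throttled: bool, *, step: float = 0.5,
              factor: float = 0.5, lo: float = 1.0, hi: float = 12.0) -> float:
    """One AIMD update: add `step` on success, multiply by `factor` on congestion."""
    if throttled:
        return max(lo, limit * factor)
    return min(hi, limit + step)


def trace(outcomes: list[bool], start: float = 2.0) -> list[int]:
    """Return the usable (floored) concurrency after each outcome."""
    limit, seen = start, []
    for throttled in outcomes:
        limit = aimd_step(limit, throttled)
        seen.append(int(limit))
    return seen


# Eight successes, one 429, then four more successes.
outcomes = [False] * 8 + [True] + [False] * 4
print(trace(outcomes))
# [2, 3, 3, 4, 4, 5, 5, 6, 3, 3, 4, 4, 5]
```

The climb from 2 to 6 takes eight successes, while a single 429 undoes half of it in one step.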

What Counts as a Congestion Signal

What you feed on_throttle() decides the precision of the whole design. Here are the signals worth watching, each with the reaction it calls for.

Signal	Meaning	Reaction
HTTP 429 / quota-exceeded error	Downstream explicitly says "no more"	Multiplicative decrease at once
Quota-remaining warning header	Approaching the wall	Stop increasing (do not cut)
Sustained latency rise	Congestion is forming	Ease down, or freeze the increase
Timeouts / dropped connections	Congestion turned into real harm	Cut, and spend retry budget

Watching only the top row — the 429 — already works as a first step. Reach for latency only after you add a moving average so noise does not whip you around. Halving concurrency on one slow response tips you back into wasting capacity.
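One way to add that smoothing is an exponentially weighted moving average (EWMA) compared against a baseline. This is a minimal sketch under assumptions: `LatencyMonitor`, the 0.2 weight, and the 2x threshold are illustrative choices, and seeding the baseline from the first sample is the simplest possible option rather than a recommendation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LatencyMonitor:
    """Flags congestion only when smoothed latency rises well above a baseline."""

    alpha: float = 0.2   # EWMA weight of the newest sample (assumed value)
    ratio: float = 2.0   # "congested" once the EWMA exceeds ratio x baseline
    baseline_s: Optional[float] = None
    ewma_s: Optional[float] = None

    def observe(self, latency_s: float) -> bool:
        """Record one latency sample; return True if latency is sustained-high."""
        if self.ewma_s is None:
            # The first sample seeds both the baseline and the average.
            self.baseline_s = self.ewma_s = latency_s
            return False
        self.ewma_s = self.alpha * latency_s + (1 - self.alpha) * self.ewma_s
        return self.ewma_s > self.ratio * self.baseline_s


monitor = LatencyMonitor()
monitor.observe(1.0)         # baseline: 1.0 s
print(monitor.observe(5.0))  # False: one slow response only moves the EWMA to 1.8
print(monitor.observe(5.0))  # True: a second slow response pushes it past 2.0
```

A True result is better treated as "freeze the increase" or a gentle cut than as a full multiplicative decrease, in line with the table above.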

Put a classifier just outside the call to sort each response into success or congestion.

class ThrottleError(Exception):
    """Raised when downstream clearly signals overload."""
 
 
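def classify(status_code: int, headers: dict) -> str:
    """Sort a response into "throttled", "near_limit", or "ok"."""
    if status_code in (429, 503):
        return "throttled"
    # A remaining-quota warning means "stop growing", not "shrink".
    # The header name is a placeholder; use whatever your downstream API sends.
    remaining = headers.get("x-quota-remaining")
    try:
        if remaining is not None and int(remaining) < 5:
            return "near_limit"
    except ValueError:
        pass  # unparseable header: treat it as no warning
    return "ok"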

Wiring It Into an Adaptive Scheduler

The controller's concurrency changes from moment to moment, so a fixed-size semaphore cannot express it. You need a variable gate that re-checks the current limit every time.

import asyncio
 
 
class AdaptiveScheduler:
    def __init__(self, controller: AIMDController):
        self.ctrl = controller
        self._inflight = 0
        self._cond = asyncio.Condition()
 
    async def _acquire(self) -> None:
        async with self._cond:
            # Re-read the allowed count each pass, so a shrink takes effect at once.
            while self._inflight >= self.ctrl.concurrency:
                await self._cond.wait()
            self._inflight += 1
 
    async def _release(self) -> None:
        async with self._cond:
            self._inflight -= 1
            self._cond.notify_all()
 
    async def run(self, call):
        """call() is an async function returning (status_code, headers, body)."""
        await self._acquire()
        try:
            status, headers, body = await call()
            outcome = classify(status, headers)
            if outcome == "throttled":
                self.ctrl.on_throttle()
            elif outcome == "ok":
                self.ctrl.on_success()
            # on near_limit, neither grow nor shrink
            return status, headers, body
        except (ThrottleError, asyncio.TimeoutError):
            self.ctrl.on_throttle()
            raise
        finally:
            # _release() already notifies the waiters, so they re-check the
            # current limit as soon as a slot frees up.
            await self._release()

The point is that _acquire() re-reads self.ctrl.concurrency inside the loop. The instant on_throttle() drops the ceiling from 6 to 3, any task that wants a fresh slot waits right there. Work already running is never killed. By squeezing only "how many to start next," you lower the flow without breaking what is in flight.

Lay this over the Managed Agents API launch call and the number of isolated-environment agents you spin up at once adjusts itself from observation. The same holds for local parallel agent calls.

How This Differs From Backpressure, and How to Layer Them

"Limit the concurrency" sits right next to a backpressure design built from a bounded queue and a semaphore. The two do not compete; they play different roles.

Aspect	Fixed-ceiling backpressure	Adaptive control (AIMD)
How the limit is set	A constant a human chose in advance	Followed automatically from observation
What it protects	Downstream, with a ceiling known to be safe	Throughput, by staying near the current optimum
Weak at	Tracking changing conditions	Acting as an absolute safety valve during a runaway

In practice, layering both is the sound move. Let AIMD hunt for the daily optimum, and put an "under no circumstances exceed this" absolute cap on top as a fixed semaphore. max_limit already plays that part, but if you want to rule out cost accidents, keep one more hard human-chosen ceiling outside the controller. Adaptation is for optimization; the fixed cap is for accident prevention. Separating those two purposes keeps the design clear.
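The outer cap can be as small as a fixed asyncio.Semaphore wrapped around whatever runs the call. In the sketch below, `HardCap` is a hypothetical name, and `inner` stands in for something like `lambda: scheduler.run(call)`; the demo uses a fake call so the block runs on its own.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class HardCap:
    """Fixed outer ceiling; the adaptive layer runs inside it."""

    def __init__(self, absolute_max: int):
        self._sem = asyncio.Semaphore(absolute_max)

    async def run(self, inner: Callable[[], Awaitable[T]]) -> T:
        # No matter what the adaptive controller decides, at most
        # `absolute_max` calls get past this point at once.
        async with self._sem:
            return await inner()


async def demo() -> int:
    """Launch 10 fake calls under a cap of 3 and return the peak overlap."""
    cap = HardCap(absolute_max=3)
    running = peak = 0

    async def fake_call() -> None:
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stand-in for the real agent call
        running -= 1

    await asyncio.gather(*(cap.run(fake_call) for _ in range(10)))
    return peak


print(asyncio.run(demo()))  # 3
```

Because the cap is a plain constant with no feedback path, a bug in the controller or its signals cannot raise it.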

Tuning Against Over- and Under-Throttling

A few instincts help when picking the numbers.

Parameter	Too large	Too small	Starting point
increase_step	Oscillates near the ceiling	Slow to recover	around 0.5
decrease_factor	Cut too weak, 429s persist	Over-cuts and idles	0.5 (halve)
cooldown_s	Sluggish to track change	Thrashes on rebound	5–10x median latency

Start with a decisive decrease and a modest increase. If you clear congestion quickly and surely, you can afford to idle a little and still protect downstream. As you settle in, raise increase_step to reclaim the idle slots.

One trap: combine retries and adaptive control naively, and a task that took a 429 retries immediately, takes another 429, and calls on_throttle() again, so a single congestion event cuts you twice. Leave retries to exponential backoff, and report congestion to the controller only on a task's first 429. That keeps the two from crossing wires.
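A retry wrapper that follows this rule might look like the sketch below. The names `call_with_retry` and `report_throttle` are hypothetical, and for brevity `call` is assumed to return a bare status code rather than the (status, headers, body) tuple used earlier; how you wire it to the scheduler is left open.

```python
import asyncio
import random
from typing import Awaitable, Callable


async def call_with_retry(
    call: Callable[[], Awaitable[int]],   # returns an HTTP status code
    report_throttle: Callable[[], None],  # e.g. controller.on_throttle
    *,
    max_attempts: int = 4,
    base_delay_s: float = 1.0,
) -> int:
    """Retry on 429 with backoff, reporting congestion to the controller once."""
    reported = False
    status = 0
    for attempt in range(max_attempts):
        status = await call()
        if status != 429:
            return status
        if not reported:
            # Only the first 429 for this task reaches the controller.
            report_throttle()
            reported = True
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter before the next attempt.
            delay = base_delay_s * (2 ** attempt)
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
    return status  # still 429 after the final attempt
```

With this shape, a task that hits three 429s in a row triggers one multiplicative decrease, not three.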

Field Notes

There was a stretch when, as an indie developer, I ran a pipeline that updates several sites automatically alongside the background processing for a handful of mobile apps, all in the same hours. At first concurrency was fixed, and I repeated both failures daily: a conservative value left obvious headroom unused overnight, then 429s cascaded the moment I started touching things during the day.

Once I dropped the constant and switched to a plain AIMD with the 429 as its only signal, the cascading daytime failures went quiet first. The controller saw congestion and cut on its own, so the chore of manually lowering the value whenever I sat down disappeared. At night the opposite happened: with nobody competing, it climbed slowly to the ceiling and used the slots up without my attention. I did not add any clever prediction. I just handed the machine one job: look at the result, then decide whether to grow or shrink.

You could say it is nothing more than re-applying an idea long settled in control systems to a new target, agent parallelism. The newer the thing, the better the old, well-tested principles seem to fit. If you have been worn down by a fixed value, start with a small controller that watches only the 429.

Thank you for reading.
