When Parallel Sub-Agents Fight Over One API's Rate Limit: A Shared Token Bucket That Caps the Aggregate

Run Antigravity 2.0 dynamic sub-agents in parallel and each one hits the same external API independently, pushing the aggregate rate over the limit and triggering cascades of 429s. Here is a shared token bucket that caps the aggregate proactively, with working code through a Redis version.

Once the convenience of Antigravity 2.0 dynamic sub-agents settles in — four tasks moving at once — the next wall is rarely speed. It is the external API limit. As an indie developer running several blogs on an automated pipeline, I once watched every sub-agent push a commit to GitHub at the same moment and collectively trip a 403 secondary rate limit. Running them one at a time, I had never seen that error.

The cause is simple. Line up N sub-agents and the send rate as observed from the outside becomes N times higher. Each agent may believe it is backing off politely, but to the service that owns the shared limit, N agents are simply arriving all at once.

What follows is how to solve that N-times problem not with "each agent's good manners" but with "one faucet every agent must pass through" — with working code and measured numbers. The example targets GitHub's secondary rate limit, but the same design applies anywhere a single limit is shared by multiple actors: Stripe, AdMob reporting APIs, your own backend.

Why per-agent backoff breaks under concurrency

The first thing I tried was the obvious fix: wrap each sub-agent's HTTP call in a retrier that backs off exponentially when it sees a 429, or the 403 that GitHub can return for its secondary limit.

// Looks correct, but falls apart under concurrency: per-agent backoff
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function callWithBackoff(fn: () => Promise<Response>): Promise<Response> {
  let delay = 500;
  for (let attempt = 0; attempt < 6; attempt++) {
    const res = await fn();
    if (res.status !== 429 && res.status !== 403) return res;
    await sleep(delay);     // every agent waits the same delay at nearly the same time
    delay *= 2;
  }
  throw new Error("rate limit: gave up");
}

For a single agent this works. But run six sub-agents in parallel and you get this chain:

Six fire almost simultaneously; the aggregate rate exceeds the limit
Almost simultaneously, all six receive a 429
All six wait the same initial delay (500ms)
After 500ms, all six retry simultaneously again — and all six get 429 again

This is the thundering herd. The same spike is regenerated on every retry, and backoff only widens the gap between spikes; it never flattens the spike itself. Adding jitter spreads things out a little, but that only lowers the collision probability. It is not a mechanism that guarantees the aggregate stays at or below the limit.
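
The chain above can be checked with a few lines of arithmetic. The sketch below is a simplified model, not a measurement: it assumes every agent receives its rejection at the same instant and ignores request latency, and the helper names retrySchedule and herdSizes are made up for illustration. It lists when each agent retries under the doubling delay from callWithBackoff and counts how many requests land on each instant.

```typescript
// Instants (ms after the first rejection) at which one agent retries,
// using the same doubling delay as callWithBackoff.
function retrySchedule(initialDelayMs: number, retries: number): number[] {
  const times: number[] = [];
  let t = 0;
  let delay = initialDelayMs;
  for (let i = 0; i < retries; i++) {
    t += delay;
    times.push(t);
    delay *= 2;
  }
  return times;
}

// How many requests arrive at each instant when every agent follows the same schedule.
function herdSizes(agentCount: number, initialDelayMs: number, retries: number): Map<number, number> {
  const arrivals = new Map<number, number>();
  for (let agent = 0; agent < agentCount; agent++) {
    for (const t of retrySchedule(initialDelayMs, retries)) {
      arrivals.set(t, (arrivals.get(t) ?? 0) + 1);
    }
  }
  return arrivals;
}

// Six agents, 500ms initial delay: retries land at 500ms, 1500ms and 3500ms, six requests each
console.log(herdSizes(6, 500, 3));
```

In this model the gaps between retries grow, but the height of every spike stays at six, which is the point of the paragraph above.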

So change the framing. Stop apologizing after you send (reactive); ask permission before you send (proactive). Concentrate the permission-granting into one place, and by definition the aggregate rate can never exceed that place's issue rate. That is what a token bucket is for.

A shared token bucket in a single process

A token bucket is a pail of capacity tokens refilled at refillPerSec per second; each API call consumes one token. If no token is available, you wait for a refill. The crucial part is that all sub-agents share one and the same bucket.

// shared-limiter.ts — a fair FIFO async token bucket
type Waiter = { cost: number; resolve: () => void };
 
export class TokenBucket {
  private tokens: number;
  private last: number;
  private waiters: Waiter[] = [];
  private timer: ReturnType<typeof setInterval> | null = null;
 
  constructor(
    private readonly capacity: number,    // burst allowance
    private readonly refillPerSec: number // steady-state rate (issued per second)
  ) {
    this.tokens = capacity;
    this.last = Date.now();
  }
 
  private refill(): void {
    const now = Date.now();
    const elapsed = (now - this.last) / 1000;
    if (elapsed <= 0) return;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.last = now;
  }
 
  // Always await before the call. The wait delays sending so the aggregate stays under the limit.
  async acquire(cost = 1): Promise<void> {
    if (cost > this.capacity) {
      throw new Error("cost exceeds capacity: it can never be acquired");
    }
    this.refill();
    // Take immediately only if nobody is queued (no jumping the line = fairness)
    if (this.waiters.length === 0 && this.tokens >= cost) {
      this.tokens -= cost;
      return;
    }
    return new Promise<void>((resolve) => {
      this.waiters.push({ cost, resolve });
      this.startDraining();
    });
  }
 
  private startDraining(): void {
    if (this.timer) return;
    this.timer = setInterval(() => {
      this.refill();
      // Release from the head of the queue while tokens suffice (FIFO)
      while (this.waiters.length > 0 && this.tokens >= this.waiters[0].cost) {
        const w = this.waiters.shift()!;
        this.tokens -= w.cost;
        w.resolve();
      }
      if (this.waiters.length === 0 && this.timer) {
        clearInterval(this.timer);
        this.timer = null;
      }
    }, 50); // re-evaluate refill and release every 50ms
  }
}

Using it is just a matter of slipping acquire() in immediately before the external call.

// Keep GitHub content-creating calls conservative: 1.0/sec, burst 5
const github = new TokenBucket(5, 1.0);
 
async function commitViaSubAgent(agentId: string, change: FileChange): Promise<void> {
  await github.acquire(1);          // this is where queueing happens
  await githubApi.createCommit(change);
}
 
// Six sub-agents share the same github bucket and run in parallel
await Promise.all(
  subAgents.map((a) => commitViaSubAgent(a.id, a.pendingChange))
);

Three things matter here. First, while one agent waits on acquire(), the other sub-agents keep computing, so the only throughput cost is the part that was pinned to the limit anyway. Second, the queue is FIFO, so no single agent is starved by being pushed back forever. Third, capacity represents the burst allowance, so the smaller you make it, the more you suppress instantaneous spikes.
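
To make the queueing concrete, here is a minimal sketch of the ideal grant schedule when several agents call acquire() at the same moment against a full bucket. It is a model rather than the class itself: the function name idealGrantTimes is hypothetical, every request is assumed to cost one token, capacity is assumed to be a whole number, and the 50ms polling of the real timer (which can delay each grant slightly) is ignored.

```typescript
// Time (ms) at which the k-th simultaneous request is granted by an ideal bucket
// that starts full and serves waiters in FIFO order.
function idealGrantTimes(requests: number, capacity: number, refillPerSec: number): number[] {
  const times: number[] = [];
  for (let k = 0; k < requests; k++) {
    if (k < capacity) {
      times.push(0); // covered by the initial burst allowance
    } else {
      // needs the (k - capacity + 1)-th token produced by refill after t = 0
      times.push(Math.round(((k - capacity + 1) / refillPerSec) * 1000));
    }
  }
  return times;
}

console.log(idealGrantTimes(8, 5, 1.0)); // [0, 0, 0, 0, 0, 1000, 2000, 3000]
```

With a bucket of capacity 5 refilled at 1.0/sec, six simultaneous agents see five immediate grants and one roughly a second later; anything beyond that is spaced one second apart.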

How to choose the bucket's parameters

I set refillPerSec based on "what fraction of the other side's limit to use." In production I aim for 60–70% of the limit. Leaving headroom matters because actors other than the automated sub-agents — my own manual operations, a separate CI job — eat into the same limit. When the limit is undocumented, like GitHub's secondary rate limit, I dial it back further.

capacity is "the instantaneous spike you are willing to allow." Set it to 1 and sends approach perfectly even spacing, one every 1/refillPerSec seconds; set it larger and you get "save up normally, then spend in a burst when needed." For services with little burst tolerance, keep capacity small. When the limit is not published, I have found that capping capacity at a few seconds' worth of refill (a few times refillPerSec) keeps the first spike from hurting.
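
As a rough sketch of that arithmetic, the helpers below derive refillPerSec from a per-minute limit and a headroom fraction, and compute the most requests an ideal token bucket can release in any window, which is capacity plus whatever refills during the window. The function names and the figure of 100 requests per minute are assumptions for illustration, not a published limit.

```typescript
// Steady-state rate when using only a fraction of a per-minute limit.
function refillFromLimit(limitPerMinute: number, fraction: number): number {
  return (limitPerMinute * fraction) / 60;
}

// Upper bound on sends in any window for an ideal token bucket:
// a full bucket spent at once, plus everything refilled during the window.
function maxSendsInWindow(capacity: number, refillPerSec: number, windowSec: number): number {
  return Math.floor(capacity + refillPerSec * windowSec);
}

const refillPerSec = refillFromLimit(100, 0.6);     // 1 token per second
console.log(maxSendsInWindow(5, refillPerSec, 60)); // 65
```

Under these assumed numbers a burst of 5 on top of 60 refilled tokens gives at most 65 sends in any minute, which shows how capacity eats into the headroom you left when picking the fraction.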

A verification gate: confirm the effect by observation

A correct design is meaningless unless you observe that the aggregate rate actually stays under the limit. I always wrap sends to record the actual sends per second and the count of 429/403 responses.

// metered-fetch.ts: a thin wrapper that observes aggregate rate and rejections
import { TokenBucket } from "./shared-limiter"; // path assumes both files sit side by side

let windowStart = Date.now();
let sentInWindow = 0;
let rejected = 0;
let peakRate = 0;

export async function meteredFetch(
  bucket: TokenBucket,
  input: RequestInfo,
  init?: RequestInit
): Promise<Response> {
  await bucket.acquire(1);

  // Close the previous window before counting, so each count covers under one second
  const now = Date.now();
  if (now - windowStart >= 1000) {
    peakRate = Math.max(peakRate, sentInWindow);
    console.log("[rate] sent/s=" + sentInWindow + " peak=" + peakRate + " rejected=" + rejected);
    windowStart = now;
    sentInWindow = 0;
  }
  sentInWindow++; // count at send time, not when the response arrives

  const res = await fetch(input, init);
  if (res.status === 429 || res.status === 403) rejected++;
  return res;
}

Compare this measurement before and after introducing the shared bucket and the difference is obvious at a glance. Here is what I measured on my automated pipeline, where six sub-agents push with the same GitHub token:

Metric	Per-agent backoff only	Shared token bucket
Observed peak send rate	9/sec (instantaneous spike)	steady at 5/sec or below
429/403 responses across 5 runs	31 total	0
Share of API calls that were retries	about 27%	nearly 0%
Wall-clock time to finish all tasks	4m 12s (inflated by retry waits)	3m 40s

The surprise was that waiting proactively actually finished the whole job faster. Per-agent backoff throws away hundreds of milliseconds to seconds every time it hits a 429, so the more collisions, the more the total time balloons. Capping the aggregate under the limit removes the collisions entirely, so waiting shifts from "dead time spent retrying" to "just enough queueing." It avoided the irony of parallelizing for throughput and then melting that time away on retries.
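
If you log a timestamp for every send, the peak can also be checked offline with a sliding window, which catches spikes that straddle the boundary between two fixed one-second windows. This is a minimal sketch; peakPerWindow is a hypothetical helper, not part of meteredFetch.

```typescript
// Largest number of sends that fall inside any window of windowMs milliseconds.
// A window ending at time t covers the half-open span (t - windowMs, t].
function peakPerWindow(sendTimesMs: number[], windowMs = 1000): number {
  const times = [...sendTimesMs].sort((a, b) => a - b);
  let peak = 0;
  let start = 0; // index of the oldest send still inside the window
  for (let end = 0; end < times.length; end++) {
    while (times[end] - times[start] >= windowMs) start++;
    peak = Math.max(peak, end - start + 1);
  }
  return peak;
}

console.log(peakPerWindow([0, 100, 200, 950, 1500])); // 4
```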

Extending to multiple processes and machines (Redis + Lua)

Up to here every sub-agent ran in the same process. But Antigravity 2.0 can run sub-agents independently in the background, and once you spread scheduled tasks across machines, the token bucket becomes a separate instance per process and the sharing breaks. Each process honors its limit with "its own bucket," yet to the other side it is just as many buckets issuing in parallel.

The fix is to store the bucket state (remaining tokens and the last refill timestamp) in Redis and make acquire atomic. Splitting it across multiple Redis commands would race, so collapse it into one indivisible operation with a Lua script.

-- token_bucket.lua: KEYS[1] = bucket key
-- ARGV: capacity, refillPerSec, now(ms), cost
-- Returns {allowed (0 or 1), wait_ms until the shortfall has been refilled}
local capacity = tonumber(ARGV[1])
local refill   = tonumber(ARGV[2])
local now      = tonumber(ARGV[3])
local cost     = tonumber(ARGV[4])

local state  = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1])
local ts     = tonumber(state[2])
-- first use (or expired key): start with a full bucket
if tokens == nil or ts == nil then tokens = capacity; ts = now end

-- refill by the elapsed time
local elapsed = math.max(0, (now - ts) / 1000)
tokens = math.min(capacity, tokens + elapsed * refill)

local allowed = 0
local wait_ms = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
else
  -- return the wait (ms) until the shortfall accrues
  wait_ms = math.ceil(((cost - tokens) / refill) * 1000)
end

-- HSET with several fields replaces the deprecated HMSET (Redis 4.0+)
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
-- keep the key for at least one full refill period so a drained bucket never resets early
local ttl_ms = math.max(60000, math.ceil((capacity / refill) * 1000))
redis.call('PEXPIRE', KEYS[1], ttl_ms)
return {allowed, wait_ms}

The caller sleeps for the returned wait until it is granted. Because the wait is computed server-side and returned, the client never has to guess at a backoff.
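
Because the decision is plain arithmetic, it can be mirrored in TypeScript and unit-tested without a Redis server. The sketch below is such a mirror, written on the assumption that it follows the refill and wait arithmetic of token_bucket.lua; decide and BucketState are hypothetical names, and the real state still lives in Redis.

```typescript
// null means the key does not exist yet, so the bucket starts full
type BucketState = { tokens: number; ts: number } | null;

function decide(
  state: BucketState, capacity: number, refillPerSec: number, nowMs: number, cost: number
) {
  let tokens = state ? state.tokens : capacity;
  const ts = state ? state.ts : nowMs;

  // refill by the elapsed time, never above capacity
  const elapsedSec = Math.max(0, (nowMs - ts) / 1000);
  tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);

  if (tokens >= cost) {
    return { allowed: true, waitMs: 0, next: { tokens: tokens - cost, ts: nowMs } };
  }
  // not enough tokens: report how long until the shortfall has been refilled
  const waitMs = Math.ceil(((cost - tokens) / refillPerSec) * 1000);
  return { allowed: false, waitMs, next: { tokens, ts: nowMs } };
}

const first = decide(null, 5, 1, 0, 1);                   // fresh bucket: granted, 4 tokens left
const empty = decide({ tokens: 0, ts: 0 }, 5, 1, 250, 1); // only 0.25 tokens refilled so far
console.log(first.allowed, empty.allowed, empty.waitMs);  // true false 750
```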

// distributed-limiter.ts
import { createClient } from "redis";
import { readFile } from "node:fs/promises";

const redis = createClient();
await redis.connect(); // node-redis v4+ needs an explicit connect before any command
const script = await readFile("token_bucket.lua", "utf8");

export async function acquireDistributed(
  key: string, capacity: number, refillPerSec: number, cost = 1
): Promise<void> {
  if (cost > capacity) {
    // without this guard the loop below would never be granted
    throw new Error("cost exceeds capacity: it can never be acquired");
  }
  for (;;) {
    const [allowed, waitMs] = (await redis.eval(script, {
      keys: [key],
      arguments: [String(capacity), String(refillPerSec), String(Date.now()), String(cost)],
    })) as [number, number];
    if (allowed === 1) return;
    // server-computed wait + small jitter to scatter simultaneous wake-ups
    await new Promise((r) => setTimeout(r, waitMs + Math.random() * 50));
  }
}

I hit two gotchas in production. First, time handling: if the refill math uses each client's local clock, as the Date.now() argument in the listing above does, clock skew between machines turns straight into error. Taking now from Redis's own TIME command rather than from the caller put every client on a single timeline and stabilized it. Second, jitter: the wait that Lua returns can be identical across all clients, so dropping the jitter entirely brings back simultaneous wake-ups (a small thundering herd). A few tens of milliseconds of randomness was enough to spread them out.
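
Redis answers TIME with two values, the Unix time in seconds and the microseconds within that second, so the millisecond value the script expects has to be assembled from both. A minimal sketch of that conversion follows; redisTimeToMs is a hypothetical helper, and how you obtain the reply (a raw TIME command from your client, or redis.call('TIME') inside the script if your Redis version permits it before writes) is left as an assumption.

```typescript
// Convert a TIME reply [seconds, microseconds] into Unix milliseconds.
// Values may arrive as strings or numbers depending on the client.
function redisTimeToMs(reply: [string | number, string | number]): number {
  const seconds = Number(reply[0]);
  const micros = Number(reply[1]);
  return seconds * 1000 + Math.floor(micros / 1000);
}

console.log(redisTimeToMs(["1700000000", "250000"])); // 1700000000250
```

The result would take the place of Date.now() in the arguments passed to the script, so every machine computes refill against the same clock.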

Which implementation to pick at which stage

I laid out two implementations, the in-process TokenBucket and the Redis + Lua version, but you do not need the Redis version from day one. I recommend choosing among three stages based on how your sub-agents run.

Same process, Promise.all concurrency: the standalone TokenBucket is enough. Zero dependencies; paste the code above as-is.
Separate processes on one machine, multiple terminals: move to the Redis version. With a local Redis, clock skew is barely a concern.
Multiple machines, distributed scheduling: use the Redis + Lua version and always include TIME-sourced time and jitter. Skip those and you get the hardest failure to notice: "every machine is correct, yet the aggregate goes over."

If you share more than one limit (one for GitHub, one for Stripe, one for your own API), give each its own keyed bucket. When a single agent consumes several limits, just await each acquire() serially before the real work, and they compose cleanly.
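
One way to sketch the keyed-bucket idea is a small registry that always hands back the same limiter for the same key, plus a helper that awaits several limiters in order. The names createRegistry, acquireAll and the Limiter interface are hypothetical; anything with an acquire() method shaped like TokenBucket's would fit the interface.

```typescript
interface Limiter {
  acquire(cost?: number): Promise<void>;
}

// Returns a lookup function that creates each keyed limiter once and reuses it afterwards.
function createRegistry<T extends Limiter>(factory: (key: string) => T): (key: string) => T {
  const buckets = new Map<string, T>();
  return (key) => {
    let bucket = buckets.get(key);
    if (!bucket) {
      bucket = factory(key);
      buckets.set(key, bucket);
    }
    return bucket;
  };
}

// Await every limit in order before doing the real work.
async function acquireAll(limiters: Limiter[]): Promise<void> {
  for (const limiter of limiters) {
    await limiter.acquire(1);
  }
}

// Possible usage, assuming the TokenBucket class from shared-limiter.ts:
//   const bucketFor = createRegistry(() => new TokenBucket(5, 1.0));
//   await acquireAll([bucketFor("github"), bucketFor("stripe")]);
```

The factory receives the key, so each service can get its own capacity and rate; the single shared parameters in the usage comment are only for brevity.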

Parallelism is a means to speed, but as long as you share an external limit, speed is bound by "how politely the whole fleet can send." Stop trusting each agent's goodwill and place one faucet that everyone must pass through. That single reframe freed me from chasing 429s and fine-tuning jitter. Start by slipping one TokenBucket in front of your concurrent sub-agents and watch the peak rate with meteredFetch. The moment the number flattens inside the limit is, I think, the most satisfying part.
