Agents & Manager / 2026-06-16 / Advanced

Bundling Nightly Tasks With agy Async Jobs — A fan-out, poll, join Design

The Go-based Antigravity CLI (agy) can detach jobs and run them asynchronously. Here is a fan-out, poll, join design for firing many long-running tasks at once, collecting their job IDs, and waiting for completion — drawn from an actual nightly batch.

When I ran the nightly batch serially, the single slowest job set the pace for everything. Article generation, link audits, an AdMob mediation comparison report, multilingual screenshot swaps — all independent work, yet nothing started until the previous job finished. Added up, it sometimes ran until morning.

When Gemini CLI shut down on June 18 and I moved to the Go-based Antigravity CLI (agy), I took the chance to rethink that serial structure itself. agy can detach a job and run it asynchronously. That means you can build a flow of "fire it, remember the ID, join on it later."

This is how I designed that fan-out (firing all at once), poll (checking state), and join (waiting) so it holds up for solo operations. No heavy parallel framework — just the job IDs agy returns and a small shell that bundles them.

First, put numbers on the serial bottleneck

I started by quantifying what I was fixing. The nightly batch is 12 jobs. Run serially, the plain sum of each job's wall time becomes the total.

In my setup, the sum of all 12 averaged about 214 minutes. Yet each job's CPU and network utilization was low; waiting dominated. Waiting on LLM responses, sitting between API rate-limit windows, waiting for a git push to complete — all time where the machine is idle but cannot move on.

Fire them asynchronously in parallel and the total approaches "the slowest single job plus a little overhead." In practice 214 minutes became about 79 minutes — roughly a 63% reduction. The key point is that I did not make any job faster; I only overlapped the waiting.
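
The reduction figure is plain arithmetic on the two wall-clock totals. As a quick check, here is a small sketch that reproduces it with integer math; reduction_pct is an illustrative helper name of my own, not part of agy.

```shell
#!/usr/bin/env bash
# reduction_pct SERIAL PARALLEL
# Prints the wall-clock reduction as a whole percentage, rounded to nearest.
# Hypothetical helper for illustration; not part of agy.
reduction_pct() {
  local serial="$1" parallel="$2"
  echo $(( ((serial - parallel) * 100 + serial / 2) / serial ))
}

reduction_pct 214 79   # prints 63
```

Feeding in your own before and after totals gives a number you can track week over week.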

What an agy async job returns

agy run has a normal mode that runs in the foreground, and a --detach mode that returns control immediately. When detached, it prints a single job ID to stdout.

# Foreground (the old way): blocks until done
agy run --task "generate article: antigravity cli async jobs" --model gemini-3.5-flash
 
# Detached: returns a job ID at once, continues in the background
JOB_ID=$(agy run --detach --json \
  --task "generate article: antigravity cli async jobs" \
  --model gemini-3.5-flash | jq -r '.job_id')
echo "submitted: $JOB_ID"

With --json, you get one machine-readable object instead of human-friendly decoration. Always pass --json when scripting. A version that scrapes the decorated output with grep breaks the moment the CLI's display changes slightly.
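
One failure mode is worth guarding even with --json: jq -r prints the literal string null when the requested field is missing, so a failed submission can quietly produce a bogus ID. A minimal sketch of a guard follows. valid_job_id is a hypothetical helper name, and it assumes only that a real ID is a non-empty string other than null.

```shell
#!/usr/bin/env bash
# valid_job_id ID
# Succeeds only when ID is non-empty and is not the literal "null"
# that `jq -r` emits for a missing field.
# Hypothetical helper for illustration; not part of agy.
valid_job_id() {
  local jid="${1:-}"
  [[ -n "$jid" && "$jid" != "null" ]]
}

# Usage sketch (assumes the agy flags shown above):
#   JOB_ID=$(agy run --detach --json --task "..." --model "..." | jq -r '.job_id')
#   valid_job_id "$JOB_ID" || { echo "submission failed" >&2; exit 1; }
```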

You read job state with agy jobs.

# State of one job
agy jobs get "$JOB_ID" --json
# => {"job_id":"j_8f3a","state":"running","exit_code":null,"started_at":"..."}
 
# List every job
agy jobs list --json

state returns one of queued / running / succeeded / failed / cancelled. exit_code carries a number only after the job ends. The join is built around these two fields.
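
Because the join hinges on whether a state is final, it can help to give that check a name. A sketch, assuming the five states listed above are the complete set; is_terminal_state is an illustrative helper, not an agy command.

```shell
#!/usr/bin/env bash
# is_terminal_state STATE
# Succeeds for states that will never change again.
# Anything else (queued, running, or an unrecognised value) counts as still pending.
is_terminal_state() {
  case "${1:-}" in
    succeeded|failed|cancelled) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage sketch:
#   is_terminal_state "$state" && echo "settled: $state"
```

Treating unknown values as pending is the conservative choice: the overall timeout still bounds the wait.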

fan-out: fire independent tasks all at once

The fan-out is simple. Hold the tasks you want in an array, launch each detached, and collect the returned job IDs into another array.

#!/usr/bin/env bash
set -euo pipefail
 
# Independent tasks to submit (vary the model per task type)
declare -a TASKS=(
  "generate:article-a|gemini-3.5-flash"
  "audit:internal-links|gemini-3.5-flash"
  "report:admob-mediation|gemini-3.5-flash"
  "i18n:store-screenshots|gemini-3.5-flash"
)
 
declare -a JOB_IDS=()
declare -A JOB_LABEL=()   # job ID -> human-readable label
 
for entry in "${TASKS[@]}"; do
  label="${entry%%|*}"
  model="${entry##*|}"
  jid=$(agy run --detach --json --task "$label" --model "$model" | jq -r '.job_id')
  JOB_IDS+=("$jid")
  JOB_LABEL["$jid"]="$label"
  echo "fan-out: $label -> $jid"
done

The point here is building the label-to-ID map (JOB_LABEL) up front. When you later report a failed job, the string j_8f3a alone tells you nothing about what broke. Binding a human-readable name at submission time makes the morning log far easier to read.

A short gap between submissions helped in practice, too. Firing 12 jobs at once during a tight rate-limit window leaves several stuck in queued for a while. Just half a second of spacing visibly reduced the variance at startup.
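
The spacing described above can be sketched as a small wrapper around the submission loop. Both function names here are hypothetical; submit_one assumes the agy run flags used earlier in this article, and a fractional gap such as 0.5 assumes a sleep that accepts fractions (GNU coreutils does).

```shell
#!/usr/bin/env bash
# submit_one LABEL MODEL
# Hypothetical wrapper; assumes the `agy run --detach --json` flags shown earlier.
submit_one() {
  local label="$1" model="$2"
  agy run --detach --json --task "$label" --model "$model" | jq -r '.job_id'
}

# fan_out_spaced GAP ENTRY...
# Submits each "label|model" entry, sleeping GAP seconds between submissions
# (not before the first one). Prints "label<TAB>job_id" per entry.
fan_out_spaced() {
  local gap="$1"; shift
  local entry label model jid first=1
  for entry in "$@"; do
    (( first )) || sleep "$gap"
    first=0
    label="${entry%%|*}"
    model="${entry##*|}"
    jid=$(submit_one "$label" "$model") || return 1
    printf '%s\t%s\n' "$label" "$jid"
  done
}

# Usage sketch:
#   fan_out_spaced 0.5 "${TASKS[@]}"
```

Keeping the submission behind submit_one also makes the loop easy to dry-run: swap in a stub that echoes a fake ID and no job is actually fired.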

poll: check state with exponential backoff

Once everything is fired, you watch state until all jobs end. A fixed polling interval forces a bad trade-off: keep it short and you waste calls on long jobs, stretch it and you notice short jobs late. I settled on an interval that grows exponentially.

poll_interval=2          # start at 2 seconds
max_interval=30          # cap at 30 seconds
deadline=$(( $(date +%s) + 3600 ))   # overall timeout: 60 minutes
 
declare -A FINAL_STATE=()
 
while :; do
  remaining=0
  for jid in "${JOB_IDS[@]}"; do
    # Do not re-query jobs that are already settled
    [[ -n "${FINAL_STATE[$jid]:-}" ]] && continue
 
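    # A failed or empty query (for example, an expired job ID) must not abort the script under set -e
    state=$(agy jobs get "$jid" --json | jq -r '.state') || state="unknown"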
    case "$state" in
      succeeded|failed|cancelled)
        FINAL_STATE["$jid"]="$state"
        echo "done: ${JOB_LABEL[$jid]} -> $state"
        ;;
      *)
        remaining=$((remaining + 1))
        ;;
    esac
  done
 
  # All settled -> leave
  [[ "$remaining" -eq 0 ]] && break
 
  # Overall timeout: cancel the rest and hand the decision to join
  if (( $(date +%s) > deadline )); then
    echo "timeout: cancelling $remaining job(s)"
    for jid in "${JOB_IDS[@]}"; do
      [[ -z "${FINAL_STATE[$jid]:-}" ]] && agy jobs cancel "$jid" >/dev/null 2>&1 || true
    done
    break
  fi
 
  sleep "$poll_interval"
  poll_interval=$(( poll_interval * 2 ))
  (( poll_interval > max_interval )) && poll_interval=$max_interval
done

Exponential backoff still needs a cap. Without one, the last check on a long job lands minutes later, creating a window where it is already done but you have not noticed. I settled on a 30-second cap. Short jobs settle in a few rounds; long jobs are watched quietly at 30-second intervals.
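
To make the capped schedule concrete, here is the doubling step pulled out as a pure function. next_interval is an illustrative name; the values 2 and 30 are the ones used in the loop above.

```shell
#!/usr/bin/env bash
# next_interval CURRENT MAX
# Doubles CURRENT and clamps the result to MAX.
next_interval() {
  local current="$1" max="$2"
  local next=$(( current * 2 ))
  (( next > max )) && next="$max"
  echo "$next"
}

# Print the first six sleep intervals, starting at 2 seconds with a 30-second cap.
interval=2
schedule=()
for _ in 1 2 3 4 5 6; do
  schedule+=("$interval")
  interval=$(next_interval "$interval" 30)
done
echo "${schedule[*]}"   # prints: 2 4 8 16 30 30
```

With these settings the first four sleeps add up to 30 seconds, and every sleep after that is the cap.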

Remembering settled jobs in FINAL_STATE and not re-querying them is a small but real win. Continuously polling the handful that finished early needlessly inflates the API calls.

join: treat timeout and partial failure as distinct

The essence of the join is the decision after the loop exits. Writing it as a binary "all succeeded or all failed" does not match reality. In practice an in-between state like "10 succeeded, 1 failed, 1 timed out" is perfectly normal.

ok=0; failed=0; timed_out=0
declare -a FAILED_LABELS=()
 
for jid in "${JOB_IDS[@]}"; do
  st="${FINAL_STATE[$jid]:-timeout}"
  case "$st" in
    succeeded) ok=$((ok+1)) ;;
    failed|cancelled)
      failed=$((failed+1))
      FAILED_LABELS+=("${JOB_LABEL[$jid]}")
      # Pull only the failed job's log and record it
      agy jobs logs "$jid" --tail 40 >> "$HOME/agy-night/failed-$jid.log" 2>&1 || true
      ;;
    timeout)
      timed_out=$((timed_out+1))
      FAILED_LABELS+=("${JOB_LABEL[$jid]} (timeout)")
      ;;
  esac
done
 
echo "join: ok=$ok failed=$failed timeout=$timed_out"
 
# Exit-code design: partial failure is 2, total wipeout is 1, all good is 0
if (( ok == ${#JOB_IDS[@]} )); then
  exit 0
elif (( ok == 0 )); then
  exit 1
else
  printf 'partial failures:\n'; printf '  - %s\n' "${FAILED_LABELS[@]}"
  exit 2
fi

I made the exit code three-tiered because I wanted the upstream scheduler and log aggregator to treat outcomes differently. A total wipeout (exit 1) is an anomaly worth an immediate alert, while a partial failure (exit 2) is usually fine for a human to review in the morning. Collapse both into one "failure" and the truly urgent wipeout gets buried under the everyday one-job-failed notification.
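
Isolating the three-tier decision as a pure function makes it testable without submitting any jobs. A sketch; join_exit_code is a hypothetical name, and the mapping mirrors the if/elif/else above.

```shell
#!/usr/bin/env bash
# join_exit_code OK TOTAL
# Prints the exit code for the batch:
#   0 = every job succeeded, 1 = none succeeded, 2 = partial failure.
# An empty batch (TOTAL=0) maps to 0, matching the original comparison order.
join_exit_code() {
  local ok="$1" total="$2"
  if (( ok == total )); then
    echo 0
  elif (( ok == 0 )); then
    echo 1
  else
    echo 2
  fi
}

# Usage sketch:
#   exit "$(join_exit_code "$ok" "${#JOB_IDS[@]}")"
```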

Pulling only the failed jobs' logs with agy jobs logs --tail into a separate file is another habit I added after operating it. Dump every job's log and there is too much to read in the morning, so you stop reading. Slice out just the few that fell over and the path to the cause gets shorter.

Gotchas I hit in production

In the first week after going async, a few things stung. I am recording them here.

First, a detached job keeps running even after the parent shell exits. That is a benefit, but if you accidentally launch the wait script twice, the same task runs twice. I guard against double launches with a lock file: a simple check that bails immediately if it cannot take $HOME/agy-night/.lock via flock.

Second, jobs that sit in queued without moving. When you hit a rate limit or quota cap, a job can linger in queued rather than going failed. I detect "jobs that have not even reached running within a set time" separately and review them by hand the next morning, distinct from timeouts.

Third, the lifetime of a job ID. agy jobs get returns state for a while after completion, but past the retention window it disappears. For long batches that cross the overall timeout mid-wait, querying later sometimes returned nothing unless I had burned the settled state into FINAL_STATE on the spot. The in-loop recording design comes straight from this.
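
The first two gotchas can be sketched as two small helpers. Both names are hypothetical. acquire_lock assumes flock(1) from util-linux is installed and uses file descriptor 9 as an arbitrary choice; is_stuck_queued assumes you recorded an epoch-second timestamp at submission, and the 900-second limit in the usage note is only an example value.

```shell
#!/usr/bin/env bash
# acquire_lock PATH
# Takes an exclusive, non-blocking lock on PATH via file descriptor 9.
# Returns non-zero if another process already holds the lock.
# The lock is released when this shell exits and fd 9 closes.
acquire_lock() {
  local lock_file="$1"
  exec 9>"$lock_file"
  flock -n 9
}

# is_stuck_queued STATE SUBMITTED_EPOCH NOW_EPOCH LIMIT_SECONDS
# Succeeds when a job is still queued more than LIMIT_SECONDS after submission.
is_stuck_queued() {
  local state="$1" submitted="$2" now="$3" limit="$4"
  [[ "$state" == "queued" ]] && (( now - submitted > limit ))
}

# Usage sketch:
#   acquire_lock "$HOME/agy-night/.lock" || { echo "already running" >&2; exit 0; }
#   is_stuck_queued "$state" "$submitted_at" "$(date +%s)" 900 && echo "stuck: $jid"
```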

Wiring the scheduler and notifications

This wait script is called once a day from an upstream scheduler. In my case I want it to run between release work for the App Store and Google Play, so it starts at a fixed time late at night.

Notifications branch on the exit code. exit 1 (total wipeout) goes to an immediate push notification, exit 2 (partial failure) goes to a digest I read in the morning, and exit 0 sends nothing — three channels. Sending nothing on full success is deliberate: if a "succeeded" notification arrives every morning, people eventually stop reading it and miss the one that actually signals trouble.
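
The routing itself is a three-way mapping, sketched below. channel_for_exit and the channel names are illustrative, and sending unexpected exit codes to the urgent channel is my assumption rather than something the setup above specifies.

```shell
#!/usr/bin/env bash
# channel_for_exit CODE
# Prints the notification channel for the wait script's exit code:
#   0 -> none (stay silent), 1 -> push (immediate), 2 -> digest (morning review).
channel_for_exit() {
  case "${1:-}" in
    0) echo "none" ;;
    1) echo "push" ;;
    2) echo "digest" ;;
    *) echo "push" ;;   # assumption: an unexpected code is treated as urgent
  esac
}

# Usage sketch (agy-night.sh is a placeholder name for the wait script):
#   ./agy-night.sh; rc=$?
#   channel=$(channel_for_exit "$rc")
```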

What paid off in production was attaching each job's duration to its failure label. I compute the gap between started_at from agy jobs get and the completion time, and print something like audit:internal-links (12m, failed). A failure whose duration is far longer than usual is the first clue to suspect a rate limit or a network anomaly. These small observations shorten the next morning's diagnosis. I recommend splitting notification granularity this far; it may look excessive at first, but it pays off the longer you operate.
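
A sketch of that label formatting follows. format_failure_label is a hypothetical helper, and it assumes both timestamps have already been converted to epoch seconds; the exact format of started_at is not shown here, so that conversion is left out.

```shell
#!/usr/bin/env bash
# format_failure_label LABEL STARTED_EPOCH ENDED_EPOCH STATE
# Prints "LABEL (Nm, STATE)" where N is the duration in whole minutes, rounded down.
format_failure_label() {
  local label="$1" started="$2" ended="$3" state="$4"
  local minutes=$(( (ended - started) / 60 ))
  printf '%s (%dm, %s)\n' "$label" "$minutes" "$state"
}

format_failure_label "audit:internal-links" 1000 1720 failed
# prints: audit:internal-links (12m, failed)
```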

Deciding how much to make async

Making everything async is not the goal. I kept tasks where a later stage depends on an earlier stage's output serial. For example, "generate an article, then link to it from other articles" cannot link until generation finishes.

What I run asynchronously is only tasks that are mutually independent and do not propagate failure. Article generation, standalone reports, asset swaps — if one falls over, the rest finish fine. For anything with dependencies, I group it by dependency into a single job and preserve order inside that job.

Going async was not a magic speedup; it was the design work of separating "the part where waiting may overlap" from "the part where order must hold." Once that line is drawn, agy run --detach plus a small wait loop makes the nightly batch remarkably quiet.

I hope this gives anyone running independent tasks overnight a useful starting point to rework their own operations.
