Don't Lose Failed Agent Jobs: Designing a Dead-Letter and Requeue Path
Scheduled agents fail silently overnight and the work simply vanishes. Here is how to catch those failures with a dead-letter store and a staged requeue, drawn from running four sites on autopilot as an indie developer.
The agent you ran overnight reports "5 succeeded, 0 failed." Then you open the sites and three of them never updated. As an indie developer, I run four blogs that update themselves every day on a staggered, off-peak schedule, and this kind of silent failure has tripped me up more times than I'd like to admit.
What's dangerous isn't the failure itself; it's the failure that disappears without a trace. Scheduled agents run when nobody is watching. The moment an exception gets swallowed and the loop still reports zero failures, that work might as well never have existed. A dead-letter store exists to make sure those "erased" failures always land in one place.
Why "0 failures" becomes a lie
The most common cause is a try/except inside the loop that eats the exception, logs nothing, and moves on. Driving an Antigravity agent from a script hits the same trap. Timeouts, rate limits, expired tokens, a momentary network blip — they all look transient, so it's tempting to think "next run will pick it up."
But if the next run faces the same condition, it fails in the same place. The dropped job is gone from the queue, with no breadcrumb to requeue it. In my case, a job that fetched AdMob reports quietly failed for three straight days over a weekend, and I only noticed on Monday when the dashboard numbers had a hole in them.
The design principle is simple: don't discard failures, divert them. That single shift changes how much you can trust the system.
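To make the contrast concrete, here is a minimal sketch of the two loop shapes. The function names and the on_failure callback are hypothetical, chosen for illustration; the only point is that the second version counts the failure and hands it somewhere instead of dropping it.

```python
def run_all_swallowing(jobs, handler):
    """Anti-pattern: the except block hides every failure."""
    succeeded = 0
    for job in jobs:
        try:
            handler(job)
            succeeded += 1
        except Exception:
            pass  # nothing logged, nothing stored: the job just vanishes
    return succeeded, 0  # reports "0 failed" no matter what happened


def run_all_diverting(jobs, handler, on_failure):
    """Same loop, but every failure is counted and handed to on_failure."""
    succeeded, failed = 0, 0
    for job in jobs:
        try:
            handler(job)
            succeeded += 1
        except Exception as exc:
            failed += 1
            on_failure(job, exc)  # e.g. append a record to the dead letter
    return succeeded, failed
```

In practice on_failure would be whatever writes to your dead-letter store; the loop itself does not need to know the storage format.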
What to store in the dead letter
The minimum you should keep is enough context to reproduce the same job on requeue. An error message alone won't cut it. Append the following fields as JSON Lines, one record per line, and later work with grep or jq becomes straightforward.
Field        Purpose
job_id       Uniquely identifies the work (used to prevent duplicate requeues)
payload      The full input needed to reproduce the job
error_class  Exception type; the key for deciding whether to requeue
attempt      How many times it has been tried so far
failed_at    Failure time (always stored in UTC)
The implementation is surprisingly short. Keeping it append-only means concurrent agents rarely corrupt lines, which makes operations easier.
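A minimal sketch of such a writer, assuming the five fields from the table, might look like this. The name to_dead_letter is the one used at the end of this article; the file path, the parameter list, and the requirement that payload be JSON-serializable are assumptions made for illustration.

```python
import json
import os
from datetime import datetime, timezone

# Assumed location; adjust to your own layout.
DEAD_LETTER_PATH = os.path.expanduser("~/agent/dead_letter.jsonl")


def to_dead_letter(job_id, payload, exc, attempt, path=DEAD_LETTER_PATH):
    """Append one failure record as a single JSON line and return it."""
    record = {
        "job_id": job_id,
        "payload": payload,                  # must be JSON-serializable
        "error_class": type(exc).__name__,   # e.g. "TimeoutError"
        "attempt": attempt,
        "failed_at": datetime.now(timezone.utc).isoformat(),  # always UTC
    }
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Build the whole line first, then append it with one write call.
    line = json.dumps(record, ensure_ascii=False) + "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
    return record
```

Called from an except block, this would look like `to_dead_letter(job["id"], job, exc, attempt=1)`. Writing each record in one call keeps interleaved lines unlikely for short records, but it is not a locking guarantee; if several agents write heavily at once, a file lock would be the safer choice.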
Recording error_class properly lets you later sort failures mechanically into "will heal if we wait" versus "a human needs to look." That distinction is the foundation of the next layer.
Treating every failure the same way leads to one of two accidents: you either burn cost retrying non-transient failures (like malformed input) forever, or you get scared and never requeue anything, losing work. I split requeue into three tiers.
Immediate retry (2–3 times, on the spot)
Reserved for clearly transient failures like a network blip or a 503. Cap the count hard at two or three, and let the interval follow a backoff. The point is not to be stubborn here — a failure that doesn't clear in three tries is a different animal.
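As a rough sketch of this tier, assuming exponential backoff with a little jitter: the helper name retry_immediately, the delay values, and the choice of built-in exception types are all illustrative assumptions, so swap in the exception classes your own client library raises.

```python
import random
import time

# Assumed set of transient exception types; extend it for your own client.
TRANSIENT_EXCEPTIONS = (TimeoutError, ConnectionError)


def retry_immediately(fn, max_tries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() up to max_tries times, backing off between tries.

    Only TRANSIENT_EXCEPTIONS are retried. Any other exception, and the last
    transient failure, propagates so the caller can dead-letter the job.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return fn()
        except TRANSIENT_EXCEPTIONS:
            if attempt == max_tries:
                raise  # hard cap reached: hand the failure to the next tier
            # 1s, 2s, 4s, ... plus up to 10% jitter so parallel jobs spread out
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
```

The sleep parameter exists only so the delay can be stubbed out in tests; in production the default time.sleep is what you want.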
Delayed requeue (pick it up next cycle)
Jobs that immediate retry couldn't fix get pulled back from the dead letter on the next run cycle. Rate limits (429) and expired tokens — failures that heal with time — belong here. Match on job_id at the entry point and skip anything already completed this cycle to avoid double execution.
Manual review (an isolation lane for humans)
Schema mismatches, missing permissions, and unexpected exception types are failures that won't heal with time, so they get pulled out of requeue and isolated. Anything whose attempt count reaches the cap (I set mine to 5) also goes to the same lane. Because only this lane needs your eyes in the morning, the review burden drops dramatically.
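    # Exception class names that usually heal if we wait.
    TRANSIENT = {"TimeoutError", "ConnectionError", "RateLimitError"}
    MAX_ATTEMPT = 5

    def classify(record: dict) -> str:
        # Too many attempts: stop retrying and ask a human.
        if record["attempt"] >= MAX_ATTEMPT:
            return "manual"
        if record["error_class"] in TRANSIENT:
            return "retry"
        # Unknown or non-transient errors go to the isolation lane.
        return "manual"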
When in doubt, ask "will this heal if I wait?" Put non-healing failures in retry and cost balloons; put healing ones in manual and human effort grows. The accuracy of this sorting directly sets how quiet your operations feel.
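One way to apply that sorting at the start of a cycle is sketched below. It repeats the classify rule so the block runs on its own. The helper name split_dead_letter, the completed_ids argument, and the "latest record per job_id wins" rule are assumptions for illustration, not a description of the original pipeline.

```python
import json

TRANSIENT = {"TimeoutError", "ConnectionError", "RateLimitError"}
MAX_ATTEMPT = 5


def classify(record):
    # Same rule as the classify function above, repeated so the sketch runs alone.
    if record["attempt"] >= MAX_ATTEMPT:
        return "manual"
    return "retry" if record["error_class"] in TRANSIENT else "manual"


def split_dead_letter(path, completed_ids):
    """Read a JSON Lines dead letter and sort its jobs into lanes."""
    latest = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                # The file is append-only, so a later line for the same
                # job_id is assumed to be the more recent attempt.
                latest[record["job_id"]] = record
    lanes = {"retry": [], "manual": []}
    for job_id, record in latest.items():
        if job_id in completed_ids:
            continue  # already succeeded: requeueing would double-execute
        lanes[classify(record)].append(record)
    return lanes
```

The "retry" lane feeds the next cycle, while the "manual" lane is what you would write out to a separate file for the morning review.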
An idempotency key to prevent double execution
The scariest part of requeue is running the same job twice. Double-posting an article or running a billing step twice is a bigger accident than the original failure that put the job in the dead letter. Use job_id as an idempotency key and record completed IDs in a separate file.
    import json
    import os

    DONE_PATH = os.path.expanduser("~/agent/completed.jsonl")

    def already_done(job_id: str) -> bool:
        # No completion log yet means nothing has finished.
        if not os.path.exists(DONE_PATH):
            return False
        with open(DONE_PATH, encoding="utf-8") as f:
            # Scan the JSON Lines log, skipping blank lines.
            return any(job_id == json.loads(line)["job_id"] for line in f if line.strip())
Routing every requeue through this check at the top of the loop avoids re-running a job that still sits in the dead letter but actually succeeded last time. I once skipped this and double-pushed an article, then had to retire one copy with a 410, so I recommend it strongly.
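The check only works if something also writes the completed IDs. A small sketch of the write side and a guarded runner follows; mark_done, load_completed, and run_once are hypothetical names, and the path is passed in explicitly rather than taken from DONE_PATH so the block stays self-contained.

```python
import json
import os


def load_completed(path):
    """Return the set of job_ids recorded in the completion log."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {json.loads(line)["job_id"] for line in f if line.strip()}


def mark_done(job_id, path):
    """Append one completed job_id to the log."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"job_id": job_id}) + "\n")


def run_once(job_id, action, path):
    """Run action() only if job_id has not completed; return True if it ran."""
    if job_id in load_completed(path):
        return False
    action()
    mark_done(job_id, path)  # recorded only after the action succeeded
    return True
```

Note the remaining gap: if the process dies after action() finishes but before mark_done runs, the job can still execute twice. For steps like billing, it is worth also passing the job_id to the downstream service if it supports its own idempotency key.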
A morning check that takes five minutes
Building the machinery is pointless without the habit of looking at it. Every morning I check only the isolation lane with the command below. Collapsing it to a single line means quiet days take a few seconds.
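One way to get that one-line view, sketched here in Python rather than as a shell one-liner: the function name morning_summary and the idea that the isolation lane lives in its own JSON Lines file are assumptions for illustration.

```python
import json
import os
from collections import Counter


def morning_summary(path):
    """Summarize the isolation lane as a single line of text."""
    if not os.path.exists(path):
        return "isolation lane: empty"
    with open(path, encoding="utf-8") as f:
        counts = Counter(
            json.loads(line)["error_class"] for line in f if line.strip()
        )
    if not counts:
        return "isolation lane: empty"
    # Most frequent error classes first, so the worst offender leads the line.
    parts = ", ".join(f"{name}={n}" for name, n in counts.most_common())
    return f"isolation lane: {sum(counts.values())} job(s) [{parts}]"
```

On a quiet day this prints "isolation lane: empty" and you are done; anything else tells you at a glance which error class to look at first.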
If the same error_class suddenly spikes versus yesterday, that's a sign of a structural change, not an individual job failure. It often traces back to a model update on the Antigravity side or an API contract change, so in my operation I treat this "delta" as a first-line alert. Watching the day-over-day difference catches anomalies faster than staring at absolute counts.
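The day-over-day comparison can be sketched as a small function over dead-letter records. It assumes failed_at is an ISO 8601 string with an explicit UTC offset (for example "2024-05-02T03:00:00+00:00"); the name error_class_delta is hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def error_class_delta(records, today):
    """Return {error_class: today_count - yesterday_count}, nonzero only.

    records is an iterable of dicts with "error_class" and "failed_at";
    today is a datetime.date interpreted in UTC.
    """
    yesterday = today - timedelta(days=1)
    per_day = {today: Counter(), yesterday: Counter()}
    for record in records:
        stamp = datetime.fromisoformat(record["failed_at"])
        day = stamp.astimezone(timezone.utc).date()
        if day in per_day:
            per_day[day][record["error_class"]] += 1
    classes = set(per_day[today]) | set(per_day[yesterday])
    return {
        name: per_day[today][name] - per_day[yesterday][name]
        for name in classes
        if per_day[today][name] != per_day[yesterday][name]
    }
```

A positive value means the class grew since yesterday. What threshold counts as a "spike" depends on your volume, so treat any alert rule built on this as something to tune rather than a fixed number.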
Build it in, and failure stops being something to fear. Divert it, sort it by nature, and guard against double execution: those three steps alone let you trust an agent that runs while you sleep. If you're taking the next step, start by dropping to_dead_letter into just one of your live scheduled jobs. I hope it helps anyone wrestling with the same quiet failures.