Agents & Manager / 2026-06-20 / Advanced

Don't Lose Failed Agent Jobs: Designing a Dead-Letter and Requeue Path

Scheduled agents fail silently overnight and the work simply vanishes. Here is how to catch those failures with a dead-letter store and a staged requeue, drawn from running four sites on autopilot as an indie developer.

The agent you ran overnight reports "5 succeeded, 0 failed." Then you open the sites and three of them never updated. As an indie developer, I run four blogs that update themselves every day on a staggered, off-peak schedule, and this kind of silent failure has tripped me up more times than I'd like to admit.

What's dangerous isn't the failure itself; it's the failure that disappears without a trace. Scheduled agents run when nobody is watching. The moment an exception gets swallowed and the loop still reports zero failures, that work effectively never existed. A dead-letter store makes sure those "erased" failures always land in one place.

Why "0 failures" becomes a lie

The most common cause is a try/except inside the loop that eats the exception, logs nothing, and moves on. Driving an Antigravity agent from a script hits the same trap. Timeouts, rate limits, expired tokens, a momentary network blip: they all look transient, so it's tempting to think "the next run will pick it up."

But if the next run faces the same condition, it fails in the same place. The dropped job is gone from the queue, with no breadcrumb to requeue it. In my case, a job that fetched AdMob reports quietly failed for three straight days over a weekend, and I only noticed on Monday when the dashboard numbers had a hole in them.

The design principle is simple: don't discard failures, divert them. That single shift changes how much you can trust the system.
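As a rough illustration of "divert, don't discard," here is a minimal sketch of a job loop. The names `run_all`, `flaky_handler`, and `collect` are hypothetical, and the in-memory list stands in for whatever dead-letter store you actually use. The point is only that the `except` branch hands the failure to something and counts it, rather than passing silently.

```python
from typing import Callable

def run_all(jobs: list[dict], handler: Callable[[dict], None],
            divert: Callable[[str, dict, Exception, int], None]) -> tuple[int, int]:
    """Run every job; hand each failure to `divert` instead of dropping it."""
    succeeded, failed = 0, 0
    for job in jobs:
        try:
            handler(job["payload"])
            succeeded += 1
        except Exception as exc:  # broad on purpose: no failure may vanish
            divert(job["job_id"], job["payload"], exc, job.get("attempt", 1))
            failed += 1
    return succeeded, failed

# --- Made-up demo data: one of three jobs times out ---
def flaky_handler(payload: dict) -> None:
    if payload["site"] == "blog-b":
        raise TimeoutError("upstream timed out")

diverted: list[dict] = []

def collect(job_id: str, payload: dict, exc: Exception, attempt: int) -> None:
    # Stand-in for a real dead-letter writer; keeps the record in memory.
    diverted.append({"job_id": job_id, "payload": payload,
                     "error_class": type(exc).__name__, "attempt": attempt})

jobs = [
    {"job_id": "update-blog-a", "payload": {"site": "blog-a"}},
    {"job_id": "update-blog-b", "payload": {"site": "blog-b"}},
    {"job_id": "update-blog-c", "payload": {"site": "blog-c"}},
]
counts = run_all(jobs, flaky_handler, collect)
print(counts)    # (2, 1)
print(diverted)  # one record, for update-blog-b
```

With this shape, the summary line and the diverted records cannot disagree: every increment of `failed` is paired with a record somewhere.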

What to store in the dead letter

The minimum to keep is enough context to reproduce the same job on requeue. An error message alone won't cut it. Append one record per line as JSON Lines with the following fields, and later filtering with grep or jq becomes effortless.

| Field | Purpose |
| --- | --- |
| job_id | Uniquely identifies the work (used to prevent duplicate requeues) |
| payload | The full input needed to reproduce the job |
| error_class | Exception type; the key for deciding whether to requeue |
| attempt | How many times it has been tried so far |
| failed_at | Failure time (always stored in UTC) |
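To show why these fields are enough to work with later, here is a small sketch that reads dead-letter lines back, keeps only the newest record per `job_id`, and counts failures by `error_class`. The helper name `latest_per_job` and the sample records are made up for illustration; in practice the lines would come from your own dead-letter file.

```python
import json
from collections import Counter

def latest_per_job(lines: list[str]) -> dict[str, dict]:
    """Keep one record per job_id; the highest attempt wins."""
    latest: dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        prev = latest.get(rec["job_id"])
        if prev is None or rec["attempt"] >= prev["attempt"]:
            latest[rec["job_id"]] = rec
    return latest

# Made-up sample lines in the field layout described above.
sample = [
    '{"job_id": "job-a", "payload": {"n": 1}, "error_class": "TimeoutError", "attempt": 1, "failed_at": "2026-01-01T00:00:00+00:00"}',
    '{"job_id": "job-a", "payload": {"n": 1}, "error_class": "TimeoutError", "attempt": 2, "failed_at": "2026-01-01T00:05:00+00:00"}',
    '{"job_id": "job-b", "payload": {"n": 2}, "error_class": "PermissionError", "attempt": 1, "failed_at": "2026-01-01T00:01:00+00:00"}',
]

latest = latest_per_job(sample)
by_class = Counter(rec["error_class"] for rec in latest.values())
print(sorted(latest))  # ['job-a', 'job-b']
print(by_class)
```

Deduplicating on `job_id` before requeueing is what keeps a job that failed twice from being queued twice.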

The implementation is surprisingly short. Keeping it append-only means concurrent agents rarely corrupt lines, which makes operations easier.

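import datetime
import fcntl  # POSIX-only; not available on Windows
import json
import os

DLQ_PATH = os.path.expanduser("~/agent/dead_letter.jsonl")

def to_dead_letter(job_id: str, payload: dict, exc: Exception, attempt: int) -> None:
    """Append one failed job to the dead-letter file as a single JSON line."""
    record = {
        "job_id": job_id,
        "payload": payload,
        "error_class": type(exc).__name__,
        "message": str(exc)[:500],  # cap the message length
        "attempt": attempt,
        "failed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    os.makedirs(os.path.dirname(DLQ_PATH), exist_ok=True)
    with open(DLQ_PATH, "a", encoding="utf-8") as f:
        # Exclusive lock so concurrent agents cannot interleave partial lines.
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            # default=str keeps the writer from raising on non-JSON payload values.
            f.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")
            # Flush while the lock is still held; otherwise the buffered line
            # would only reach the file on close, after the unlock.
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)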

Recording error_class properly lets you later sort failures mechanically into "will heal if we wait" versus "a human needs to look." That distinction is the foundation of the next layer.
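A minimal sketch of that mechanical sort might look like the following. Everything specific here is an assumption for illustration: the `TRANSIENT` set, the `MAX_ATTEMPTS` cap, and the `triage` function name are hypothetical, and which exception names count as "will heal if we wait" depends on your own jobs.

```python
# Assumed set of error classes that usually heal on their own.
# Adjust to the exceptions your own jobs actually raise.
TRANSIENT = {"TimeoutError", "ConnectionError", "ConnectionResetError"}

# Assumed cap so a "transient" error cannot be retried forever.
MAX_ATTEMPTS = 3

def triage(record: dict) -> str:
    """Return "requeue" for errors likely to heal, "manual" for the rest."""
    if record["error_class"] in TRANSIENT and record["attempt"] < MAX_ATTEMPTS:
        return "requeue"
    return "manual"

# Made-up records using the dead-letter field names.
examples = [
    {"job_id": "job-a", "error_class": "TimeoutError", "attempt": 1},
    {"job_id": "job-b", "error_class": "TimeoutError", "attempt": 3},
    {"job_id": "job-c", "error_class": "PermissionError", "attempt": 1},
]
for rec in examples:
    print(rec["job_id"], triage(rec))
# job-a requeue
# job-b manual
# job-c manual
```

Defaulting unknown error classes to "manual" is the cautious choice: an unfamiliar failure gets a human look rather than a blind retry.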
