Turning Last Night's Failed Runs into Tomorrow's Prevention — Designing a Postmortem Feedback Loop
Stop letting unattended failures end at a notification. A concrete design for classifying failures and feeding fixes back into Guide skills, gates, and schedules, with measured recurrence rates.
An agent run scheduled for 2 a.m. fails, and all that greets you in the morning is a notification. You skim the log, mutter "timeout again," patch something, rerun it, and move on. A few days later a suspiciously similar failure shows up in a different task. As an indie developer running several unattended jobs every night, I lived in that loop longer than I'd like to admit.
The diagnosis is simple. I was responding to failures, but nothing carried the lesson back into my prompts, gates, or schedules. There was no return path.
Right after a failure, you are in "just make it pass" mode. Once the rerun succeeds, the motivation to record a root cause evaporates.
So I split the roles by time of day. At night, the only automated reactions are a retry and a notification. Classification and correction happen in a fixed five-minute slot the next morning. Since adopting that separation, the quality of my follow-ups has stopped depending on how sleepy I am.
Thinning out the immediate response only works if every run leaves machine-readable evidence behind. That is the foundation.
Recording evidence as a run record
Every run, pass or fail, writes one JSON file on exit. Mine looks like this:
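As an illustration only, a record carrying the fields this article refers to (task, phase, exitCode, configHash, lastOutputTail) might look like the following. The task name and every value are made up for the example, and a real record may well carry more fields than these.

```json
{
  "task": "example-nightly-task",
  "phase": "generation",
  "exitCode": 124,
  "configHash": "3f9a1c07b2de",
  "lastOutputTail": "waiting for response...\nerror: timed out"
}
```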
The field that earns its keep is phase. Slice the run into stages — prepare, generate, quality gate, push, log — and record where it died. Most of the classification below falls out of that one field.
configHash is a hash over the prompt and config files together. It exists to answer "did failures spike right after I changed the config?" — a question it has settled for me twice already.
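One way to compute such a hash, as a minimal sketch: concatenate the files and hash the result. The function name config_hash, the 12-character truncation, and the use of GNU sha256sum are my assumptions here, not necessarily what the original setup uses.

```shell
#!/usr/bin/env bash
# config_hash: hypothetical helper that hashes the prompt and config files
# together. Assumes GNU coreutils (sha256sum); on macOS, `shasum -a 256`
# would be the closest equivalent.
config_hash() {
  # Concatenate the files in the order given and hash the combined bytes.
  # Argument order matters, so always pass the files in the same order.
  cat -- "$@" | sha256sum | cut -c1-12
}
```

Called as `config_hash prompt.md config.yaml` (hypothetical file names), it prints a short identifier that changes whenever either file changes, which is enough to group failed records by configuration.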
The record is written by a wrapper script using a trap:
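A minimal sketch of such a wrapper follows. Several details are assumptions on my part: the function names, the per-run file naming, sourcing the task so that its AGENT_PHASE is visible to the trap, and the hand-rolled JSON escaping (which covers only the common cases; building the record with jq would be sturdier).

```shell
#!/usr/bin/env bash
# run-with-record.sh: sketch of a wrapper that writes one JSON record per run.
# Assumed, not taken from the article: RUNS_DIR layout details, file naming,
# and the helper names json_escape / run_task / write_record.

RUNS_DIR="${RUNS_DIR:-$HOME/.agent-runs}"

# Escape a string for a JSON string literal. Handles backslash, double quote,
# newline, carriage return, and tab; other control characters pass through.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\r'/\\r}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"
}

# Run a task script inside a subshell (note the parentheses around the body).
# The EXIT trap fires however the task ends, so a record is always written.
run_task() (
  task_name=$1
  task_script=$2
  config_hash=${3:-unknown}
  day_dir="$RUNS_DIR/$(date +%Y-%m-%d)"
  stamp="$(date +%H%M%S)"
  mkdir -p "$day_dir" || exit 1
  log="$day_dir/$task_name-$stamp.log"
  : > "$log"
  AGENT_PHASE=prepare   # default until the task reports a later stage

  write_record() {
    local code=$1
    printf '{"task":"%s","phase":"%s","exitCode":%d,"configHash":"%s","lastOutputTail":"%s"}\n' \
      "$(json_escape "$task_name")" "$(json_escape "$AGENT_PHASE")" "$code" \
      "$(json_escape "$config_hash")" "$(json_escape "$(tail -n 20 "$log")")" \
      > "$day_dir/$task_name-$stamp.json"
  }
  trap 'write_record "$?"' EXIT

  # Sourced rather than executed: an `export AGENT_PHASE=...` inside the task
  # then updates the same variable the trap reads.
  source "$task_script" >>"$log" 2>&1
)

# When invoked directly: run-with-record.sh TASK_NAME TASK_SCRIPT [CONFIG_HASH]
if [[ "${BASH_SOURCE[0]}" == "$0" && $# -ge 2 ]]; then
  run_task "$@"
fi
```

The subshell is the design choice that matters here: the trap belongs to that one run, and an `exit` inside the task ends only the subshell, with its status passed back to the caller.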
The task itself only needs to run a line such as export AGENT_PHASE=generation each time it moves to the next stage. Existing tasks barely change.
The morning review assigns each failed record to one of five classes. The crucial design decision is that each class has exactly one predetermined place where the fix goes.
| Class | Typical example | Return path |
| --- | --- | --- |
| Environment | Disk full, expired auth, network outage | Add a preflight check item |
| Context | Missing reference file, stale repository | Preconditions section of the Guide skill |
| Prompt | Output format drift from ambiguous instructions | The task prompt itself |
| Upstream change | CLI update or model switch altering behavior | Version pinning and canary settings |
| By design | A gate correctly rejected output, schedule collision | Gate threshold or schedule definition |
Why exactly one return path? Because if you fix three things at once, you never learn which one worked. One failure, one correction; if it does not stick, revisit the classification the following week. Unglamorous, but it converged faster than anything else I tried.
Note that the fifth class includes cases where a quality gate did its job. Those are closed as "no fix needed." Accepting up front that not every failure demands a correction keeps the review light.
The digest script that makes five minutes real
Yesterday's records get collapsed into one screen:
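```bash
#!/usr/bin/env bash
# morning-digest.sh: aggregate yesterday's failed runs
# Note: `date -d yesterday` is GNU date syntax.
shopt -s nullglob   # a day directory without records yields no iterations

DAY="${1:-$(date -d yesterday +%Y-%m-%d)}"
DIR="$HOME/.agent-runs/$DAY"
[ -d "$DIR" ] || { echo "no records for $DAY"; exit 0; }

echo "== $DAY failed runs =="
for f in "$DIR"/*.json; do
  # One line per failed record: task, phase, first 80 chars of the output tail
  jq -r 'select(.exitCode != 0)
    | "\(.task)\t phase=\(.phase)\t \((.lastOutputTail // "") | gsub("\n"; " ") | .[0:80])"' "$f"
done | sort | uniq -c | sort -rn
```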
The uniq -c is deliberate: when the same task fails at the same phase with the same output tail more than once, the repeated line floats to the top as the obvious priority.
The routine itself is fixed. Read the digest, classify each failure, edit exactly one file at the designated return path, and leave a one-line note pairing the class with the fix. That fits in five minutes. Anything that will not fit gets promoted to regular working hours instead of bloating the review.
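The one-line note can be as plain as an appended row in a TSV file. The sketch below assumes a file location, a column order (date, task, class, fix), and lowercase class names of my own choosing; the only part taken from the routine above is that each note pairs a class with a single fix.

```shell
#!/usr/bin/env bash
# note_fix: hypothetical helper that appends one review note per failure.
# NOTES_FILE and the column layout are assumptions for this sketch.
NOTES_FILE="${NOTES_FILE:-$HOME/.agent-runs/notes.tsv}"

note_fix() {
  local task=$1 class=$2 fix=$3
  # Accept only the five classes, so a typo cannot create a sixth.
  case "$class" in
    environment|context|prompt|upstream|by-design) ;;
    *) echo "unknown class: $class" >&2; return 1 ;;
  esac
  # Columns: date, task, class, the one place that was edited.
  printf '%s\t%s\t%s\t%s\n' "$(date +%F)" "$task" "$class" "$fix" >> "$NOTES_FILE"
}
```

A call such as `note_fix example-nightly-task environment "preflight: add disk space check"` (illustrative values) records the decision in a form that later scripts can count.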
Measuring whether the loop works
The health metric is same-cause recurrence: the share of this week's failures that match a class-and-task pair already recorded in the previous four weeks.
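That definition can be computed mechanically once each classified failure is stored as a row. The sketch below assumes a tab-separated notes file whose first three columns are date (YYYY-MM-DD), task, and class; the file format, the function name recurrence_rate, and the explicit window arguments are my assumptions.

```shell
#!/usr/bin/env bash
# recurrence_rate: hypothetical helper for the same-cause recurrence metric.
# Usage: recurrence_rate NOTES_TSV LOOKBACK_START WEEK_START WEEK_END
# Dates are ISO strings; the end of each window is exclusive.
recurrence_rate() {
  local notes=$1 lookback_start=$2 week_start=$3 week_end=$4
  awk -F'\t' -v from="$lookback_start" -v start="$week_start" -v end="$week_end" '
    # Pass 1: remember every task-and-class pair seen in the lookback window.
    NR == FNR { if ($1 >= from && $1 < start) seen[$2 FS $3] = 1; next }
    # Pass 2: count failures in the current week and how many repeat a pair.
    $1 >= start && $1 < end { total++; if (($2 FS $3) in seen) repeats++ }
    END {
      if (total == 0) { print "no failures in window"; exit }
      printf "%d/%d recurring (%.0f%%)\n", repeats, total, 100 * repeats / total
    }
  ' "$notes" "$notes"
}
```

For a week starting 2024-06-10 (an illustrative date), the call would be `recurrence_rate notes.tsv 2024-05-13 2024-06-10 2024-06-17`, where the first date is 28 days before the week start. ISO dates compare correctly as plain strings, which is why no date arithmetic is needed inside awk.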
In my own operation (four content sites run unattended as a solo Dolice Labs project), recurrence stood at roughly 40% when I started: four out of ten failures were repeats of something I had already seen. Six weeks after putting the return paths in place, it hovered around 12%, and absolute failures dropped from about nine per week to just under four.
More valuable than the numbers is the qualitative shift: more and more of the failures that remain are novel ones. Novel failures point at genuine design gaps, which turns the morning review into something I almost look forward to.
Start by wiring up the run record tonight and nothing else. Classification and reviews can wait until a week of evidence has accumulated. Evidence first, process second — in that order, adoption costs almost nothing.
I hope this helps if you are growing an unattended fleet of your own. Thanks for reading.