Keeping Unattended Agent Run Logs Long Enough to Debug — Without Filling the Disk
A scheduled agent is only fixable if you can reconstruct why it failed. Here is how to keep run logs around without filling the disk — tiered retention, schema-versioned records, and a compaction job — drawn from running four sites on autopilot as an indie developer.
Your overnight agent reports "1 failed" — and that's all it tells you. The one thing you actually need, where and why it broke, is already gone. As an indie developer I run four blogs that update themselves every day on a staggered, off-peak schedule, and the first thing that tripped me up wasn't the number of failures. It was not being able to reconstruct why any of them happened.
An unattended agent runs when nobody is watching. That means its logs exist for exactly one purpose: to be read later, by someone trying to figure out what went wrong. But if you naively keep everything, the disk fills within weeks and the logging itself starts failing. This article covers how to keep logs from overflowing while still being able to trace every failure.
When morning comes and you can't tell why it broke
The hard part of scheduled execution is that you're never present at the moment the error happens. When you run something by hand, you read the output as it scrolls past. The output of a job that fires at 2 a.m. simply disappears unless you saved it.
In my case, I started by appending stdout straight to a file. That works for a few days. The problem was that four sites' worth of jobs each run several times daily, so raw logs piled up by tens of megabytes per day. By the weekend the runner had a few hundred megabytes left, and new jobs were failing on write.
So unattended logging has two demands that pull against each other: keep enough detail to investigate later, and keep disk usage flat. To satisfy both, you have to design when logs get deleted from the very start.
Both "keep everything" and "delete immediately" fail
The two extremes each break in their own way.
If you keep everything, the logs grow monotonically and the runner eventually stops on a full disk. Worse, bloated logs are slow to search and end up unread anyway. Deleting immediately (say, keeping only the latest run's output) means that by the time you notice a failure, the log that explains it has already been overwritten. A failure you can't reproduce is a failure you can't fix.
What works is to stop treating all logs the same. Keep the most recent runs in full, keep slightly older runs as summaries only, and discard runs that are old enough. Varying the granularity by freshness lets you cap disk usage while still keeping "why it broke recently" within reach. That is the idea behind tiered retention.
Hot tier (last 7 days, full text): complete records including stdout, exception stacks, and an input summary. Recent failures get diagnosed from here.
Warm tier (8–30 days, summary only): the full text is dropped and each run is compacted to a one-line summary record — success/failure, duration, item count, exit code, and just the first line of any error.
Cold tier (31 days and older): deleted by default. If you need trend analysis, push only daily aggregate numbers to a separate small file.
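The three tiers above reduce to a single age check. The following is a minimal sketch, assuming the boundaries given in the list; `tierFor`, `HOT_DAYS`, and `WARM_DAYS` are hypothetical names, not taken from the author's code.

```typescript
// Hypothetical helper: map a run's start time to the tier it belongs in.
type Tier = "hot" | "warm" | "cold";

const DAY_MS = 24 * 60 * 60 * 1000;
const HOT_DAYS = 7;   // full text is kept this long
const WARM_DAYS = 30; // after that, summary only until this age

function tierFor(startedAt: string, now: number): Tier {
  const ageDays = (now - Date.parse(startedAt)) / DAY_MS;
  // An unparseable timestamp gives NaN. Treat it as hot so that a
  // malformed record is kept for inspection instead of being deleted.
  if (isNaN(ageDays)) return "hot";
  if (ageDays <= HOT_DAYS) return "hot";
  if (ageDays <= WARM_DAYS) return "warm";
  return "cold";
}
```

Defaulting unreadable timestamps to the hot tier is a design choice in this sketch: a record that cannot be dated is more likely a bug worth looking at than something safe to expire.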
In concrete numbers, my four-site setup produces 30–50 MB of raw logs per day in total. Holding seven days in the hot tier caps it at roughly 350 MB at steady state. A warm summary is about 200 bytes per run, so even 30 days fits in a few megabytes. The result is that total log footprint stabilizes around 400 MB and no longer grows without bound.
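The arithmetic above can be checked with a small estimator. This is a sketch under stated assumptions: `steadyStateMB` and its field names are hypothetical, and `runsPerDay: 20` is an assumed figure, since the text only says each of the four sites runs several times a day.

```typescript
// Back-of-the-envelope steady-state footprint using the figures in the text.
interface RetentionInputs {
  rawMBPerDay: number;     // full-text volume written per day
  hotDays: number;         // days of full text retained
  runsPerDay: number;      // ASSUMPTION: total runs per day across all sites
  warmBytesPerRun: number; // size of one summary line
  warmDays: number;        // days of summaries retained
}

function steadyStateMB(i: RetentionInputs) {
  const hotMB = i.rawMBPerDay * i.hotDays;
  const warmMB = (i.runsPerDay * i.warmDays * i.warmBytesPerRun) / 1e6;
  return { hotMB, warmMB, totalMB: hotMB + warmMB };
}

// Upper end of the 30-50 MB/day range, 7 hot days, 200-byte summaries.
const worstCase = steadyStateMB({
  rawMBPerDay: 50,
  hotDays: 7,
  runsPerDay: 20,
  warmBytesPerRun: 200,
  warmDays: 30,
});
console.log(worstCase); // hotMB is 350; warmMB is a small fraction of 1 MB
```

With these inputs the hot tier dominates and the summaries are negligible. The remaining distance to the roughly 400 MB total quoted above is plausibly the long-lived failure records discussed in the next section.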
You can also draw the tier boundaries by run count instead of days, but I recommend days. Incidents are remembered by date, so with date-based boundaries you can search backward from the day something broke and know right away which tier still holds that run.
Protect failure logs past their expiry
Tiered retention needs exactly one exception added to it: failed records must not be dropped from the hot tier even after the retention window passes.
The reason is simple — the rarer a failure, the more valuable it is. A job that breaks once a month under a specific condition will have its evidence erased by a 7-day window before it recurs. You'd be left waiting for a repeat with no idea of the cause, which is the single worst position to be in for unattended operation.
In practice, when you select what to compact or delete, only evict records that are both successful and past the window, and keep failures on a separate, longer schedule. I keep failed records in full for up to 90 days, and only thin the oldest ones if they somehow keep growing. Failures rarely exceed a few dozen megabytes in 90 days, so the impact on disk is negligible.
// Eviction candidates: successes past 7 days, failures only past 90 days
type RunRecord = {
  schema: number;      // schema version (see below)
  runId: string;
  site: string;
  startedAt: string;   // ISO 8601
  ok: boolean;
  exitCode: number;
  durationMs: number;
  itemCount: number;
  errorHead?: string;  // failures only: first line of the error
};

const DAY = 24 * 60 * 60 * 1000;

function isEvictable(r: RunRecord, now: number): boolean {
  const ageDays = (now - Date.parse(r.startedAt)) / DAY;
  if (!r.ok) {
    // Failures are protected for 90 days, far beyond the 7-day window
    return ageDays > 90;
  }
  // Successes leave the hot tier after 7 days
  return ageDays > 7;
}
This one rule alone goes a long way toward preventing the "it recurred, but last time's log is gone" situation.
Give every record a schema version
A pitfall that bites in long-term operation is changing the log format. As you keep running, you will inevitably hit a moment of "I wish I'd recorded that field too." The instant you add or rename a field, the aggregation script that reads old logs throws on the older lines and stops.
To avoid this, put an integer schema version at the head of every record. The reader branches on the version, refuses to crash on a version it doesn't know, and interprets only the fields it understands.
// Reader: survive unknown schema versions, parse only what you know
function parseRecord(line: string): RunRecord | null {
  let raw: any;
  try {
    raw = JSON.parse(line);
  } catch {
    return null; // skip broken lines silently (e.g. partial-write debris)
  }
  // JSON.parse also accepts null, numbers, strings, and arrays; none are records
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) return null;
  const schema = typeof raw.schema === "number" ? raw.schema : 1;
  if (schema > 2) {
    // Ignore future fields; pull out only the common ones
    return {
      schema,
      runId: raw.runId,
      site: raw.site,
      startedAt: raw.startedAt,
      ok: !!raw.ok,
      exitCode: raw.exitCode ?? -1,
      durationMs: raw.durationMs ?? 0,
      itemCount: raw.itemCount ?? 0,
      errorHead: raw.errorHead,
    };
  }
  // v1 had no itemCount, so fill a default
  return { itemCount: 0, ...raw, schema };
}
The version number costs almost nothing and pays off later: it spares you from bulk-converting all your past logs whenever the schema changes.
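On the writing side, stamping the version can be a one-line wrapper. The sketch below is illustrative only: `toLine`, `RunFields`, and `CURRENT_SCHEMA` are hypothetical names, and the value 2 is an assumption inferred from the reader treating anything above 2 as a future version.

```typescript
// Hypothetical writer: every record is serialized with its schema version first.
const CURRENT_SCHEMA = 2; // ASSUMPTION: the latest version the reader knows

interface RunFields {
  runId: string;
  site: string;
  startedAt: string; // ISO 8601
  ok: boolean;
  exitCode: number;
  durationMs: number;
  itemCount: number;
  errorHead?: string; // failures only
}

function toLine(fields: RunFields): string {
  // JSON.stringify emits string keys in insertion order, so listing
  // `schema` before the spread puts it at the head of the line.
  return JSON.stringify({ schema: CURRENT_SCHEMA, ...fields });
}

const example = toLine({
  runId: "r-001",
  site: "blog-a",
  startedAt: "2024-06-29T02:00:00Z",
  ok: true,
  exitCode: 0,
  durationMs: 4200,
  itemCount: 3,
});
console.log(example);
```

Keeping the version in one constant means a format change touches one place in the writer, and the reader's version branch is the only other code that needs to know about it.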
Run a single compaction job
Tiered retention does not happen on its own. The work of demoting hot to warm and expiring warm should run as a separate, small job once a day. The key is not to make the main agent do this cleanup. If you bury the cleanup inside the main process, log rotation stops on any day the main process fails — so logs pile up most on exactly the day you most need them.
#!/usr/bin/env bash
# rotate-logs.sh: run once a day, separately from the main job
set -euo pipefail

LOG_DIR="${HOME}/agent-logs"
HOT="${LOG_DIR}/hot.jsonl"
WARM="${LOG_DIR}/warm.jsonl"
TMP="$(mktemp "${LOG_DIR}/rotate.XXXXXX")"
# Clean up the temp file if compaction fails (a no-op after a successful mv)
trap 'rm -f "$TMP"' EXIT

node "${LOG_DIR}/compact.mjs" "$HOT" "$WARM" "$TMP"

# Swap atomically (the original stays intact if this dies midway)
mv "$TMP" "$HOT"

# Check free disk; thin further if it's tight (--output requires GNU df)
FREE_MB=$(df --output=avail -m "$LOG_DIR" | tail -1 | tr -d ' ')
if [ "${FREE_MB:-0}" -lt 300 ]; then
  echo "low disk: ${FREE_MB}MB, running emergency eviction"
  node "${LOG_DIR}/evict.mjs" "$WARM"
fi
Writing to a temp file and replacing with mv keeps the original hot log intact if the job dies mid-compaction. If you trim a file in place, an interruption during the trim can lose the log itself. Having the cleanup job destroy the very thing it's meant to maintain defeats the purpose, so I recommend the atomic swap here.
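The script calls a compaction step whose contents are not shown here. As a rough sketch of what such a step might do, the function below takes the lines of both files and returns their new contents, leaving file I/O and the atomic swap to the caller. Every name in it is hypothetical, and the choice to emit a summary only when one does not already exist for that `runId` is an assumption made so the sketch works whether summaries are written at run time or at demotion.

```typescript
// Hypothetical compaction logic as a pure function over JSONL lines.
interface Summary {
  schema: number; runId: string; site: string; startedAt: string;
  ok: boolean; exitCode: number; durationMs: number; itemCount: number;
  errorHead?: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;
const HOT_DAYS = 7, WARM_DAYS = 30, FAILURE_DAYS = 90;

function ageDays(startedAt: string, now: number): number {
  return (now - Date.parse(startedAt)) / DAY_MS;
}

// Returns null for blank, broken, or undatable lines.
function parse(line: string): Summary | null {
  try {
    const raw = JSON.parse(line);
    if (typeof raw !== "object" || raw === null) return null;
    if (typeof raw.runId !== "string" || typeof raw.startedAt !== "string") return null;
    if (isNaN(Date.parse(raw.startedAt))) return null;
    return raw as Summary;
  } catch {
    return null;
  }
}

// Copy only the summary fields, dropping stdout, stacks, and other full text.
function summarize(r: Summary): Summary {
  const s: Summary = {
    schema: r.schema, runId: r.runId, site: r.site, startedAt: r.startedAt,
    ok: r.ok, exitCode: r.exitCode, durationMs: r.durationMs, itemCount: r.itemCount,
  };
  if (r.errorHead !== undefined) s.errorHead = r.errorHead;
  return s;
}

function compact(hotLines: string[], warmLines: string[], now: number) {
  const hot: string[] = [];
  const warm: string[] = [];
  const summarized: string[] = []; // runIds that already have a warm summary

  // 1. Expire warm summaries older than the warm window.
  for (const line of warmLines) {
    const r = parse(line);
    if (r === null || ageDays(r.startedAt, now) > WARM_DAYS) continue;
    warm.push(line);
    summarized.push(r.runId);
  }

  // 2. Demote hot records past their window: 7 days for successes, 90 for failures.
  for (const line of hotLines) {
    const r = parse(line);
    if (r === null) {
      if (line.trim() !== "") hot.push(line); // keep unreadable lines as evidence
      continue;
    }
    const age = ageDays(r.startedAt, now);
    if (age <= (r.ok ? HOT_DAYS : FAILURE_DAYS)) {
      hot.push(line);
    } else if (age <= WARM_DAYS && summarized.indexOf(r.runId) === -1) {
      warm.push(JSON.stringify(summarize(r)));
      summarized.push(r.runId);
    }
  }
  return { hot, warm };
}
```

Keeping the logic free of file access makes it easy to test against a handful of hand-written lines before pointing it at real logs.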
A priority eviction order under disk pressure
The day will come when capacity simply runs short. Deciding in advance what to delete first prevents the accident of panic-deleting something important. The order I use is:
1. Full text of successful records (hot tier). The summary survives in warm, so drop the full text first.
2. Summary records from the old end of the warm tier, oldest date first.
3. If that's still not enough, the oldest failed records past 90 days.
The spine of this order is one principle: discard the details of success first, and guard the evidence of failure to the very end. The logs you actually re-read in unattended operation are the failures, not the successes. Evict in the opposite order and you'll free up disk but be left with no clues when it matters.
Note that deletion should write the lines you want to keep into a new file and swap it in, rather than truncating in place. The reason is the same as the compaction job: the original survives even if you're interrupted partway.
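The eviction order can be expressed as a ranking over candidates. What follows is a sketch only, not the emergency eviction script itself: `Candidate`, `priority`, and `planEviction` are hypothetical names, and ordering the first tier oldest-first is an assumption the text does not state.

```typescript
// Hypothetical planner: pick what to delete, in priority order, until enough is freed.
interface Candidate {
  kind: "hot" | "warm"; // full-text record or one-line summary
  runId: string;
  startedAt: string;    // ISO 8601
  ok: boolean;
  bytes: number;        // space reclaimed by deleting it
}

const DAY_MS = 24 * 60 * 60 * 1000;
const FAILURE_DAYS = 90;

// Lower number is evicted earlier; null means never evicted.
function priority(c: Candidate, now: number): number | null {
  const age = (now - Date.parse(c.startedAt)) / DAY_MS;
  if (isNaN(age)) return null;                // undatable: leave it alone
  if (c.kind === "hot" && c.ok) return 1;     // full text of successes
  if (c.kind === "warm") return 2;            // summaries
  if (!c.ok && age > FAILURE_DAYS) return 3;  // failures past 90 days
  return null;                                // protected failure evidence
}

function planEviction(cands: Candidate[], bytesNeeded: number, now: number): Candidate[] {
  const ranked: { c: Candidate; p: number }[] = [];
  for (const c of cands) {
    const p = priority(c, now);
    if (p !== null) ranked.push({ c, p });
  }
  // Within a priority level, the oldest record goes first.
  ranked.sort((a, b) =>
    a.p - b.p || Date.parse(a.c.startedAt) - Date.parse(b.c.startedAt));

  const plan: Candidate[] = [];
  let freed = 0;
  for (const { c } of ranked) {
    if (freed >= bytesNeeded) break;
    plan.push(c);
    freed += c.bytes;
  }
  return plan;
}
```

The planner only returns a list. Acting on it should still follow the write-new-file-then-swap rule, so an interrupted eviction leaves the original files untouched.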
A small check to fold into your routine
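Finally, a short morning check so this doesn't become "write it and forget it." Because the warm tier is one summary line per run, you can grasp last night's outcomes just by reading it. This only works if each run's summary line lands in warm.jsonl as soon as the run finishes, not when its full text ages out of the hot tier a week later.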
# List last night's runs with pass/fail (readable in 30 seconds)
node -e '
const fs = require("fs");
const text = fs.readFileSync(process.env.HOME + "/agent-logs/warm.jsonl", "utf8");
const since = Date.now() - 24 * 60 * 60 * 1000;
for (const l of text.split("\n")) {
  if (!l.trim()) continue; // blank line or empty file
  let r;
  try { r = JSON.parse(l); } catch { continue; } // partial-write debris
  if (!(Date.parse(r?.startedAt) >= since)) continue;
  console.log(`${r.ok ? "OK " : "NG "} ${r.site} ${r.itemCount} items ${r.errorHead ?? ""}`);
}
'
Once glancing at this list becomes a habit, failures almost stop slipping away unnoticed. The goal of log design isn't to save space; it's to reach the reason something broke by the shortest path. Since switching to this three-tier retention, the time I spend on root-cause investigation has dropped by more than half, by my rough estimate.
Unattended logs are a handoff note to your future self. Don't hoard the full text forever, and don't keep only the latest while throwing away the past. Vary the granularity by freshness, and guard the evidence of failure for the long haul. Draw that line up front and you get both a flat disk and peace of mind. I hope this helps anyone else running several jobs unattended.