Spotting Agents That Are Alive but Stuck — Designing a Progress Heartbeat and Watchdog
The process is alive but the work isn't moving — the nastiest state for a background agent. Here is how to switch from liveness to progress monitoring to detect it, and how to stop it safely, with working code.
In the morning, the dashboard was still green. The process was running, it was eating a little CPU, the last log line read "querying the model" — and yet not a single character had advanced in six hours.
As an indie developer, I run an article-generation agent overnight. That night an external API never responded, never even entered retry, and just kept waiting. The process was genuinely alive. So the monitoring that watched "is it alive" kept returning green the whole time. The mistake was treating liveness and progress as the same thing.
"Alive" and "progressing" are different questions
Out of server-monitoring habit, we reach for "does the process exist" and "does the port respond" as health signals. For short requests that is enough. But a background agent can spend minutes to tens of minutes on a single job, and throughout it can be "perfectly present while advancing nothing."
The classic stalls are an external call that never returns, an infinite reasoning loop in front of an unsolvable dependency, and a deadlock where it waits on itself. In all of them the process is alive. Liveness monitoring catches none of them. What you need is progress monitoring.
Kind of monitoring
Question it answers
Catches a stall?
Liveness (process exists, port responds)
Is it running?
No
Log output present
Is it talking?
Fooled by endless "waiting…"
Progress (amount advanced)
Did work move forward?
Yes
Judging by whether logs are flowing is also dangerous. Code that keeps printing "waiting…" is talking but not advancing. Watch the amount of forward motion, not the chattiness.
Emitting progress as a marker
To watch by progress, the agent itself must leave "I got this far" in an externally observable form. Write a monotonically increasing step number and the time it last advanced to a shared location.
# progress.py — stamp a progress marker into a shared fileimport json, os, tempfile, timedef mark_progress(path: str, step: int, note: str = "") -> None: """step is monotonic. Call only when you advanced (not while waiting).""" payload = {"step": step, "note": note, "ts": int(time.time())} fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".") with os.fdopen(fd, "w") as f: json.dump(payload, f) os.replace(tmp, path) # atomic swap
The discipline that matters: call mark_progress only when you advanced. If you design the heartbeat to "always fire on a timer," the pulse continues even while stalled, and you are back to liveness monitoring. A pulse must mean "moved forward," not "still alive."
In the agent body, advance the step at meaningful boundaries.
mark_progress(P, 1, "collection done")# ... external API call (this is where it tends to freeze) ...mark_progress(P, 2, "draft generated")# ... formatting ...mark_progress(P, 3, "verification passed")
✦
Thank you for reading this far.
Continue Reading
What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.
WHAT YOU'LL LEARN
✦Understand that 'is it alive' and 'is it progressing' are entirely different questions, and why liveness monitoring alone misses a silent stall
✦A complete implementation from progress heartbeat to no-progress timeout to safe termination (a Python watchdog plus a bash integration). Drop it straight into your own job
✦From the real experience of running an agent overnight as an indie developer and finding it frozen silently until morning, a clear rule for where to place progress markers
Secure payment via Stripe · Cancel anytime
✦
Unlock This Article
Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.
A separate watchdog process watches this marker. The rule is simple: if the time since the last advance exceeds a threshold and the step number has not moved, treat it as stalled.
# watchdog.py — if no progress persists, declare a stall and stop the targetimport json, os, signal, timedef watch(progress_path: str, pidfile: str, stall_sec: int = 600, poll: int = 30): last_step, last_change = -1, time.time() while True: time.sleep(poll) try: d = json.load(open(progress_path)) except (FileNotFoundError, json.JSONDecodeError): continue # not written yet, or mid-write if d["step"] != last_step: last_step, last_change = d["step"], time.time() # observed an advance continue if time.time() - last_change > stall_sec: pid = int(open(pidfile).read().strip()) terminate(pid) # no progress past threshold -> stop returndef terminate(pid: int): # graceful first (give it time to clean up), then force if it lingers try: os.kill(pid, signal.SIGTERM) except ProcessLookupError: return time.sleep(20) try: os.kill(pid, signal.SIGKILL) except ProcessLookupError: pass
Choose stall_sec by feel: "for this work, silence this long is abnormal." For my generation job, the longest model call measured around 90 seconds, so I set the threshold to several times that, at 10 minutes (600 seconds). Too short kills legitimate long thinking; too long means you only notice in the morning.
Designing the way you stop it
Detecting it is not enough; a crude stop creates a different accident. Killing outright leaves a half-written artifact. So make it two-stage.
Send SIGTERM first and let the agent's signal handler clean up (delete temp files, release locks, record an "aborted" note to the progress marker).
If it still does not finish within a window, use SIGKILL to bring it down for certain.
Install a handler in the agent body that takes on the cleanup.
cleanup() { rm -f "$TMP_OUT" 2>/dev/null || true # leave no half-written output release_lease 2>/dev/null || true # let go of the lock echo '{"step":-1,"note":"aborted"}' > "$PROGRESS" exit 143 # 128 + SIGTERM}trap cleanup TERM INT
Leaving that "aborted" note in the progress marker means the next start can tell "last run died on a stall," giving you a starting point for investigation. In production, the presence or absence of that one line completely changed how long the next morning's investigation took.
Common pitfalls and how to avoid them
The first pitfall is decoupling the heartbeat from progress. Beat on a timer and the pulse continues during a stall, so the watchdog never fires. Always tie the pulse to a forward advance.
The second is putting the watchdog inside the same process as the agent. When the body freezes along with its event loop, the built-in timer freezes too. I recommend always running the watchdog as a separate process that observes through a shared file.
The third is using one stall_sec for all jobs. Apply the same threshold to a 30-second formatting job and a 5-minute large generation job and one of them will misfire. Holding a per-job-type threshold is realistic; in the Dolice Labs setup I keep stall_sec in each job's definition metadata and switch it per job.
Where to start
You do not need a perfect monitoring mesh from day one. What worked for me, from real experience, was the minimal version: stamp just three progress markers into the single job whose long freeze hurts most, and point a separate-process watchdog at it.
Run it overnight, and whether a SIGTERM ever fired tells you whether a silent stall is actually happening. If it did not, leave the threshold; if it fires often, treat the cause (usually the external call). Do not fully trust a green dashboard — watch the amount of forward motion. Switching to that single point, I believe, reliably reduces the morning surprise of "somehow it never finished."
Share
Thank You for Reading
Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.