Spotting Agents That Are Alive but Stuck — Designing a Progress Heartbeat and Watchdog

The process is alive but the work isn't moving — the nastiest state for a background agent. Here is how to switch from liveness to progress monitoring to detect it, and how to stop it safely, with working code.

background agents³ Antigravity²⁸⁸ watchdog² timeout design operations¹⁹

✦ Premium Article

In the morning, the dashboard was still green. The process was running, it was eating a little CPU, the last log line read "querying the model" — and yet not a single character had advanced in six hours.

As an indie developer, I run an article-generation agent overnight. That night an external API never responded, never even entered retry, and just kept waiting. The process was genuinely alive. So the monitoring that watched "is it alive" kept returning green the whole time. The mistake was treating liveness and progress as the same thing.

"Alive" and "progressing" are different questions

Out of server-monitoring habit, we reach for "does the process exist" and "does the port respond" as health signals. For short requests that is enough. But a background agent can spend minutes to tens of minutes on a single job, and throughout it can be "perfectly present while advancing nothing."

The classic stalls are an external call that never returns, an infinite reasoning loop in front of an unsolvable dependency, and a deadlock where it waits on itself. In all of them the process is alive. Liveness monitoring catches none of them. What you need is progress monitoring.

Kind of monitoring	Question it answers	Catches a stall?
Liveness (process exists, port responds)	Is it running?	No
Log output present	Is it talking?	Fooled by endless "waiting…"
Progress (amount advanced)	Did work move forward?	Yes

Judging by whether logs are flowing is also dangerous. Code that keeps printing "waiting…" is talking but not advancing. Watch the amount of forward motion, not the chattiness.

Emitting progress as a marker

To watch by progress, the agent itself must leave "I got this far" in an externally observable form. Write a monotonically increasing step number and the time it last advanced to a shared location.

# progress.py — stamp a progress marker into a shared file
import json, os, tempfile, time
 
def mark_progress(path: str, step: int, note: str = "") -> None:
    """step is monotonic. Call only when you advanced (not while waiting)."""
    payload = {"step": step, "note": note, "ts": int(time.time())}
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp, path)        # atomic swap

The discipline that matters: call mark_progress only when you advanced. If you design the heartbeat to "always fire on a timer," the pulse continues even while stalled, and you are back to liveness monitoring. A pulse must mean "moved forward," not "still alive."

In the agent body, advance the step at meaningful boundaries.

mark_progress(P, 1, "collection done")
# ... external API call (this is where it tends to freeze) ...
mark_progress(P, 2, "draft generated")
# ... formatting ...
mark_progress(P, 3, "verification passed")

✦

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN

✦Understand that 'is it alive' and 'is it progressing' are entirely different questions, and why liveness monitoring alone misses a silent stall

✦A complete implementation from progress heartbeat to no-progress timeout to safe termination (a Python watchdog plus a bash integration). Drop it straight into your own job

✦From the real experience of running an agent overnight as an indie developer and finding it frozen silently until morning, a clear rule for where to place progress markers

Secure payment via Stripe · Cancel anytime

✦

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

Unlock all articles with Membership →

Building a watchdog with a no-progress timeout

A separate watchdog process watches this marker. The rule is simple: if the time since the last advance exceeds a threshold and the step number has not moved, treat it as stalled.

# watchdog.py — if no progress persists, declare a stall and stop the target
import json, os, signal, time
 
def watch(progress_path: str, pidfile: str, stall_sec: int = 600, poll: int = 30):
    last_step, last_change = -1, time.time()
    while True:
        time.sleep(poll)
        try:
            d = json.load(open(progress_path))
        except (FileNotFoundError, json.JSONDecodeError):
            continue                      # not written yet, or mid-write
        if d["step"] != last_step:
            last_step, last_change = d["step"], time.time()   # observed an advance
            continue
        if time.time() - last_change > stall_sec:
            pid = int(open(pidfile).read().strip())
            terminate(pid)                # no progress past threshold -> stop
            return
 
def terminate(pid: int):
    # graceful first (give it time to clean up), then force if it lingers
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return
    time.sleep(20)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass

Choose stall_sec by feel: "for this work, silence this long is abnormal." For my generation job, the longest model call measured around 90 seconds, so I set the threshold to several times that, at 10 minutes (600 seconds). Too short kills legitimate long thinking; too long means you only notice in the morning.

Designing the way you stop it

Detecting it is not enough; a crude stop creates a different accident. Killing outright leaves a half-written artifact. So make it two-stage.

Send SIGTERM first and let the agent's signal handler clean up (delete temp files, release locks, record an "aborted" note to the progress marker).
If it still does not finish within a window, use SIGKILL to bring it down for certain.

Install a handler in the agent body that takes on the cleanup.

cleanup() {
  rm -f "$TMP_OUT" 2>/dev/null || true     # leave no half-written output
  release_lease 2>/dev/null || true        # let go of the lock
  echo '{"step":-1,"note":"aborted"}' > "$PROGRESS"
  exit 143                                  # 128 + SIGTERM
}
trap cleanup TERM INT

Leaving that "aborted" note in the progress marker means the next start can tell "last run died on a stall," giving you a starting point for investigation. In production, the presence or absence of that one line completely changed how long the next morning's investigation took.

Common pitfalls and how to avoid them

The first pitfall is decoupling the heartbeat from progress. Beat on a timer and the pulse continues during a stall, so the watchdog never fires. Always tie the pulse to a forward advance.

The second is putting the watchdog inside the same process as the agent. When the body freezes along with its event loop, the built-in timer freezes too. I recommend always running the watchdog as a separate process that observes through a shared file.

The third is using one stall_sec for all jobs. Apply the same threshold to a 30-second formatting job and a 5-minute large generation job and one of them will misfire. Holding a per-job-type threshold is realistic; in the Dolice Labs setup I keep stall_sec in each job's definition metadata and switch it per job.

Where to start

You do not need a perfect monitoring mesh from day one. What worked for me, from real experience, was the minimal version: stamp just three progress markers into the single job whose long freeze hurts most, and point a separate-process watchdog at it.

Run it overnight, and whether a SIGTERM ever fired tells you whether a silent stall is actually happening. If it did not, leave the threshold; if it fires often, treat the cause (usually the external call). Do not fully trust a green dashboard — watch the amount of forward motion. Switching to that single point, I believe, reliably reduces the morning surprise of "somehow it never finished."

Thank You for Reading

Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.