When the Edge Cache Pinned Next.js Error Pages: A cache-worker Guard Design

Users reported intermittent 'failed to load' errors I could never reproduce. The cause: SSR exceptions shipped as HTTP 200 and pinned by the edge cache. Here is how I narrowed it down with an Antigravity agent and added a cache-worker guard to stop it.

"Once in a while the site shows a 'failed to load' error, but a reload fixes it." Running a set of technical blogs as an indie developer, I started getting reports like this. The frustrating part was that I could never reproduce it. Dozens of reloads, always fine. My logs showed almost no 5xx responses.

The cause, as it turned out, was this: Next.js was shipping an exception thrown mid-SSR as an HTTP 200, and Cloudflare's edge cache was pinning that broken response for hours. Because it returns 200, monitoring never flags it; because it gets cached, only a subset of visitors keeps hitting the broken page. That combination is genuinely hard to observe. Below is how I traced it and the cache-worker guard I added so it would not happen again.

The blind spot: an error page that returns 200

App Router streams the body. The status code is flushed before the body, so once 200 OK has gone out, an exception thrown later during React rendering can no longer rewrite the status. Instead, the error.tsx UI is streamed as a continuation of the same body.

So from the visitor's side it is a "failed to load" screen, while the HTTP status is 200 OK. That was the first blind spot. A dashboard that counts 5xx sees nothing wrong.
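
A toy model makes the ordering concrete. This is not Next.js's actual streaming code; `streamPage`, `Sent`, and the chunk strings are hypothetical names, used only to show that once the status has been committed, a later exception can change nothing but the body.

```typescript
// Toy model of a streamed response: the status is committed first,
// then body chunks follow. Not Next.js internals; all names are illustrative.
type Sent = { status: number; chunks: string[] };

function streamPage(steps: Array<() => string>, errorUi: string): Sent {
  // The status line and headers go out before rendering has finished.
  const sent: Sent = { status: 200, chunks: [] };
  try {
    for (const step of steps) {
      sent.chunks.push(step()); // each chunk is flushed as soon as it is ready
    }
  } catch {
    // Too late to switch to a 500: the only option left is to append the error UI.
    sent.chunks.push(errorUi);
  }
  return sent;
}

// A render whose second step throws after the shell has already been flushed.
const result = streamPage(
  [
    () => "<html><body><header>Blog</header>",
    () => {
      throw new Error("data fetch failed mid-render");
    },
  ],
  '<div data-error-boundary="1">Failed to load</div>',
);
// result.status is still 200, and the body now ends with the error UI.
```

In this model the monitoring layer only ever sees `result.status`, which is exactly why a 5xx counter stays flat.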

There was a second path too. Article bodies load static HTML through getCloudflareContext().env.ASSETS.fetch(). During the brief moment a deploy flips over, that ASSETS read can come up empty. No exception is thrown, but the article page renders with an empty body and still returns 200. The error screen and the empty-body page look different, but they fall into the same hole: a "broken 200."

Why the edge cache made it worse

Layer the Cloudflare Workers cache-worker on top and the problem amplifies. The cache-worker at the time was naive — it cached any 200 HTML to the edge for four hours, essentially unconditionally.

// The naive version that caused the problem
const res = await fetch(request);
if (res.status === 200 && isHTML(res)) {
  const toCache = res.clone();
  ctx.waitUntil(cache.put(request, toCache)); // stored without inspecting the body
}
return res;

Deploy flips happen a dozen-plus times a day. If a "broken 200" generated in that window happens to land in the cache, then for the next four hours only visitors routed to that edge location keep getting the broken page. The reason it never reproduced for me was simple: I was being served a healthy cache from a different location. The "a reload fixes it" reports also line up once you account for cache variance and deploy intervals.

The docs say "200 is cacheable," but under streaming SSR, "status 200" and "the body is intact" are two different things. Conflating them was the design mistake.
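
Both the naive worker above and the guarded version later in the article call `isHTML(res)` without showing it. A minimal sketch of what such a helper could look like, assuming it only needs to inspect the Content-Type header; the `HasHeaders` type is a hypothetical stand-in so the function accepts a real `Response` as well as a test double.

```typescript
// Hypothetical helper: the article calls isHTML(res) but does not show its body.
// Accepts anything with a Headers-like get(), so a real Response works too.
type HasHeaders = { headers: { get(name: string): string | null } };

function isHTML(res: HasHeaders): boolean {
  const contentType = res.headers.get("content-type") ?? "";
  // Match the media type only, ignoring parameters such as "; charset=utf-8".
  return /^\s*text\/html\s*(;|$)/i.test(contentType);
}
```

Note that this says nothing about whether the body is intact, which is the point of the paragraph above: a header check and a status check together still cannot tell a healthy page from a broken 200.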

Narrowing it down with an Antigravity agent

Since I could not reproduce it, the only path forward was to form hypotheses from logs and behavior and knock them down one by one. I used an Antigravity agent for this. Letting it read the whole codebase and then handing the investigation off as a single task pays off especially well for these non-reproducible bugs.

The instruction I gave looked roughly like this:

Symptom: users report intermittent "failed to load" / empty body. Cannot reproduce locally.
Please:
1. Confirm which status code error.tsx / global-error.tsx are served with.
2. Trace how the page renders when the ASSETS-backed body read fails.
3. Enumerate every path by which the cache-worker could store a "broken 200".
Present hypotheses and repro conditions, citing the exact lines of code as evidence.

The agent showed, from the code, that error.tsx renders into the body as an error boundary while the response stays 200, and it flagged that the cache-worker's store condition depended on status === 200 alone. What helped was that it surfaced a path a human tends to unconsciously rule out — "surely an error page would never get cached" — without that bias.

My own rule here is not to take the agent's hypothesis on faith. I opened the cited lines myself, planted a temporary marker string in error.tsx, deployed to staging, and confirmed two things before moving on to a fix: curl -sI showed a 200 status, and a plain curl -s (a full GET, since -I fetches headers only) returned a body containing the marker. The harder a bug is to reproduce, the more I avoid skipping the verification step.

Fix 1: give the error page a detectable marker

For a cache layer to decide "this page is broken," the error UI itself first needs a machine-readable marker. I added a single data attribute to the root of error.tsx and global-error.tsx.

// app/[locale]/error.tsx
"use client";
 
export default function Error({
  error,
  reset,
}: {
  error: Error & { digest?: string };
  reset: () => void;
}) {
  return (
    // data-error-boundary is the signal the cache layer keys off
    <div data-error-boundary="1" className="error-shell">
      <h1>Failed to load</h1>
      <button onClick={() => reset()}>Reload</button>
    </div>
  );
}

With the same attribute on global-error.tsx, the rule becomes a one-liner: "do not cache any HTML whose body contains data-error-boundary."
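
One caveat with a plain substring rule: it also matches a page that merely talks about the attribute, for example a post that quotes this error.tsx in a code sample. That errs on the side of not caching, so it is safe, but such a page would never be stored. If that matters, a stricter check could match the attribute only inside a real opening tag. The sketch below is an assumption, not the rule used in this article; `hasErrorBoundaryTag` is a hypothetical name, and it assumes attribute values never contain a literal `>`.

```typescript
// Hypothetical stricter check: match the attribute only inside a real opening tag.
// Assumes no attribute value in that tag contains a literal ">".
const ERROR_BOUNDARY_TAG = /<[a-zA-Z][^<>]*\sdata-error-boundary[\s=>]/;

function hasErrorBoundaryTag(html: string): boolean {
  return ERROR_BOUNDARY_TAG.test(html);
}

// A rendered error page carries the attribute on an actual element.
const errorPage =
  '<html><body><div data-error-boundary="1" class="error-shell"></div></body></html>';

// An article that quotes the attribute has it HTML-escaped inside <code>.
const articleAboutIt =
  "<html><body><code>&lt;div data-error-boundary=&quot;1&quot;&gt;</code></body></html>";
```

Here `hasErrorBoundaryTag(errorPage)` is true and `hasErrorBoundaryTag(articleAboutIt)` is false, while a bare `includes("data-error-boundary")` is true for both.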

Fix 2: a cache-worker guard that rejects broken 200s

This is the core. I rewrote the store condition to inspect the body, not just the status. I deliberately kept the checks crude and biased toward not caching.

// Guard so a broken 200 never gets cached
function isCacheableHtml(bodyText) {
  // (1) never store a page carrying the error-boundary marker
  if (bodyText.includes('data-error-boundary')) return false;
 
  // (2) HTML cut off mid-stream (truncated streaming) — do not store
  if (!bodyText.includes('</html>')) return false;
 
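  // (3) empty article container: likely a failed ASSETS read.
  // The lazy match stops at the first </div>, so this assumes the container
  // does not open with a nested <div>; the length counts inner markup too.
  const m = bodyText.match(/<div[^>]*class="[^"]*article-content[^"]*"[^>]*>([\s\S]*?)<\/div>/);
  if (m && m[1].trim().length < 20) return false;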
 
  return true;
}
 
async function handleResponse(request, res, ctx, cache) {
  if (res.status !== 200 || !isHTML(res)) {
    return res; // non-HTML and non-200 are never stored
  }
  // read the body from a clone so the original stream stays unread for the visitor;
  // note that awaiting text() holds the response back until the whole page has streamed
  const cloned = res.clone();
  const bodyText = await cloned.text();
 
  if (isCacheableHtml(bodyText)) {
    // the stored copy is rebuilt from the text, reusing res's status and headers
    ctx.waitUntil(cache.put(request, new Response(bodyText, res)));
  }
  // even if broken, serve the raw response for this request (recover next time)
  return res;
}

Three things matter. First, the signals are limited to crude, stable, surface-level ones: the marker, the presence of </html>, and the content container. Make the condition clever and the guard itself becomes the next source of bugs. Second, I read the body via res.clone() and still return the original res to the visitor — a body can be read only once, so getting this wrong breaks healthy pages too. Third, I accept the tradeoff of "serve the broken page for this one request, but keep it out of the cache." As long as pinning is prevented, the next request can pull a healthy version.
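
Since the original problem was one of observability, it can help to know why the guard skipped a page. The following is a sketch of a variant that returns the rejection reason instead of a boolean; `cacheRejectReason` and `RejectReason` are hypothetical names, and the three checks simply mirror `isCacheableHtml` above.

```typescript
// Hypothetical variant of the guard that reports why a page was rejected,
// so the worker can log the reason. The three checks mirror isCacheableHtml.
type RejectReason = "error-marker" | "truncated" | "empty-content";

function cacheRejectReason(bodyText: string): RejectReason | null {
  // (1) error-boundary marker present
  if (bodyText.includes("data-error-boundary")) return "error-marker";

  // (2) stream was cut off before the closing tag
  if (!bodyText.includes("</html>")) return "truncated";

  // (3) article container exists but is (nearly) empty
  const m = bodyText.match(
    /<div[^>]*class="[^"]*article-content[^"]*"[^>]*>([\s\S]*?)<\/div>/,
  );
  if (m && (m[1] ?? "").trim().length < 20) return "empty-content";

  return null; // null means the page is safe to store
}

// Usage inside the worker (sketch):
//   const reason = cacheRejectReason(bodyText);
//   if (reason === null) ctx.waitUntil(cache.put(request, new Response(bodyText, res)));
//   else console.warn("skip cache:", reason, request.url);
```

Logging the reason keeps the checks just as crude, while making it possible to tell how often each kind of broken 200 actually occurs.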

Fix 3: a single retry on transient ASSETS failures

I also addressed the empty-body root cause. The ASSETS miss during a deploy transition is momentary, so a single retry after a very short delay fills the body in almost every case.

// content.ts — retry the ASSETS read once
async function readStaticAsset(path: string): Promise<string | null> {
  const env = getCloudflareContext().env;
  for (let attempt = 0; attempt < 2; attempt++) {
    const res = await env.ASSETS.fetch(new Request(`https://assets.local${path}`));
    if (res.ok) {
      const text = await res.text();
      if (text.trim().length > 0) return text;
    }
    if (attempt === 0) await new Promise((r) => setTimeout(r, 50)); // wait just 50ms
  }
  return null; // failed both times → caller returns a 5xx
}

It is tempting to add more retries, but I keep it at one (two attempts in total). The transition miss is inherently transient; more attempts only add latency without meaningfully improving the success rate. In practice, a single retry dropped empty-body occurrences to nearly zero.

Fix 4: when it really fails, return 5xx with no-store

If the retry still cannot fetch the body, I stop papering over it and return an honest error. The crucial part is attaching Cache-Control: no-store to the 5xx. Forget that, and now the failure response itself can get cached.

if (content === null) {
  return new Response("Temporarily unavailable", {
    status: 503,
    headers: { "Cache-Control": "no-store" },
  });
}

On top of that, the cache-worker holds a DEPLOY_VERSION constant that I bump on every deploy. That mixes the version into the cache key, so any broken page that somehow got pinned earlier is invalidated at deploy time. It helps to separate the roles: the guard exists to keep broken things out, and DEPLOY_VERSION exists to not retain past ones.
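
Mixing the version into the cache key can be sketched as follows. The `__v` query parameter, the `versionedCacheKey` name, and the placeholder version string are all assumptions for illustration; the article does not show how its own key is built.

```typescript
// Placeholder value; in the article this constant is bumped on every deploy.
const DEPLOY_VERSION = "build-001";

// Hypothetical helper: derive a cache key that changes whenever the version changes.
function versionedCacheKey(requestUrl: string, version: string): string {
  const url = new URL(requestUrl);
  // "__v" is an assumed parameter name; pick one the origin never uses.
  url.searchParams.set("__v", version);
  return url.toString();
}

// Usage inside the worker (sketch, the Cache API accepts a URL string as key):
//   const key = versionedCacheKey(request.url, DEPLOY_VERSION);
//   const hit = await cache.match(key);
//   ...
//   ctx.waitUntil(cache.put(key, responseToStore));
```

Entries stored under an older version are never matched again after a deploy, so they age out on their own without an explicit purge.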

Pitfalls and the order to check things

Two traps I hit. First, I once forgot res.clone(), read the body directly with text(), and returned an empty response to the visitor — a Workers body stream is consumed once, so always split it into store-copy and return-copy. Second, I made the empty-container check too loose and misclassified genuinely short articles as "empty." I tuned the threshold to around 20 characters against real data.

A fast order to check: start with curl -sI to see which status the page is served with; then run curl -s against the same URL and pipe it through grep data-error-boundary to see whether that 200 is actually carrying the error UI; finally, bump DEPLOY_VERSION on staging and watch the broken cache expire. For bugs that never reproduce locally, having a check sequence that closes off each "path into the cache" one at a time removes a lot of the guesswork.
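
The first two checks can also be folded into a small probe for repeated runs. This is only a sketch: `diagnose` and `Verdict` are hypothetical names, and the classification is deliberately as crude as the guard itself.

```typescript
// Hypothetical probe logic: classify a fetched page the same way the manual checks do.
type Verdict = "healthy" | "error-ui-as-200" | "truncated-200" | "non-200";

function diagnose(status: number, body: string): Verdict {
  // Anything other than 200 is at least visible to ordinary monitoring.
  if (status !== 200) return "non-200";
  // A 200 that carries the marker is the error UI shipped as a success.
  if (body.includes("data-error-boundary")) return "error-ui-as-200";
  // A 200 without the closing tag was cut off mid-stream.
  if (!body.includes("</html>")) return "truncated-200";
  return "healthy";
}

// Usage (sketch, e.g. in a script run against staging):
//   const res = await fetch(url);
//   console.log(url, diagnose(res.status, await res.text()));
```

Because the edge cache differs by location, a single healthy verdict from one machine does not prove that no location holds a broken copy.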

If you want a more systematic grounding in Cloudflare Workers and ASSETS behavior, I'd suggest reading it alongside Practical Notes on Cloudflare Workers Edge SaaS Architecture and Three Weeks Chasing a Workers Bug with wrangler tail, which together make the edge layer easier to picture.

If you are facing the same "can't reproduce it, but users report it" symptom, start by checking which status code error.tsx is served with via curl -sI. The moment that turns out to be 200, half the mystery is already solved. Thanks for reading.
