When Streaming Works Locally but Arrives All at Once in Production — Field Notes on Proxy Buffering and Silent Stalls
Stream Gemini through Antigravity over SSE and it flows token-by-token on localhost, then freezes for seconds and dumps the whole answer in production. Field notes on measuring the stall first, then killing proxy buffering, idle disconnects, and reconnect-driven double generation.
On my dev server, the generated text flowed out one character at a time, exactly as intended. The moment I deployed behind Cloudflare, users saw something else entirely: a few seconds of nothing, then the whole answer appearing at once. No errors. The logs showed clean 200s. Yet the experience had erased every reason I had bothered to implement streaming in the first place.
When streaming breaks in production, it almost always breaks as silence. No exception, no 5xx — just the time-to-first-token quietly getting worse. So the first move is not to start editing config files. It is to see, in numbers, where the data is piling up. These are my field notes from putting an Antigravity-to-Gemini streaming setup into production: the stalls I hit, and how I measured each one before fixing it.
Suspect the path, not the model
When streaming feels slow, the instinct is to blame the model. But in most cases the model has already emitted its first chunk in around half a second, and that chunk is simply being held somewhere on its way to the browser.
You only need two metrics. The first is TTFB — how long until the first chunk reaches the client. The second is the inter-chunk gap — the interval between consecutive chunks arriving. Record the timestamp when the server receives each chunk from the model, and separately when the browser paints it, and the contrast tells you immediately whether the stall lives in the model or in the path.
```python
# server: time each chunk as the model hands it to us
import time

from google import genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")


def measure_stream(prompt: str) -> None:
    t0 = time.monotonic()
    first = None  # TTFB (model -> server), in seconds
    last = t0
    gaps = []     # intervals between consecutive chunks, in seconds

    stream = client.models.generate_content_stream(
        model="gemini-3.5-flash",
        contents=prompt,
    )
    for chunk in stream:
        now = time.monotonic()
        if first is None:
            first = now - t0         # first chunk: this is the TTFB, not a gap
        else:
            gaps.append(now - last)  # inter-chunk gap
        last = now

    if first is None:
        print("no chunks received")
        return

    gaps.sort()
    p95 = gaps[int(len(gaps) * 0.95)] if gaps else 0.0  # nearest-rank p95
    print(
        f"TTFB(model->server)={first * 1000:.0f}ms "
        f"gap_p95={p95 * 1000:.0f}ms chunks={len(gaps) + 1}"
    )
```
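The client half of that comparison can be approximated from a terminal before you instrument the browser. The sketch below is a stand-in for the paint timestamps, not a replacement for them: it reads an SSE endpoint through the deployed path and reports the same two numbers. The helper names `summarize` and `probe` are illustrative assumptions. One caveat: urllib does not ask for a compressed response, so a hop that only buffers gzip or brotli output will not show up in this probe.

```python
import time
import urllib.request


def summarize(t0: float, arrivals: list[float]) -> dict:
    """TTFB and nearest-rank p95 inter-chunk gap, both in milliseconds."""
    if not arrivals:
        return {"ttfb_ms": None, "gap_p95_ms": None, "chunks": 0}
    gaps = sorted(b - a for a, b in zip(arrivals, arrivals[1:]))
    p95 = gaps[int(len(gaps) * 0.95)] if gaps else 0.0
    return {
        "ttfb_ms": round((arrivals[0] - t0) * 1000),
        "gap_p95_ms": round(p95 * 1000),
        "chunks": len(arrivals),
    }


def probe(url: str) -> dict:
    """Read an SSE stream over the real path and time each data line."""
    arrivals = []
    t0 = time.monotonic()
    req = urllib.request.Request(url, headers={"Accept": "text/event-stream"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        for raw in resp:  # one line at a time, as the bytes arrive
            if raw.startswith(b"data:"):
                arrivals.append(time.monotonic())
    return summarize(t0, arrivals)
```

Run `probe(...)` once against the origin directly and once against the public hostname. If the two TTFB values differ by seconds while the server-side number stays flat, the extra time is being spent on the path.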
If the server-side TTFB is 400–700ms but the browser feels like 4–6 seconds, the culprit is the path, full stop. In my measurements, path buffering alone inflated the perceived TTFB by 8–10x. The model is doing its job. Someone in between is hoarding the bytes.
Where chunks get held along the path
That "someone in between" is rarely a single party. Every hop the request passes through can buffer independently. Here are the spots that have bitten me, roughly in the order I ran into them.
| Where it buffers | Symptom | How to stop it |
| --- | --- | --- |
| Response compression (gzip/brotli) | Holds output until the compression buffer fills | Disable compression for SSE responses only |
| Nginx / reverse proxy | With proxy_buffering on, the proxy buffers the whole response | proxy_buffering off on that location |
| CDN (Cloudflare, etc.) | Edge buffers the body and forwards in bulk | Cache-Control: no-transform, no compression, keep chunked transfer |
| WSGI server (sync gunicorn worker) | Drains the generator fully before responding | Move to ASGI (uvicorn) + async generator |
| App write path | Missing flush leaves bytes in the OS buffer | Flush explicitly per chunk |
The key thing is that these are an AND condition. Turning off Nginx buffering alone does nothing if compression upstream is still hoarding. So don't give up after "I fixed one and it didn't help." Work top to bottom until the TTFB number actually moves. In production I triage in this order:
1. Disable compression for SSE responses only, then re-measure
2. Turn off reverse-proxy buffering on that location alone
3. Set the CDN to no-transform with chunked transfer kept, and confirm the edge isn't hoarding
If those three don't move the TTFB number, the cause narrows to your app's own missing flush or a sync WSGI worker. Keeping the order matters: if a lower layer is buffering while you fix an upper one, the fix won't show in the numbers and you'll misread it.
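Part of that triage can be read off the response headers before you touch any config. A minimal sketch, assuming you have already captured the headers of a streaming response as seen by the client; the function name `audit_sse_headers` and the wording of the findings are illustrative. A clean result does not prove that no hop buffers, only that none of them announces it. X-Accel-Buffering is deliberately not checked here, because Nginx consumes that header rather than forwarding it to the client.

```python
def audit_sse_headers(headers: dict[str, str]) -> list[str]:
    """Return warnings for client-visible headers that suggest buffering."""
    h = {k.lower(): v.lower() for k, v in headers.items()}
    findings = []

    if not h.get("content-type", "").startswith("text/event-stream"):
        findings.append("Content-Type is not text/event-stream")

    # a compressed body usually means some hop is filling a compression buffer
    encoding = h.get("content-encoding", "identity")
    if encoding != "identity":
        findings.append(f"response is compressed ({encoding})")

    if "no-transform" not in h.get("cache-control", ""):
        findings.append("Cache-Control lacks no-transform")

    # a known length suggests the full body was assembled before sending
    if "content-length" in h:
        findings.append("Content-Length is set on a streaming response")

    return findings
```

An empty list means the headers look right for streaming; each finding points at one row of the table above to revisit.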
Server side — an SSE stream that refuses to buffer
Here is the FastAPI (ASGI) shape with the right flush and anti-buffering headers. Setting media_type to text/event-stream is not enough for production. You need X-Accel-Buffering: no and Cache-Control: no-cache, no-transform as well, so that each hop is explicitly told: do not hold this.
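```python
import asyncio

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from google import genai

app = FastAPI()
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

HEARTBEAT_SECONDS = 15  # half of a 30s proxy idle timeout


def sse_data(text: str) -> str:
    # one "data:" line per line of text, so a newline inside a chunk
    # cannot terminate the event early
    return "".join(f"data: {line}\n" for line in text.split("\n")) + "\n"


async def sse_stream(prompt: str, request: Request):
    # a single comment line opens the path immediately -> lowers TTFB
    yield ": open\n\n"

    # async client, so waiting on the model never blocks the event loop
    stream = await client.aio.models.generate_content_stream(
        model="gemini-3.5-flash", contents=prompt
    )
    chunks = stream.__aiter__()
    pending = None
    try:
        while True:
            if pending is None:
                pending = asyncio.ensure_future(chunks.__anext__())
            # wait for the next chunk, but wake up after a stretch of silence
            done, _ = await asyncio.wait({pending}, timeout=HEARTBEAT_SECONDS)

            # if the client left, stop now (no wasted generation or billing)
            if await request.is_disconnected():
                break

            if not done:
                # silence -> heartbeat to survive idle disconnects
                yield ": ping\n\n"
                continue

            try:
                chunk = pending.result()
            except StopAsyncIteration:
                yield "event: done\ndata: [DONE]\n\n"
                break
            pending = None

            text = getattr(chunk, "text", "") or ""
            if text:
                yield sse_data(text)
    finally:
        if pending is not None and not pending.done():
            pending.cancel()


@app.get("/stream")
async def stream(prompt: str, request: Request):
    return StreamingResponse(
        sse_stream(prompt, request),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache, no-transform",
            "X-Accel-Buffering": "no",
            "Connection": "keep-alive",
        },
    )
```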
Three quick notes. The leading : open comment opens the path before any content arrives, shaving the perceived TTFB. The request.is_disconnected() check stops generation the instant a user closes the tab — without it, the model keeps generating a response nobody is reading while the token bill quietly climbs. And the heartbeat comment keeps a proxy from mistaking a long, silent thinking pause for a dead connection and cutting it.
Idle disconnect — the other kind of silence
Even after you've killed every buffer, the stream can still "freeze midway." This time the culprit is the idle timeout. Proxies and load balancers close connections that go quiet for too long. When the model enters a long reasoning step and emits nothing for tens of seconds, that silence trips the disconnect.
The nasty part is that the disconnect raises no error on either end. The server generator only notices at its next yield, if at all, and the client screen sits frozen on the last character it received.
The fix is the heartbeat above. In SSE, any line starting with a colon is treated as a comment and ignored, so sending : ping\n\n periodically keeps the connection alive without polluting the payload. I measure the production idle timeout first, then heartbeat at half that interval — a 30-second timeout means a beat every 15. In practice that margin made the disconnects essentially disappear.
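To see why the heartbeat never reaches application code on the client, here is a minimal sketch of the event-dispatch rules from the SSE specification, reduced to the `data`, `event`, and comment cases. It ignores `id`, `retry`, and other details a real parser handles, and `parse_sse` is a hypothetical helper, not part of any library.

```python
def parse_sse(raw: str) -> list[tuple[str, str]]:
    """Parse decoded SSE text into (event, data) pairs."""
    events = []
    event_type, data_lines = "message", []
    for line in raw.split("\n"):
        if line == "":
            # blank line: dispatch the event, but only if it carries data
            if data_lines:
                events.append((event_type, "\n".join(data_lines)))
            event_type, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line such as ": ping" is dropped entirely
        else:
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "data":
                data_lines.append(value)
            elif field == "event":
                event_type = value
    return events
```

Feeding it `": ping\n\ndata: hi\n\n"` yields a single `("message", "hi")` pair: the comment keeps bytes moving on the wire but produces no event.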
Reconnects quietly cause double generation
This was the deepest trap. The browser's EventSource reconnects automatically when the connection drops. That looks like a convenience, but paired with a generative API it is dangerous. Each reconnect calls your /stream again, and the same prompt fires a second generation. The user sees the text rewind and repeat; you get billed for the tokens twice.
Since EventSource offers no option to disable reconnection (short of calling close() yourself), the reliable place to block the duplicate is the server, by making generation idempotent. Carry an idempotency key on the request and refuse to start a fresh generation while one with the same key is in flight. Extending the guard to keys that have already finished needs a store with expiry, such as Redis with a TTL.
```python
import hashlib

from fastapi import Request
from fastapi.responses import StreamingResponse

# reuses `app` and `sse_stream` from the server example above;
# this route replaces the earlier /stream handler

# use Redis in production; in-memory here just to show the idea
_inflight: dict[str, str] = {}


def idem_key(prompt: str, cid: str) -> str:
    return hashlib.sha256(f"{cid}:{prompt}".encode()).hexdigest()[:16]


@app.get("/stream")
async def stream(prompt: str, cid: str, request: Request):
    key = idem_key(prompt, cid)
    if _inflight.get(key) == "running":
        # duplicate call from a reconnect: don't regenerate
        async def already():
            yield "event: dup\ndata: in-progress\n\n"

        return StreamingResponse(already(), media_type="text/event-stream")

    _inflight[key] = "running"

    async def guarded():
        try:
            async for evt in sse_stream(prompt, request):
                yield evt
        finally:
            _inflight.pop(key, None)

    return StreamingResponse(
        guarded(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache, no-transform", "X-Accel-Buffering": "no"},
    )
```
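The in-memory dict above forgets a key as soon as the stream ends, so it cannot recognise a reconnect that arrives just after completion. One way to cover that case is a small store that remembers both states and expires them. This is a minimal sketch: the class name `IdempotencyStore`, the state names, and the 300-second TTL are all chosen for illustration. In production the same shape maps onto Redis (an atomic set-if-absent with an expiry), which this single-process sketch does not attempt to reproduce.

```python
import time
from typing import Callable


class IdempotencyStore:
    """Tracks 'running' and 'done' keys, expiring each after ttl seconds."""

    def __init__(self, ttl: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock
        # key -> (state, expires_at)
        self._entries: dict[str, tuple[str, float]] = {}

    def try_start(self, key: str) -> bool:
        """Claim the key. Returns False if it is running or recently done."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return False
        self._entries[key] = ("running", now + self._ttl)
        return True

    def finish(self, key: str) -> None:
        """Mark a completed generation so late reconnects are still refused."""
        self._entries[key] = ("done", self._clock() + self._ttl)

    def abandon(self, key: str) -> None:
        """Release the key after a failure so a retry may regenerate."""
        self._entries.pop(key, None)
```

The expiry on the running state is a design choice: it bounds how long a crashed worker can keep a key locked, at the cost of allowing a duplicate if a single generation ever outlives the TTL.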
On the client, put a cid (an ID that uniquely identifies the conversation + message) in the URL, and remember to call close() yourself when you receive done. Without it, EventSource will try to reconnect even after a clean finish and keep hammering the idempotency guard for nothing.
```javascript
const cid = crypto.randomUUID();
const es = new EventSource(`/stream?cid=${cid}&prompt=${encodeURIComponent(prompt)}`);

es.onmessage = (e) => { output.textContent += e.data; };
es.addEventListener("done", () => es.close()); // always close on clean finish
es.onerror = () => es.close();                 // don't trust auto-reconnect
```
As an indie developer running an app business alongside several Dolice sites in parallel, I missed this double generation for a while myself. Nothing showed up in the error logs — only the month-end token usage came in higher than expected, and tracing it back led to the reconnects. Silent bugs, I was reminded here, tend to surface not from error monitoring but from a "the numbers don't add up" feeling.
Keep the measurement so regressions surface early
Fixing it once isn't the end. A CDN config change or a library update can revive path buffering later. So instead of fixing and forgetting, keep recording the p95 of TTFB and inter-chunk gap in production, with a threshold that alerts you. My primary alert signal is the gap between server-side TTFB and client-side TTFB. The moment that gap widens is the moment something on the path has started hoarding again.
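That alert can be as small as a rolling window over paired samples. A minimal sketch, assuming you already ship a server-side and a client-side TTFB for each request to one place; the class name `PathGapMonitor`, the window size, and the 1500 ms threshold are placeholders to tune against your own baseline.

```python
from collections import deque


class PathGapMonitor:
    """Alerts when the p95 of (client TTFB - server TTFB) crosses a threshold."""

    def __init__(self, threshold_ms: float = 1500.0, window: int = 200, min_samples: int = 20):
        self._threshold_ms = threshold_ms
        self._min_samples = min_samples
        self._gaps = deque(maxlen=window)  # most recent path delays, in ms

    def record(self, server_ttfb_ms: float, client_ttfb_ms: float) -> None:
        # measurement noise can make the difference slightly negative; clamp it
        self._gaps.append(max(0.0, client_ttfb_ms - server_ttfb_ms))

    def p95(self) -> float:
        ordered = sorted(self._gaps)
        return ordered[int(len(ordered) * 0.95)] if ordered else 0.0

    def should_alert(self) -> bool:
        # stay quiet until the window holds enough samples to be meaningful
        return len(self._gaps) >= self._min_samples and self.p95() > self._threshold_ms
```

Tracking the difference rather than the raw client TTFB keeps a slow model day from paging you: only time added on the path moves this number.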
For your next step, drop the measurement snippet into your live SSE endpoint and record the server-side and browser-side TTFB side by side, just once. The difference between those two numbers will tell you, in the first five minutes, where the silence is hiding in your particular stack.