App Development · 2026-06-28 · Advanced

When Streaming Works Locally but Arrives All at Once in Production — Field Notes on Proxy Buffering and Silent Stalls

Stream Gemini through Antigravity over SSE and it flows token-by-token on localhost, then freezes for seconds and dumps the whole answer in production. Field notes on measuring the stall first, then killing proxy buffering, idle disconnects, and reconnect-driven double generation.

On my dev server, the generated text flowed out one character at a time, exactly as intended. The moment I deployed behind Cloudflare, users saw something else entirely: a few seconds of nothing, then the whole answer appearing at once. No errors. The logs showed clean 200s. Yet the experience had erased every reason I had bothered to implement streaming in the first place.

When streaming breaks in production, it almost always breaks as silence. No exception, no 5xx — just the time-to-first-token quietly getting worse. So the first move is not to start editing config files. It is to see, in numbers, where the data is piling up. These are my field notes from putting an Antigravity-to-Gemini streaming setup into production: the stalls I hit, and how I measured each one before fixing it.

Suspect the path, not the model

When streaming feels slow, the instinct is to blame the model. But in most cases the model has already emitted its first chunk in around half a second, and that chunk is simply being held somewhere on its way to the browser.

You only need two metrics. The first is TTFB (time to first byte), used here to mean how long it takes the first chunk to reach the client. The second is the inter-chunk gap: the interval between consecutive chunks arriving. Record the timestamp when the server receives each chunk from the model, and separately when the browser paints it. Comparing the two sets of numbers tells you immediately whether the stall lives in the model or in the path.

# server: time each chunk as the model hands it to us
from google import genai
import time

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

def measure_stream(prompt: str):
    t0 = time.monotonic()
    first = None
    last = t0
    gaps = []
    stream = client.models.generate_content_stream(
        model="gemini-3.5-flash",
        contents=prompt,
    )
    for chunk in stream:
        now = time.monotonic()
        if first is None:
            first = now - t0           # TTFB (model -> server)
        else:
            gaps.append(now - last)    # inter-chunk gap; the first chunk is TTFB, not a gap
        last = now
    if first is None:
        # Empty stream: nothing to time, and first*1000 below would raise
        print("no chunks received")
        return
    gaps.sort()
    p95 = gaps[int(len(gaps) * 0.95)] if gaps else 0.0
    print(f"TTFB(model->server)={first*1000:.0f}ms  gap_p95={p95*1000:.0f}ms  chunks={len(gaps) + 1}")

If the server-side TTFB is 400–700ms but the browser feels like 4–6 seconds, the culprit is the path, full stop. In my measurements, path buffering alone inflated the perceived TTFB by 8–10x. The model is doing its job. Someone in between is hoarding the bytes.
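To get the other half of the comparison without opening a browser, the same two numbers can be taken from the client side of the path. The following is a minimal sketch, not the measurement setup used for the figures above: `summarize` and `measure_endpoint` are hypothetical names, the URL is whatever your SSE endpoint happens to be, and the script times arrival at a Python client rather than paint time in a browser.

```python
import time
import urllib.request
from typing import List, Tuple


def summarize(t0: float, arrivals: List[float]) -> Tuple[float, float]:
    """Return (ttfb_seconds, gap_p95_seconds) from chunk arrival times."""
    if not arrivals:
        raise ValueError("no chunks arrived")
    ttfb = arrivals[0] - t0
    # Gaps between consecutive arrivals; the first chunk counts as TTFB only.
    gaps = sorted(b - a for a, b in zip(arrivals, arrivals[1:]))
    p95 = gaps[int(len(gaps) * 0.95)] if gaps else 0.0
    return ttfb, p95


def measure_endpoint(url: str, timeout: float = 60.0) -> Tuple[float, float]:
    """Time each SSE 'data:' line as it reaches this client (url is an assumption)."""
    req = urllib.request.Request(url, headers={"Accept": "text/event-stream"})
    t0 = time.monotonic()
    arrivals: List[float] = []
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        for line in resp:  # yields lines as the response body arrives
            if line.startswith(b"data:"):
                arrivals.append(time.monotonic())
    return summarize(t0, arrivals)
```

Run it once against the app server directly and once against the public hostname; a TTFB that jumps only on the public run points at the path. One caveat: this client does not send `Accept-Encoding`, so a hop that compresses only for browsers may behave differently here than it does for real users.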

Where chunks get held along the path

That "someone in between" is rarely a single party. Every hop the request passes through can buffer independently. Here are the spots that have bitten me, roughly in the order I ran into them.

| Where it buffers | Symptom | How to stop it |
| --- | --- | --- |
| Response compression (gzip/brotli) | Holds output until the compression buffer fills | Disable compression for SSE responses only |
| Nginx / reverse proxy | proxy_buffering on buffers the whole response | proxy_buffering off on that location |
| CDN (Cloudflare, etc.) | Edge buffers the body and forwards in bulk | Cache-Control: no-transform, no compression, keep chunked transfer |
| WSGI server (sync gunicorn worker) | Drains the generator fully before responding | Move to ASGI (uvicorn) + async generator |
| App write path | Missing flush leaves bytes in the OS buffer | Flush explicitly per chunk |
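The compression row is easy to reproduce in isolation. The sketch below uses Python's `zlib` as a stand-in for whatever compressor a proxy or middleware runs (real ones have their own buffer sizes and flush policies), and `released_sizes` is a hypothetical helper. It reports how many bytes the compressor hands downstream after each small SSE event, with and without an explicit sync flush.

```python
import zlib
from typing import List


def released_sizes(chunks: List[bytes], flush_each: bool) -> List[int]:
    """Bytes the gzip compressor releases downstream after each input chunk."""
    comp = zlib.compressobj(wbits=31)  # wbits=31 selects the gzip container
    sizes = []
    for chunk in chunks:
        out = comp.compress(chunk)
        if flush_each:
            # Force the pending data out so the receiver can decode this event now.
            out += comp.flush(zlib.Z_SYNC_FLUSH)
        sizes.append(len(out))
    return sizes


events = [b"data: Hel\n\n", b"data: lo\n\n", b"data: world\n\n"]
print("no flush:  ", released_sizes(events, flush_each=False))
print("sync flush:", released_sizes(events, flush_each=True))
```

Without a flush, the events after the first release nothing at all; the first call only emits the gzip header. That matches the symptom in the table: the payload sits in the compressor until enough data accumulates or the stream ends.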

The key point is that these conditions are ANDed: the stream flows only when every hop passes chunks through. Turning off Nginx buffering alone does nothing if a compression layer elsewhere on the path is still hoarding. So don't give up after "I fixed one and it didn't help." Work top to bottom until the TTFB number actually moves. In production I triage in this order:

  1. Disable compression for SSE responses only, then re-measure
  2. Turn off reverse-proxy buffering on that location alone
  3. Set the CDN to no-transform with chunked transfer kept, and confirm the edge isn't hoarding

If those three don't move the TTFB number, the cause narrows to your app's own missing flush or a sync WSGI worker. Keeping the order matters: if a lower layer is buffering while you fix an upper one, the fix won't show in the numbers and you'll misread it.
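On the app side, the last two rows of the table come down to emitting one response write per chunk from an async generator. Here is a minimal sketch as a bare ASGI callable, under several assumptions: `fake_model_stream` is a hypothetical stand-in for the model stream, `format_sse` is a hypothetical helper, and the `X-Accel-Buffering: no` header is an Nginx-specific hint that is not mentioned in the table. It complements the proxy and CDN settings rather than replacing them.

```python
import asyncio
from typing import AsyncIterator

# Response headers that ask intermediaries not to hold or rewrite the stream.
SSE_HEADERS = [
    (b"content-type", b"text/event-stream"),
    (b"cache-control", b"no-cache, no-transform"),
    (b"x-accel-buffering", b"no"),  # assumption: Nginx sits in front and honors this
]


def format_sse(data: str) -> bytes:
    """Encode one SSE message; each line of data gets its own 'data:' field."""
    body = "".join(f"data: {line}\n" for line in data.split("\n"))
    return (body + "\n").encode("utf-8")


async def fake_model_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for the chunks coming back from the model."""
    for piece in ("Hel", "lo", " world"):
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield piece


async def app(scope, receive, send):
    """Bare ASGI app: one body message per chunk, never waiting for the full answer."""
    if scope["type"] != "http":
        return
    await send({"type": "http.response.start", "status": 200, "headers": SSE_HEADERS})
    async for piece in fake_model_stream():
        await send({
            "type": "http.response.body",
            "body": format_sse(piece),
            "more_body": True,  # keep the response open for the next chunk
        })
    await send({"type": "http.response.body", "body": b"", "more_body": False})
```

Each `send` with `more_body=True` hands the chunk to the ASGI server right away instead of accumulating it, which is the behavior a sync WSGI worker that drains the generator cannot give you. Whether those bytes then reach the browser promptly still depends on every hop after the server.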
