Antigravity Lab
App Development · 2026-04-23 · Advanced

Semantic Caching for LLM Responses in Antigravity — A Production Guide to Cutting Costs by 80% with Similarity-Based Reuse

A production guide to building a semantic LLM response cache with Antigravity, pgvector, and Gemini. Covers threshold tuning, production pitfalls, monitoring metrics, and runnable implementation code.

Have you ever stared at your LLM invoice and realized you're paying full price every time a user asks roughly the same question in slightly different words? I ran into exactly this problem when the Gemini bill for a small side-project chatbot quietly tripled over one month. Digging through the logs, the pattern was obvious: hundreds of semantically identical questions a day, each phrased just differently enough to miss my Redis key-value cache.

A traditional exact-match cache is blind to meaning. This guide walks through building a semantic cache — one that uses embeddings to match queries by intent — using Antigravity as your implementation partner, and taking it all the way to a version that survives production. The finished implementation is modest: pgvector, FastAPI, Gemini. What takes longer is the threshold tuning and the pitfalls you want to avoid. I've tried to document the ones I stumbled into personally, so you don't have to.

By the end, you'll have a working implementation you can paste into an existing FastAPI service, a measurement scaffold that makes threshold tuning data-driven rather than gut-feel, and a practical map of the production hazards that separate "works on my laptop" from "pays for itself three times over each month."

Why LLMs need meaning-based caching, not exact matching

Traditional caching treats lookups as string comparisons. HTTP response caches and Redis both key off URLs or query strings, returning the stored value only when the key matches exactly. This works beautifully for classical workloads — serving the same catalog page to a thousand users, caching the result of a database query — because the inputs are identical across requests. LLM users break this assumption by design, because natural language has enormous expressive redundancy.

Consider these four questions from a real support-bot log, all expressing the same intent:

  • "How do I cancel my subscription?"
  • "Where can I unsubscribe?"
  • "I want to stop billing."
  • "Can you help me quit?"

Four strings with almost no wording in common. An exact-match cache hits 0% of the time, even though a human support agent would reuse the same answer verbatim for all four. Run them through a text-embedding model, however, and their pairwise cosine similarities fall between 0.89 and 0.94. The premise of semantic caching is that two queries that sit close together in embedding space can share a response. Under that premise, once the first query has been answered by the LLM, the other three can be served from cache instead of triggering fresh LLM calls.
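To make "close in embedding space" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration and are not real embeddings; a real model would return hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings (assumption, not model output).
cancel = [0.9, 0.1, 0.2]       # "How do I cancel my subscription?"
unsubscribe = [0.8, 0.2, 0.3]  # "Where can I unsubscribe?"
invoice = [0.1, 0.9, 0.1]      # an unrelated billing question

print(round(cosine_similarity(cancel, unsubscribe), 3))  # high: same intent
print(round(cosine_similarity(cancel, invoice), 3))      # low: different intent
```

Values near 1.0 mean the vectors point in nearly the same direction, which is the signal a semantic cache uses to decide that two queries can share a response.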

Three concrete benefits follow. The first is cost: you stop paying for repeat inference. For a bot handling 10,000 queries a day at roughly $0.001 per Gemini Flash call, the bill is about $300 per month, so an 80% hit rate saves around $240 per month, which is often enough to fund the Postgres instance and still leave change. The second is latency: LLM calls routinely take 800 to 2,000 milliseconds, while a vector similarity lookup plus a cache read lands under 50 milliseconds. That's not just "faster"; it's the difference between "feels like a chat" and "feels like a form submission." The third, often overlooked, is response stability. LLMs have bad days, regional outages, and quality drift between model versions. Cache hits reproduce past good answers instead of gambling on fresh ones, which is a quiet but real improvement in consistency.
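The savings figure above is simple arithmetic, and it is worth keeping as a function so you can plug in your own traffic. This sketch assumes a 30-day month and ignores embedding and storage costs; the function name is illustrative.

```python
def monthly_savings(
    queries_per_day: int,
    cost_per_call: float,
    hit_rate: float,
    days: int = 30,
) -> float:
    """Gross LLM spend avoided per month (embedding and storage costs ignored)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return queries_per_day * days * cost_per_call * hit_rate


# The article's example: 10,000 queries/day, $0.001 per call, 80% hit rate.
print(f"${monthly_savings(10_000, 0.001, 0.80):.2f}")  # prints $240.00
```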

The trade-off, of course, is that "close enough in meaning" can shade into "different intent." A query asking "how do I cancel my account?" sits in the same neighborhood as "how do I cancel my last order?" in embedding space, and mis-routing between the two is a real product bug. Much of this article is about making that trade-off quantitative rather than hopeful.
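One way to make that trade-off quantitative is to label a sample of query pairs by hand and sweep candidate thresholds over them. The sketch below is an assumption about how such a measurement could look, not the article's own scaffold; the labeled pairs are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPair:
    similarity: float  # cosine similarity between the two queries
    same_intent: bool  # human label: may the two queries share one response?


def sweep(
    pairs: list[LabeledPair], thresholds: list[float]
) -> list[tuple[float, float, float]]:
    """For each threshold, return (threshold, hit_rate, false_hit_rate).

    hit_rate: share of all pairs that would be served from cache.
    false_hit_rate: share of those hits whose intents actually differ.
    """
    rows = []
    for t in thresholds:
        hits = [p for p in pairs if p.similarity >= t]
        wrong = [p for p in hits if not p.same_intent]
        hit_rate = len(hits) / len(pairs) if pairs else 0.0
        false_hit_rate = len(wrong) / len(hits) if hits else 0.0
        rows.append((t, hit_rate, false_hit_rate))
    return rows


# Hypothetical labels, invented for illustration.
pairs = [
    LabeledPair(0.94, True),
    LabeledPair(0.91, True),
    LabeledPair(0.90, False),  # e.g. "cancel my account" vs "cancel my last order"
    LabeledPair(0.86, True),
    LabeledPair(0.80, False),
]
results = sweep(pairs, [0.85, 0.90, 0.93])
for t, hit, false_hit in results:
    print(f"threshold={t:.2f} hit_rate={hit:.2f} false_hit_rate={false_hit:.2f}")
```

Raising the threshold lowers both numbers at once, so the useful question is which false-hit rate your product can tolerate, and what hit rate that leaves you with.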

One more framing matters before we go deeper. Semantic caching is not a silver bullet — it's best suited to workloads where the same intent is expressed many times with variation. Support bots, FAQ systems, onboarding assistants, and technical documentation search all fit well. Code generation, creative writing, and highly personalized chat do not. If your users rarely repeat intents, spend your engineering budget elsewhere.

Architecture — start minimal, grow deliberately

Before reaching for complexity, nail down the smallest viable version. A semantic cache boils down to five steps:

  1. Receive a user query.
  2. Convert it to an embedding vector.
  3. Run a top-1 similarity search in a vector store.
  4. If similarity exceeds the threshold, return the cached response.
  5. Otherwise, call the LLM and store the query-response pair with its embedding.
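The five steps above can be sketched with an in-memory list standing in for the vector store. Everything here is an illustrative assumption: `SemanticCache`, `toy_embed`, and `fake_llm` are hypothetical names, and the toy embedding just counts words from two hand-picked groups so the example runs without any API.

```python
import math
from typing import Callable

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(
        self,
        embed: Callable[[str], Vector],
        call_llm: Callable[[str], str],
        threshold: float,
    ) -> None:
        self.embed = embed
        self.call_llm = call_llm
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []  # stand-in for the vector store

    def answer(self, query: str) -> tuple[str, bool]:
        """Step 1: receive a query. Returns (response, served_from_cache)."""
        vector = self.embed(query)  # step 2: embed the query
        # Step 3: top-1 similarity search (linear scan in this sketch).
        best = max(self.entries, key=lambda e: cosine(vector, e[0]), default=None)
        # Step 4: reuse the cached response if similarity exceeds the threshold.
        if best is not None and cosine(vector, best[0]) > self.threshold:
            return best[1], True
        # Step 5: otherwise call the LLM and store the pair with its embedding.
        response = self.call_llm(query)
        self.entries.append((vector, response))
        return response, False


def toy_embed(text: str) -> Vector:
    """Toy stand-in for an embedding model (not a real one)."""
    words = text.lower().replace("?", "").replace(".", "").split()
    cancel_words = {"cancel", "unsubscribe", "stop", "quit"}
    billing_words = {"invoice", "receipt"}
    return [
        float(sum(w in cancel_words for w in words)),
        float(sum(w in billing_words for w in words)),
        1.0,  # constant component so no query maps to the zero vector
    ]


calls: list[str] = []


def fake_llm(query: str) -> str:
    """Stand-in for the LLM call; records each query it receives."""
    calls.append(query)
    return f"answer #{len(calls)}"


cache = SemanticCache(toy_embed, fake_llm, threshold=0.9)
first = cache.answer("How do I cancel my subscription?")
second = cache.answer("Where can I unsubscribe?")
third = cache.answer("Show my invoice")
print(first, second, third, len(calls))
```

In this run the second query reuses the first answer, while the invoice question falls below the threshold and triggers a second LLM call. Injecting `embed` and `call_llm` as callables keeps the flow testable without network access.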

For the store, I recommend PostgreSQL with the pgvector extension. The reasoning is prosaic: you almost certainly already have a Postgres instance, and pgvector integrates cleanly with the app database you're already backing up, authenticating against, and monitoring. The pgvector RAG pipeline guide covers the basics, but the relevant point here is that HNSW indexing keeps nearest-neighbor lookup under 50ms at millions of rows. Managed alternatives like Pinecone or Upstash Vector are good, but adding another billing line, another API, and another backup process is a real cost. For solo developers and small SaaS teams, pgvector first is the pragmatic default. You can always migrate later if your cache grows past tens of millions of rows, but in practice most semantic caches stabilize around a few hundred thousand entries — far inside pgvector's sweet spot.
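As a rough sketch of what the pgvector side could look like, the SQL below is held in Python constants so it can be run from an application or a migration script. The table name, column names, and index name are assumptions for illustration. The `hnsw` index type, the `vector_cosine_ops` operator class, and the `<=>` cosine-distance operator come from pgvector itself (HNSW requires pgvector 0.5.0 or later), and the 768 dimensions match the embedding model discussed in this guide.

```python
# Hypothetical schema: names are illustrative, not taken from the article.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS semantic_cache (
    id         bigserial PRIMARY KEY,
    query      text NOT NULL,
    response   text NOT NULL,
    embedding  vector(768) NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX IF NOT EXISTS semantic_cache_embedding_hnsw
    ON semantic_cache USING hnsw (embedding vector_cosine_ops);
"""

# <=> is cosine distance, so similarity = 1 - distance.
# %(q)s is a named placeholder in the style accepted by psycopg.
LOOKUP_SQL = """
SELECT response, 1 - (embedding <=> %(q)s::vector) AS similarity
FROM semantic_cache
ORDER BY embedding <=> %(q)s::vector
LIMIT 1;
"""


def to_vector_literal(values: list[float]) -> str:
    """Format a Python list as pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


print(to_vector_literal([0.1, 0.2, 1]))  # prints [0.1,0.2,1.0]
```

Ordering by the distance expression itself, rather than by the computed `similarity` alias, is what lets Postgres use the HNSW index for the top-1 lookup.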

For the embedding model, I pair this with Google's text-embedding-004. It gives you 768-dimensional vectors across many languages for a fraction of a cent per thousand tokens. Keeping embedding cost low matters: we'll see later how getting this wrong can make the "cost savings" negative. Alternative models like OpenAI's text-embedding-3-small or locally hosted Gemma variants work too, but the key criteria are throughput and price rather than top-of-benchmark recall. You don't need leaderboard-topping retrieval quality to distinguish "cancel my subscription" from "show my invoice"; you need predictable latency and a billing curve that doesn't punish you for caching.
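To see why embedding price matters, note that every query pays for an embedding whether it hits or misses, while only hits avoid the LLM call. A minimal sketch of the break-even arithmetic follows; it ignores storage and database costs, and the embedding prices in the example are hypothetical.

```python
def net_saving_per_query(llm_cost: float, embed_cost: float, hit_rate: float) -> float:
    """Average saving per query versus calling the LLM every time.

    Without cache: llm_cost
    With cache:    embed_cost + (1 - hit_rate) * llm_cost
    Difference:    hit_rate * llm_cost - embed_cost
    """
    return hit_rate * llm_cost - embed_cost


def break_even_hit_rate(llm_cost: float, embed_cost: float) -> float:
    """Hit rate at which the cache neither saves nor loses money."""
    if llm_cost <= 0:
        raise ValueError("llm_cost must be positive")
    return embed_cost / llm_cost


# Hypothetical prices per query, for illustration only.
cheap = net_saving_per_query(llm_cost=0.001, embed_cost=0.00001, hit_rate=0.30)
pricey = net_saving_per_query(llm_cost=0.001, embed_cost=0.0005, hit_rate=0.30)
print(f"cheap embedding:  {cheap:+.5f} per query")   # positive: cache pays off
print(f"pricey embedding: {pricey:+.5f} per query")  # negative: cache loses money
print(f"break-even hit rate: {break_even_hit_rate(0.001, 0.0005):.0%}")
```

With an embedding that costs half as much as the LLM call, the cache only starts paying for itself above a 50% hit rate.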

A design choice that matters more than people realize is whether to embed queries synchronously or asynchronously. Synchronous embedding (compute before storing) is simpler and avoids race conditions but adds to write latency. Asynchronous embedding (store the query and response first, embed in a background job) saves a few hundred milliseconds on write but creates a window where a query exists in the cache but can't be matched, which undermines hit rate on bursty workloads. For most production systems, synchronous wins. Revisit this only if your embedding API becomes a bottleneck.
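The unmatched window is easier to see in a small simulation. This sketch uses an in-memory store and a manual `backfill` call in place of a real background job; all names are hypothetical and the embedding function is a toy.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Vector = list[float]


@dataclass
class Entry:
    query: str
    response: str
    embedding: Optional[Vector] = None  # None until the entry has been embedded


class CacheStore:
    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self.embed = embed
        self.entries: list[Entry] = []

    def write_sync(self, query: str, response: str) -> None:
        """Embed first, then store: the entry is matchable as soon as it exists."""
        self.entries.append(Entry(query, response, self.embed(query)))

    def write_async(self, query: str, response: str) -> None:
        """Store immediately and leave the embedding for a background job."""
        self.entries.append(Entry(query, response))

    def backfill(self) -> int:
        """What the background job would do: embed every pending entry."""
        pending = [e for e in self.entries if e.embedding is None]
        for entry in pending:
            entry.embedding = self.embed(entry.query)
        return len(pending)

    def matchable(self) -> int:
        """Entries a similarity search could currently return."""
        return sum(e.embedding is not None for e in self.entries)


store = CacheStore(embed=lambda q: [float(len(q)), 1.0])  # toy embedding

store.write_async("How do I cancel my subscription?", "Go to Settings.")
during_window = store.matchable()  # stored but invisible to lookups
filled = store.backfill()
after_backfill = store.matchable()

store.write_sync("Where is my invoice?", "See Billing.")
after_sync = store.matchable()
print(during_window, filled, after_backfill, after_sync)
```

Between `write_async` and `backfill`, a burst of similar queries would all miss and all pay for a fresh LLM call, which is the hit-rate cost described above.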

What you'll learn

  • You can cut ballooning LLM costs by tuning similarity thresholds against real hit-rate data instead of guessing.
  • You'll get a production-grade pgvector + FastAPI + Gemini implementation that Antigravity can maintain without losing context.
  • You'll learn how to avoid the four production pitfalls (stale data, PII leakage, cross-lingual false hits, and cost inversion) before they bite.