Semantic Caching for LLM Responses in Antigravity — A Production Guide to Cutting Costs by 80% with Similarity-Based Reuse
A production guide to building a semantic LLM response cache with Antigravity, pgvector, and Gemini. Covers threshold tuning, production pitfalls, monitoring metrics, and runnable implementation code.
Have you ever stared at your LLM invoice and realized you're paying full price every time a user asks roughly the same question in slightly different words? I ran into exactly this problem when the Gemini bill for a small side-project chatbot quietly tripled over one month. Digging through the logs, the pattern was obvious: hundreds of semantically identical questions a day, each phrased just differently enough to miss my Redis key-value cache.
A traditional exact-match cache is blind to meaning. This guide walks through building a semantic cache — one that uses embeddings to match queries by intent — using Antigravity as your implementation partner, and taking it all the way to a version that survives production. The finished implementation is modest: pgvector, FastAPI, Gemini. What takes longer is the threshold tuning and the pitfalls you want to avoid. I've tried to document the ones I stumbled into personally, so you don't have to.
By the end, you'll have a working implementation you can paste into an existing FastAPI service, a measurement scaffold that makes threshold tuning data-driven rather than gut-feel, and a practical map of the production hazards that separate "works on my laptop" from "pays for itself three times over each month."
Why LLMs need meaning-based caching, not exact matching
Traditional caching treats lookups as string comparisons. HTTP response caches and Redis both key off URLs or query strings, returning the stored value only when the key matches exactly. This works beautifully for classical workloads — serving the same catalog page to a thousand users, caching the result of a database query — because the inputs are identical across requests. LLM users break this assumption by design, because natural language has enormous expressive redundancy.
Consider these four questions from a real support-bot log, all expressing the same intent:
"How do I cancel my subscription?"
"Where can I unsubscribe?"
"I want to stop billing."
"Can you help me quit?"
Four strings, almost no shared wording. An exact-match cache hits 0% of the time, even though a human support agent would reuse the same answer verbatim for all four. Run them through a text-embedding model, however, and their pairwise cosine similarities fall between 0.89 and 0.94. The premise of semantic caching is that if two queries sit close in embedding space, they can share a response. Under that premise, the first query pays for an LLM call and the other three can be served from cache, provided the threshold sits at or below their similarity to the stored entry.
Three concrete benefits come from this. First, obvious cost reduction: you stop paying for repeat inference. For a bot handling 10,000 queries a day at roughly $0.001 per Gemini Flash call, the baseline is about $300 per month, so an 80% hit rate saves around $240 per month, which is often enough to fund the Postgres instance and still leave change. Second, latency: LLM calls routinely take 800 to 2,000 milliseconds, while a vector similarity lookup plus a cache read lands under 50 milliseconds. That's not just "faster"; it's the difference between "feels like a chat" and "feels like a form submission." Third, and often overlooked, is response stability. LLMs have bad days, regional outages, and quality drift between model versions. Cache hits reproduce past good answers instead of gambling on fresh ones, which is a quiet but real improvement in consistency.
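To make the cost arithmetic concrete, here is a minimal sketch of the monthly-savings calculation. The function name is hypothetical, and the per-query embedding cost in the second call is an assumed, illustrative figure rather than a quoted price.

```python
def monthly_net_savings(
    queries_per_day: int,
    llm_cost_per_call: float,
    hit_rate: float,
    embed_cost_per_query: float = 0.0,
    days: int = 30,
) -> float:
    """Net monthly savings in USD when `hit_rate` of queries are served from cache."""
    llm_calls_avoided = queries_per_day * hit_rate * days
    gross_savings = llm_calls_avoided * llm_cost_per_call
    # Every query is embedded for the lookup, whether it hits or misses.
    embedding_overhead = queries_per_day * days * embed_cost_per_query
    return gross_savings - embedding_overhead


# The example above: 10,000 queries/day, ~$0.001 per call, 80% hit rate.
print(round(monthly_net_savings(10_000, 0.001, 0.80), 2))
# Same workload with an ASSUMED embedding cost of $0.00001 per query.
print(round(monthly_net_savings(10_000, 0.001, 0.80, embed_cost_per_query=0.00001), 2))
```

The first call reproduces the roughly $240 figure; the second shows how a per-query embedding charge eats into it, which becomes important when the hit rate is low.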
The trade-off, of course, is that "close enough in meaning" can shade into "different intent." A query asking "how do I cancel my account?" sits in the same neighborhood as "how do I cancel my last order?" in embedding space, and mis-routing between the two is a real product bug. Much of this article is about making that trade-off quantitative rather than hopeful.
One more framing matters before we go deeper. Semantic caching is not a silver bullet — it's best suited to workloads where the same intent is expressed many times with variation. Support bots, FAQ systems, onboarding assistants, and technical documentation search all fit well. Code generation, creative writing, and highly personalized chat do not. If your users rarely repeat intents, spend your engineering budget elsewhere.
Architecture — start minimal, grow deliberately
Before reaching for complexity, nail down the smallest viable version. A semantic cache boils down to five steps:
Receive a user query.
Convert it to an embedding vector.
Run top-1 similarity search in a vector store.
If similarity exceeds the threshold, return the cached response.
Otherwise, call the LLM and store the query-response pair with its embedding.
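The five steps above can be sketched in a few dozen lines of plain Python. This is a toy version under stated assumptions: a linear scan stands in for the vector store, and `embed` and `generate` are caller-supplied placeholders for the real embedding and LLM calls. The class and function names are illustrative only.

```python
import math
from typing import Callable


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemorySemanticCache:
    """Toy cache: a linear scan stands in for the vector store."""

    def __init__(
        self,
        embed: Callable[[str], list[float]],
        generate: Callable[[str], str],
        threshold: float = 0.92,
    ):
        self.embed = embed
        self.generate = generate
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def ask(self, query: str) -> tuple[str, bool]:
        """Return (response, served_from_cache). Step 1 is receiving `query`."""
        vec = self.embed(query)  # step 2: embed the query
        best_response, best_sim = None, -1.0
        for stored_vec, stored_response in self.entries:  # step 3: top-1 search
            sim = cosine_similarity(vec, stored_vec)
            if sim > best_sim:
                best_response, best_sim = stored_response, sim
        if best_response is not None and best_sim >= self.threshold:  # step 4: hit
            return best_response, True
        response = self.generate(query)  # step 5: miss, call the LLM and store
        self.entries.append((vec, response))
        return response, False
```

Swapping the linear scan for an indexed nearest-neighbor query is the only structural change the production version needs; the control flow stays the same.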
For the store, I recommend PostgreSQL with the pgvector extension. The reasoning is prosaic: you almost certainly already have a Postgres instance, and pgvector integrates cleanly with the app database you're already backing up, authenticating against, and monitoring. The pgvector RAG pipeline guide covers the basics, but the relevant point here is that HNSW indexing keeps nearest-neighbor lookup under 50ms at millions of rows. Managed alternatives like Pinecone or Upstash Vector are good, but adding another billing line, another API, and another backup process is a real cost. For solo developers and small SaaS teams, pgvector first is the pragmatic default. You can always migrate later if your cache grows past tens of millions of rows, but in practice most semantic caches stabilize around a few hundred thousand entries — far inside pgvector's sweet spot.
For the embedding model, I pair this with Google's text-embedding-004. It gives you 768-dimensional vectors across many languages for a fraction of a cent per thousand tokens. Keeping embedding cost low matters: we'll see later how getting this wrong can make the "cost savings" negative. Alternative models like OpenAI's text-embedding-3-small or locally hosted Gemma variants work too, but the key criterion is throughput and price rather than top-of-benchmark recall. You don't need leaderboard-topping retrieval quality to distinguish "cancel my subscription" from "show my invoice"; you need predictable latency and a billing curve that doesn't punish you for caching.
A design choice that matters more than people realize is whether to embed queries synchronously or asynchronously. Synchronous embedding (compute before storing) is simpler and avoids race conditions but adds to write latency. Asynchronous embedding (store the query and response first, embed in a background job) saves a few hundred milliseconds on write but creates a window where a query exists in the cache but can't be matched, which undermines hit rate on bursty workloads. For most production systems, synchronous wins. Revisit this only if your embedding API becomes a bottleneck.
Building the minimum viable version in 30 minutes with Antigravity
Let me walk through the actual implementation. I drive Antigravity in Manager mode, ask it to scaffold the skeleton, and then refine specifics by hand. The key to getting useful output is to pin the runtime assumptions up front: Python 3.11, FastAPI, the google-genai SDK, Postgres 16 with pgvector. Without these constraints, Antigravity tends to generate overly abstract code — adapter patterns, pluggable embeddings, "future-proof" interfaces — that looks impressive in review but adds friction for every real change.
Below is the SemanticCache class, scoped to be drop-in for a FastAPI support-bot. I've included expected output inline so you can verify it works.
```python
# semantic_cache.py: drop in to an existing FastAPI app
# Prereqs: PostgreSQL 16 + pgvector, google-genai SDK, asyncpg
# Expected: first call hits the LLM; subsequent similar calls return the
# cached answer at similarity >= 0.92
import asyncio
import os

import asyncpg
from google import genai
from google.genai import types

EMBEDDING_MODEL = "text-embedding-004"
GENERATION_MODEL = "gemini-2.5-flash"
SIMILARITY_THRESHOLD = 0.92  # tune inside 0.90-0.95 based on real data
CACHE_TTL_SECONDS = 60 * 60 * 24 * 7  # invalidate after 1 week

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])


def _to_pgvector(vec: list[float]) -> str:
    """Serialize to pgvector's text form.

    asyncpg has no built-in codec for the vector type, so the parameter is
    passed as text and cast with ::vector in SQL.
    """
    return "[" + ",".join(str(x) for x in vec) + "]"


class SemanticCache:
    def __init__(self, pool: asyncpg.Pool):
        self.pool = pool

    async def _embed(self, text: str) -> list[float]:
        """Fetch a 768-dim embedding via Gemini."""
        result = await asyncio.to_thread(
            client.models.embed_content,
            model=EMBEDDING_MODEL,
            contents=text,
            config=types.EmbedContentConfig(task_type="SEMANTIC_SIMILARITY"),
        )
        return result.embeddings[0].values

    async def lookup(
        self, query: str, vec: list[float] | None = None
    ) -> tuple[str, float] | None:
        """Search for a semantically similar cached response.

        Returns (response, similarity) on hit.
        """
        if vec is None:
            vec = await self._embed(query)
        async with self.pool.acquire() as conn:
            row = await conn.fetchrow(
                """
                SELECT response, 1 - (embedding <=> $1::vector) AS similarity
                FROM llm_cache
                WHERE created_at > NOW() - make_interval(secs => $2::double precision)
                ORDER BY embedding <=> $1::vector
                LIMIT 1
                """,
                _to_pgvector(vec),
                float(CACHE_TTL_SECONDS),
            )
        if row and row["similarity"] >= SIMILARITY_THRESHOLD:
            return (row["response"], row["similarity"])
        return None

    async def store(
        self, query: str, response: str, vec: list[float] | None = None
    ) -> None:
        """Persist query/response with its embedding.

        We embed synchronously here. Deferring it to a background job sounds
        appealing but opens a window where the entry exists and cannot be matched.
        """
        if vec is None:
            vec = await self._embed(query)
        async with self.pool.acquire() as conn:
            await conn.execute(
                "INSERT INTO llm_cache (query, embedding, response) "
                "VALUES ($1, $2::vector, $3)",
                query,
                _to_pgvector(vec),
                response,
            )

    async def ask(self, query: str) -> dict:
        """Public entry point. Serves from cache on a hit, else calls the LLM and stores."""
        vec = await self._embed(query)  # embed once, reuse for lookup and store
        hit = await self.lookup(query, vec)
        if hit:
            return {"response": hit[0], "cached": True, "similarity": hit[1]}
        gen = await asyncio.to_thread(
            client.models.generate_content,
            model=GENERATION_MODEL,
            contents=query,
        )
        text = gen.text
        await self.store(query, text, vec)
        return {"response": text, "cached": False, "similarity": None}
```
The matching schema looks like this. The HNSW index keeps lookup sub-50ms even at seven-figure row counts.
```sql
-- schema.sql: run once via psql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE llm_cache (
    id BIGSERIAL PRIMARY KEY,
    query TEXT NOT NULL,
    embedding vector(768) NOT NULL,
    response TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    hit_count INT DEFAULT 0
);

-- HNSW index for cosine distance (pgvector 0.5+)
CREATE INDEX llm_cache_embedding_idx
    ON llm_cache USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

CREATE INDEX llm_cache_created_at_idx ON llm_cache (created_at DESC);
```
A note on index parameters: m = 16 and ef_construction = 64 are pgvector's defaults and are sensible for cache-sized corpora (up to a few million rows). The m parameter is the maximum number of connections per node in each layer of the HNSW graph; higher values improve recall but slow builds. The ef_construction parameter is the size of the candidate list used while building the graph; higher values improve index quality at the cost of build time. Once your cache is populated, consider tuning ef_search at query time (via SET LOCAL hnsw.ef_search inside the lookup transaction; 40 is the default, and higher values trade latency for recall). It's the quality/latency knob you actually want to reach for in production, not the index-time parameters.
That's the minimum viable version. Send two semantically similar curl requests and the second one returns cached: true with a similarity somewhere in the 0.93–0.97 range.
Tuning the similarity threshold — balancing hit rate and false positives
This is where most production deployments stand or fall. Lowering the threshold raises hit rate but also raises the risk of serving an answer intended for a slightly different intent — a false hit. Raising it is safer but eats into savings.
My rule of thumb is to start at 0.92 and let real data push it somewhere between 0.90 and 0.95. "Rule of thumb" alone is not a production methodology, though. What actually matters is instrumenting from day one so you can compute the right threshold later. Teams that ship semantic caching without logging usually end up re-shipping it three months later when they realize they can't defend the threshold choice in a postmortem.
The following logger records every lookup decision, including the top similarity score, so you can mine it afterwards.
```python
# similarity_logger.py: log every cache decision for later analysis
# Assumes a cache_decisions table with the columns named in the INSERT below.
import asyncpg


async def log_decision(
    pool: asyncpg.Pool,
    query: str,
    top_similarity: float | None,
    used_cache: bool,
    response_preview: str,
) -> None:
    """Record every threshold-gated decision so we can back out the right cut-off."""
    async with pool.acquire() as conn:
        await conn.execute(
            """
            INSERT INTO cache_decisions
                (query, top_similarity, used_cache, response_preview, created_at)
            VALUES ($1, $2, $3, $4, NOW())
            """,
            query,
            top_similarity,
            used_cache,
            response_preview[:200],
        )


# Run after a week of traffic (directly in psql):
# -- Histogram of top-similarity bands, split by cache hit status
# SELECT
#   width_bucket(top_similarity, 0.80, 1.00, 20) AS bucket,
#   COUNT(*) AS total,
#   COUNT(*) FILTER (WHERE used_cache) AS cache_hits
# FROM cache_decisions
# WHERE created_at > NOW() - INTERVAL '7 days'
#   AND top_similarity IS NOT NULL
# GROUP BY bucket ORDER BY bucket;
```
What you typically see in the 0.88–0.92 band is a cluster of near-misses — queries that share intent but didn't quite clear the bar. Sample them randomly, review a few dozen manually, and pick the lowest threshold where your false-hit rate stays under 1%. Avoid handing this step to the agent entirely. The judgment of whether a cached response is "good enough" for a mis-matched query is product design, not tooling, and it needs a human signoff.
One more heuristic I've found useful: set the threshold 0.01 tighter than your data suggests. That 0.01 of headroom absorbs noise from embedding model updates (they do happen) and from distribution shift as your user base changes. A cache tuned to the knife-edge will surface false hits the first time your product evolves.
Invalidation and privacy — trade-offs to settle at design time
Semantic caches have invalidation problems that don't exist for plain KV caches. They also capture user utterances, which introduces privacy concerns the moment you turn the system on.
Three invalidation strategies are worth knowing. Time-based TTLs — shown above as one week — work well for stable documentation but poorly for volatile data like pricing or inventory. Version-based invalidation adds a cache_version column that flushes the whole cache whenever you change the prompt template, system instructions, or model. Tag-based invalidation segments the cache by tenant or category so you can purge a slice without losing the rest. Most real deployments use a combination: short TTL for volatile topics, version-gated everything, tenant tags for isolation.
On the privacy side, assume user input may contain PII — email addresses, phone numbers, internal identifiers, account numbers — and design for it. A lightweight regex pre-filter that skips PII-bearing queries is the minimum bar. If you serve multiple tenants, go further and scope every cache entry to a tenant ID so tenant A's query can't hit tenant B's cached response. The Antigravity cost-optimization guide touches on this briefly, but the general rule is that cost savings and safety always trade against each other — you pay for isolation in hit rate.
A subtler privacy question is whether to store the raw query text at all. My default is yes, because debugging is impossible without it. But if your compliance requirements forbid persisting user input, you can store only the embedding and a hash of the query, losing debuggability but preserving the functional cache. Whichever way you go, make it a deliberate decision and write it down — someone on your future team will thank you.
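```python
# privacy_filter.py: skip PII-bearing queries before they land in the cache
import re

PII_PATTERNS = [
    re.compile(r"[\w\.-]+@[\w\.-]+\.\w+"),  # email
    re.compile(r"\b\d{3}-\d{4}-\d{4}\b"),  # phone-like
    re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"),  # card number-like
]


def contains_pii(text: str) -> bool:
    return any(p.search(text) for p in PII_PATTERNS)


# Usage before storing. `store_unless_pii` is a hypothetical helper name;
# `cache` is a SemanticCache instance and this runs inside an async handler.
async def store_unless_pii(cache, query: str, response: str) -> bool:
    """Persist the pair only when the query looks PII-free. Returns True if stored."""
    if contains_pii(query):
        # The LLM was still called; we just don't persist the query/response pair
        return False
    await cache.store(query, response)
    return True
```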
For a broader take on observability in LLM systems — audit logs, anomaly detection, the full monitoring picture — LLMOps production monitoring is the companion read I'd point you to.
Pitfalls — four production failures I actually caused
This section is where I admit to the mistakes that cost me sleep. With luck, you can skip each of them at my expense.
Pitfall 1: Not including prompt template or system instructions in the cache key.
One day I updated the bot's tone from "formal" to "friendly." New answers used the new tone, but the cache kept handing out the old ones. The lookup was keyed purely on the query embedding, which had no knowledge of the system prompt. The fix is a cache_version column that stores a SHA-256 hash of the system prompt, template, model name, and temperature, with the lookup matching only against rows that carry the current version. Change any of these and the existing cache entries become invisible to lookups until the cache is repopulated. Worth noting: don't store the full prompt, just the hash. Prompts can contain customer-specific context, and replicating them across the cache table becomes a compliance liability.
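A minimal sketch of computing that version hash follows. The function name and the JSON serialization are assumptions of this example; any stable, unambiguous encoding of the four inputs would do.

```python
import hashlib
import json


def compute_cache_version(
    system_prompt: str,
    template: str,
    model: str,
    temperature: float,
) -> str:
    """SHA-256 over everything that shapes the response besides the query itself."""
    payload = json.dumps(
        {
            "system_prompt": system_prompt,
            "template": template,
            "model": model,
            "temperature": temperature,
        },
        sort_keys=True,  # stable key order so the hash is reproducible
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Compute it once at startup, write it into cache_version on every insert, and add an equality filter on it to the lookup query. Only the 64-character digest is stored, never the prompt text.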
Pitfall 2: Caching time-sensitive responses.
"What's today's exchange rate?" caches terribly. Returning yesterday's answer today damages trust fast. Two ways to handle it: either use a lightweight classifier (regex or a small LLM call) to detect temporal queries and skip the cache for them, or aggressively shorten the TTL for any query that mentions "today," "latest," "current," and similar markers. I went with the classifier route — it has fewer edge cases and handles phrasing variety better than regex. A small Gemini Flash call costs less than a cent per thousand classifications and catches "at the moment," "right now," "as we speak," and all the other temporal phrasings regex never quite nails.
Pitfall 3: Forgetting embedding cost inverts the savings for tiny queries.
This one surprised me. For short queries, the embedding API call can cost more than the LLM call it was trying to save. A 10-character question routed through text-embedding-004 costs a tiny fraction of a cent, but you pay it every time. On Gemini Flash with cache hits, the savings per hit can be smaller than the embedding charge on every lookup. The fix is a two-layer cache: first check an exact-match hash cache (free), then fall back to the semantic one on miss. A few dozen lines of code, and seven or eight out of ten queries in a typical support bot satisfy the first layer alone. The lesson generalizes: when savings are small, overhead dominates, and the solution is usually a hierarchy of cheaper-first lookups.
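Here is one way the two-layer lookup might look. It is a synchronous, in-process sketch: the class name, the lowercase-and-collapse-whitespace normalization, and the `semantic_lookup` callable (standing in for the embedding-backed cache) are all assumptions made for illustration.

```python
import hashlib
from typing import Callable, Optional


def exact_key(query: str) -> str:
    """Hash of the query after light normalization (case and whitespace only)."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class TwoLayerCache:
    """Exact-match layer in front of a semantic lookup that costs an embedding call."""

    def __init__(self, semantic_lookup: Callable[[str], Optional[str]]):
        self._exact: dict[str, str] = {}
        self._semantic_lookup = semantic_lookup
        self.exact_hits = 0
        self.semantic_lookups = 0  # each of these pays for one embedding

    def get(self, query: str) -> Optional[str]:
        key = exact_key(query)
        if key in self._exact:  # layer 1: no embedding needed
            self.exact_hits += 1
            return self._exact[key]
        self.semantic_lookups += 1  # layer 2: embedding + vector search
        response = self._semantic_lookup(query)
        if response is not None:
            # Promote so the next identical query is answered by layer 1.
            self._exact[key] = response
        return response

    def put(self, query: str, response: str) -> None:
        self._exact[exact_key(query)] = response
```

Tracking `exact_hits` against `semantic_lookups` tells you how much embedding spend the first layer is actually absorbing.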
Pitfall 4: Cross-lingual false hits in multilingual apps.
Multilingual embedding models place semantically equivalent sentences close together regardless of language. This is a feature for RAG, a bug for caching. An English query "How do I cancel?" can hit a Japanese cache entry and return a Japanese response to an English user. The fix is dumb but effective: include a detected language code in the cache key. The langdetect library does the job in a few milliseconds per query. Bonus: the same tag lets you serve different cached responses per language when your underlying knowledge base is regional — legal disclaimers and pricing often differ by locale in ways the LLM cannot know.
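The language scoping can be illustrated with a small in-memory sketch. The `CacheEntry` type and function names are hypothetical, and the language code is passed in already detected (for example by langdetect's `detect`); in the pgvector version this would correspond to an extra language column and an equality filter in the lookup query.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    lang: str  # detected language code stored alongside the embedding
    embedding: list[float]
    response: str


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup_same_language(
    entries: list[CacheEntry],
    query_vec: list[float],
    query_lang: str,
    threshold: float = 0.92,
) -> Optional[str]:
    """Top-1 match restricted to entries written in the query's language."""
    best, best_sim = None, -1.0
    for entry in entries:
        if entry.lang != query_lang:
            continue  # never cross languages, however close the vectors are
        sim = _cosine(query_vec, entry.embedding)
        if sim > best_sim:
            best, best_sim = entry, sim
    if best is not None and best_sim >= threshold:
        return best.response
    return None
```

Note that the Japanese entry in a test like this can be the closer vector and still be skipped, which is exactly the behavior you want for a cache.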
Testing and evaluation — building the quality safety net
One question that comes up as soon as you ship this to staging: how do you verify the cache isn't degrading quality? An untested semantic cache is essentially a random-response generator with a 1% failure rate, and that failure rate compounds in scary ways across user sessions.
The minimum viable evaluation approach is a golden dataset. Collect 100 to 300 real queries from your logs, hand-annotate the correct answer or accepted answer range, and re-run the dataset against your cache periodically. Compare the cached answer with the annotated one using exact match, a BLEU-like overlap score, or (most accurately) an LLM-as-judge prompt that grades whether the cached answer is substantively equivalent to the annotation. Run this nightly and alert on regressions.
A more ambitious approach is shadow testing. Route a sample of cache hits through a background task that also fires the real LLM call, then compare the cached and fresh responses. Every shadowed hit costs a full LLM call, so shadowing all of them would erase the cache's savings during the evaluation window; keep it to a sampled percentage (say, 1% of traffic). The payoff is a continuous signal about how often cached responses diverge from what the fresh LLM would say, which is the closest you can get to "is this cache still safe" without asking users.
For both approaches, the judgment layer matters. Exact-match is too strict because LLM output has natural variation; a character-by-character comparison will flag tolerable paraphrases as mismatches. Semantic comparison via another embedding call is closer but still coarse. The honest answer is that an LLM judge with a carefully written rubric works best, at the cost of another tokens-per-evaluation line item. Budget for this, because the alternative is learning about quality regressions from customer complaints.
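A golden-dataset run can be sketched with the judge left pluggable, so you can start with a cheap comparison and swap in an LLM judge later. The function name, the return shape, and the 1% default ceiling are assumptions of this example; `cache_answer` is expected to return None on a cache miss.

```python
from typing import Callable, Optional


def evaluate_golden_set(
    golden: list[tuple[str, str]],
    cache_answer: Callable[[str], Optional[str]],
    judge: Callable[[str, str], bool],
    max_failure_rate: float = 0.01,
) -> dict:
    """Grade cached answers against annotated ones.

    `golden` holds (query, expected_answer) pairs; `judge(expected, actual)`
    returns True when the cached answer is acceptable.
    """
    graded = 0
    failures: list[str] = []
    for query, expected in golden:
        actual = cache_answer(query)
        if actual is None:
            continue  # cache miss: nothing cached to grade
        graded += 1
        if not judge(expected, actual):
            failures.append(query)
    failure_rate = len(failures) / graded if graded else 0.0
    return {
        "graded": graded,
        "failures": failures,
        "failure_rate": failure_rate,
        "passed": failure_rate <= max_failure_rate,
    }
```

Wire the `passed` flag into whatever runs your nightly job, and keep the `failures` list: those queries are the ones worth reading by hand.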
Scaling considerations — from one instance to one thousand
The implementation above works for a single Postgres instance serving a single application. A few scaling notes for when you outgrow that.
First, the cache table will grow. At 10,000 queries per day and a 1-week TTL, you're looking at up to 70,000 rows plus embeddings (fewer in practice, since only misses are stored), roughly 250MB of storage. That's fine. At 1 million queries per day, you're looking at up to 7 million rows and around 25GB, which strains commodity HNSW indexes. Two mitigations: shorten the TTL (if most hits come in the first day, a 24-hour TTL may lose almost no value) and consider partitioning by tenant or by time window, pruning old partitions rather than doing per-row deletes.
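The storage estimates can be sanity-checked with a few lines of arithmetic. This sketch counts only the raw vector bytes, using pgvector's documented size of 4 bytes per dimension plus 8 bytes of overhead per vector; query text, responses, and the HNSW index come on top, and the row counts are the worst case where every query is a miss. The function name is illustrative.

```python
def vector_storage_bytes(rows: int, dims: int = 768) -> int:
    """Raw pgvector storage: 4 bytes per dimension plus 8 bytes per vector."""
    return rows * (4 * dims + 8)


TTL_DAYS = 7
for queries_per_day in (10_000, 1_000_000):
    rows = queries_per_day * TTL_DAYS  # upper bound: every query stored
    gigabytes = vector_storage_bytes(rows) / 1e9
    print(f"{rows:>9,} rows -> {gigabytes:.2f} GB of raw vectors")
```

That works out to about 0.22 GB and 21.56 GB of vectors respectively, which lines up with the rounded figures above once text and index overhead are added.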
Second, read-heavy cache workloads benefit from read replicas. Point the lookup call at a read replica, keep store on the primary, and you can scale horizontally without much fuss. Watch for replication lag — a write followed immediately by a read on the same query can miss the cache until the replica catches up. In practice this is rare enough (replicas typically lag under a second) that it's not worth engineering around.
Third, hot keys still matter. If 5% of your queries produce 60% of the hits, the lookup for those queries is your real bottleneck, not the index. Warm a small in-memory LRU cache in front of the pgvector lookup for the top hundred queries, refreshed every minute. This is boring, but it gets you the last 10x of latency improvement.
Finally, for genuinely large deployments — tens of millions of cache entries — consider migrating from pgvector to a purpose-built vector DB like Qdrant or Weaviate. The break-even in my experience is somewhere around 50 million vectors. Below that, pgvector's operational simplicity usually wins.
Monitoring and A/B testing — turning savings into numbers
Shipping isn't done until you can explain savings as a number to someone else. Four metrics I instrument on day one of every deployment:
Cache hit rate, broken down by time of day and tenant.
Similarity-score histogram so you can see whether your threshold sits in the right place.
Inferred cost savings, computed as hit count times average token cost avoided.
Response latency p50 and p95, split by cache hits versus misses.
Exporting these to Prometheus is a good place to let Antigravity do the boilerplate. Ask it to "add Prometheus metrics to this Python code" and it produces reasonable scaffolding. What you must specify yourself is what to measure. Left to its own devices, the agent tends to standardize on RED (Rate / Error / Duration) and drop domain-specific signals like hit rate and similarity distribution. This is the thread you'll find repeatedly: agents are excellent at boilerplate, weak at deciding which boilerplate is relevant.
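Before wiring up an exporter, it helps to pin down exactly what is being computed. The sketch below derives the domain-specific numbers from a list of decision records in plain Python; the `Decision` fields, the nearest-rank percentile, and the per-1k-token price argument are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    used_cache: bool
    latency_ms: float
    tokens_avoided: int = 0  # tokens the LLM call would have used, for hits


def percentile(values: list[float], pct: float) -> Optional[float]:
    """Nearest-rank percentile; None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(decisions: list[Decision], usd_per_1k_tokens: float) -> dict:
    """Hit rate, inferred savings, and latency percentiles split by hit/miss."""
    hits = [d for d in decisions if d.used_cache]
    misses = [d for d in decisions if not d.used_cache]
    hit_latencies = [d.latency_ms for d in hits]
    miss_latencies = [d.latency_ms for d in misses]
    return {
        "hit_rate": len(hits) / len(decisions) if decisions else 0.0,
        "inferred_savings_usd": sum(d.tokens_avoided for d in hits)
        / 1000
        * usd_per_1k_tokens,
        "hit_p50_ms": percentile(hit_latencies, 50),
        "hit_p95_ms": percentile(hit_latencies, 95),
        "miss_p50_ms": percentile(miss_latencies, 50),
        "miss_p95_ms": percentile(miss_latencies, 95),
    }
```

Each key in the returned dict maps onto one gauge or histogram in whatever metrics backend you use, which makes the "what to measure" part of the prompt to the agent explicit.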
For A/B testing, route users to cache-on versus cache-off groups by hashing their user ID, then run both for a week and compare user satisfaction (thumbs-up rate), re-ask rate (how often users re-ask within a minute), average response time, and daily cost. In my case, satisfaction stayed within ±1 point, cost fell 82%, and p50 latency dropped from 1.4 seconds to under 300ms. Those are the numbers that justify continued investment.
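Hash-based assignment is short enough to show in full. This is a sketch under stated assumptions: the function name, the salt string, and the percentage-bucket scheme are illustrative, and any stable hash of the user ID would serve.

```python
import hashlib


def assign_group(
    user_id: str,
    cache_off_percent: int = 50,
    salt: str = "semantic-cache-ab-v1",
) -> str:
    """Deterministically assign a user to 'cache_on' or 'cache_off'.

    The salt keeps this experiment's buckets independent of other experiments
    that hash the same user IDs.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return "cache_off" if bucket < cache_off_percent else "cache_on"
```

Because the assignment depends only on the user ID and the salt, a user stays in the same group across sessions and servers without any stored state. Setting `cache_off_percent=5` gives a small permanent control group.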
One caution on A/B testing: a week may not be enough if your traffic has weekly seasonality (enterprise bots are notorious for this — Monday mornings look nothing like Friday afternoons). Run the test for at least two full weeks if you can afford the wait, or run it continuously with a small sample (say, 5% cache-off) as a permanent canary.
One concrete next step
That's a lot of ground. Here's the single thing worth doing today. Sample 500 to 1,000 recent user queries from your current bot or support system, and count duplicates by eye. If more than 30% feel like "I've seen this question before," a semantic cache will almost certainly pay for itself. If fewer than 10% do, there's probably a higher-ROI target elsewhere: prompt compression, model downgrading, or batching.
Following the structure above, the minimum viable version takes an afternoon, and monitoring plus threshold tuning fits into a two- or three-day sprint. The real insight isn't that semantic caching is clever — it's that building it alongside the measurement scaffolding, not after, is what lets you iterate fast. Numbers for hit rate, similarity distribution, and false-hit rate, visible in a dashboard, turn the whole thing from a guess into a system you can actually improve.