Antigravity × Custom AI Chatbot Pipeline — Building Production-Grade Assistants with RAG, Function Calling, and Streaming UI
Learn how to build a production-grade AI chatbot by integrating RAG, Function Calling, and Streaming UI with Antigravity — from architecture design to Cloudflare Workers deployment.
Writing as an indie developer who runs the four Dolice Labs sites in parallel, let me get straight to it. Having shipped apps solo since 2014 and crossed 50M cumulative downloads, what stands out about this stack is that observability and AdMob revenue have to hold together at the same time.
Setup and context — Why Custom AI Chatbots Matter in 2026
In 2026, AI chatbots have evolved beyond simple question-answering tools into intelligent assistants deeply integrated into business workflows. Generic solutions like ChatGPT or Gemini can't always handle domain-specific knowledge, connect to proprietary systems, or stream responses in real time within custom applications. The demand for tailored AI assistants that combine these capabilities is growing rapidly.
This guide walks you through building a production-grade AI chatbot using Antigravity's AI agent capabilities, integrating three core technologies:
RAG (Retrieval-Augmented Generation): Searches your own documents and databases to improve answer accuracy and reduce hallucinations
Function Calling: Dynamically connects to external APIs and databases to fetch real-time information or perform actions
Streaming UI: Displays token-by-token responses in real time, dramatically improving perceived response speed
This article is aimed at engineers with experience building AI applications, assuming familiarity with TypeScript, Next.js, and vector databases. If you'd like to learn RAG fundamentals first, check out our Antigravity RAG Pipeline Guide.
Architecture Overview — A Three-Layer Design
The chatbot architecture is organized into three distinct layers, each handling a specific concern.
Presentation Layer (Streaming UI)
This is the frontend layer responsible for user interactions. Using the Vercel AI SDK's useChat hook, it implements Server-Sent Events (SSE) based streaming. As tokens are generated, they're reflected in the UI in real time, giving users an impression of near-instant responses.
Orchestration Layer (Function Calling Router)
This middleware layer mediates between the AI model and external tools. It analyzes user intent, selects the appropriate tool (function), and executes it. By combining multiple tools — weather lookups, database queries, external API calls — you can dramatically extend the AI's capabilities.
Knowledge Layer (RAG Pipeline)
This layer enhances answer accuracy through a knowledge base. Documents are split into chunks, converted to vector embeddings, and stored in a vector database. When a user asks a question, semantically similar documents are retrieved and passed as context to the LLM, significantly reducing hallucinations.
The most critical factor in RAG pipeline quality is the document chunking strategy. Rather than simple character-count splitting, we adopt "semantic chunking" that splits at meaningful boundaries.
Let Antigravity's agent generate the following chunking module after explaining your project structure:
```typescript
// src/lib/rag/chunker.ts
// Semantic chunking for document preprocessing

// Exported so other modules (e.g. the embedder) can reuse the type
export interface DocumentChunk {
  id: string;
  content: string;
  metadata: {
    source: string;
    section: string;
    position: number;
    tokenCount: number;
  };
  embedding?: number[];
}

const CHUNK_CONFIG = {
  maxTokens: 512, // Maximum tokens per chunk
  overlapTokens: 64, // Overlap between chunks
  minTokens: 100, // Intended minimum; merging undersized chunks is not implemented below
} as const;

export function splitDocumentSemantically(
  text: string,
  source: string
): DocumentChunk[] {
  // Primary split at section boundaries (headings)
  const sections = text.split(/(?=^#{1,3}\s)/m);
  const chunks: DocumentChunk[] = [];
  let position = 0;

  for (const section of sections) {
    const sectionTitle = section.match(/^#{1,3}\s(.+)/)?.[1] ?? "untitled";
    const paragraphs = section.split(/\n\n+/);
    let currentChunk = "";

    for (const para of paragraphs) {
      const combined = currentChunk ? `${currentChunk}\n\n${para}` : para;
      // Rough token estimate: about 4 characters per token
      const tokenEstimate = Math.ceil(combined.length / 4);

      if (tokenEstimate > CHUNK_CONFIG.maxTokens && currentChunk) {
        chunks.push({
          id: `${source}-${position}`,
          content: currentChunk.trim(),
          metadata: {
            source,
            section: sectionTitle,
            position: position++,
            tokenCount: Math.ceil(currentChunk.length / 4),
          },
        });
        // Overlap: include tail of previous chunk at start of next
        const overlapText = currentChunk.slice(
          -(CHUNK_CONFIG.overlapTokens * 4)
        );
        currentChunk = `${overlapText}\n\n${para}`;
      } else {
        currentChunk = combined;
      }
    }

    // Flush whatever remains of the section as its final chunk
    if (currentChunk.trim()) {
      chunks.push({
        id: `${source}-${position}`,
        content: currentChunk.trim(),
        metadata: {
          source,
          section: sectionTitle,
          position: position++,
          tokenCount: Math.ceil(currentChunk.length / 4),
        },
      });
    }
  }

  return chunks;
}
```
Vector Embeddings and Index Building
Once documents are chunked, we convert them into vector embeddings and store them in a vector database. This example uses Cloudflare Vectorize, but the same pattern applies to Pinecone or Weaviate.
```typescript
// src/lib/rag/embedder.ts
// Vector embedding generation and index storage

// Assumes chunker.ts exports the DocumentChunk interface
import type { DocumentChunk } from "./chunker";

interface EmbeddingResult {
  chunkId: string;
  vector: number[];
  dimensions: number;
}

export async function generateEmbeddings(
  chunks: DocumentChunk[],
  env: { AI: Ai }
): Promise<EmbeddingResult[]> {
  // Use Cloudflare Workers AI embedding model
  const batchSize = 50; // Match API batch limits
  const results: EmbeddingResult[] = [];

  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const texts = batch.map((c) => c.content);

    // @cf/baai/bge-large-en-v1.5 produces 1024-dimensional embeddings
    // (the 768-dimensional variant is @cf/baai/bge-base-en-v1.5)
    const response = await env.AI.run(
      "@cf/baai/bge-large-en-v1.5",
      { text: texts }
    );

    // Expected output: { data: Array<number[]> }
    for (let j = 0; j < batch.length; j++) {
      results.push({
        chunkId: batch[j].id,
        vector: response.data[j],
        // Read the dimension from the vector itself rather than hard-coding it
        dimensions: response.data[j].length,
      });
    }
  }

  return results;
}

export async function indexToVectorize(
  chunks: DocumentChunk[],
  embeddings: EmbeddingResult[],
  env: { VECTORIZE: VectorizeIndex }
): Promise<void> {
  // Assumes embeddings[i] corresponds to chunks[i]
  const vectors = chunks.map((chunk, i) => ({
    id: chunk.id,
    values: embeddings[i].vector,
    metadata: {
      content: chunk.content,
      source: chunk.metadata.source,
      section: chunk.metadata.section,
    },
  }));

  // Vectorize batch limit is 1000 items
  const insertBatchSize = 1000;
  for (let i = 0; i < vectors.length; i += insertBatchSize) {
    await env.VECTORIZE.upsert(vectors.slice(i, i + insertBatchSize));
  }
}
```
Query Optimization — Hybrid Search
Simply passing the user's raw question to vector search doesn't always produce accurate results. To improve retrieval quality, we combine query rewriting with hybrid search (vector search + keyword search). The module below covers the query-rewriting and vector-search stages.
```typescript
// src/lib/rag/retriever.ts
// Context retrieval: query rewriting + vector search
// (the keyword-search leg of hybrid search is not shown in this module)

export interface RetrievalResult {
  content: string;
  score: number;
  source: string;
  section: string;
}

export async function retrieveContext(
  query: string,
  env: { AI: Ai; VECTORIZE: VectorizeIndex },
  options: { topK?: number; scoreThreshold?: number } = {}
): Promise<RetrievalResult[]> {
  const { topK = 5, scoreThreshold = 0.7 } = options;

  // Step 1: Rewrite query for optimal search performance
  const rewrittenQuery = await rewriteQuery(query, env);

  // Step 2: Vector search (same embedding model as used for indexing)
  const queryEmbedding = await env.AI.run(
    "@cf/baai/bge-large-en-v1.5",
    { text: [rewrittenQuery] }
  );

  const vectorResults = await env.VECTORIZE.query(
    queryEmbedding.data[0],
    {
      topK: topK * 2, // Fetch extra before score filtering
      returnMetadata: "all",
    }
  );

  // Step 3: Filter by score threshold
  const filtered = vectorResults.matches
    .filter((m) => m.score >= scoreThreshold)
    .slice(0, topK)
    .map((m) => ({
      content: m.metadata?.content as string,
      score: m.score,
      source: m.metadata?.source as string,
      section: m.metadata?.section as string,
    }));

  return filtered;
}

async function rewriteQuery(
  originalQuery: string,
  env: { AI: Ai }
): Promise<string> {
  // Use LLM to optimize the query for search
  const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      {
        role: "system",
        content:
          "Rewrite the user question as a search query optimized for semantic search. Return only the rewritten query, nothing else.",
      },
      { role: "user", content: originalQuery },
    ],
    max_tokens: 200,
  });

  // Fall back to the original query if the model returns nothing
  return (response as { response: string }).response || originalQuery;
}
```
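The retriever above performs query rewriting and vector search only. One common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is an assumption about how the two result lists could be combined, not the article's own implementation: `fuseRankings`, `RankedHit`, and the default constant `k = 60` are illustrative choices.

```typescript
// Reciprocal rank fusion (RRF): merges several ranked lists of document IDs.
// Hypothetical helper; the names and the k = 60 default are illustrative.
interface RankedHit {
  id: string;
}

function fuseRankings(
  rankings: RankedHit[][],
  k = 60
): { id: string; score: number }[] {
  const scores = new Map<string, number>();

  for (const ranking of rankings) {
    ranking.forEach((hit, index) => {
      const rank = index + 1; // ranks are 1-based
      const previous = scores.get(hit.id) ?? 0;
      scores.set(hit.id, previous + 1 / (k + rank));
    });
  }

  // Highest fused score first
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Example: one list from vector search, one from keyword search
const vectorHits: RankedHit[] = [{ id: "a" }, { id: "b" }, { id: "c" }];
const keywordHits: RankedHit[] = [{ id: "b" }, { id: "d" }, { id: "a" }];
const fused = fuseRankings([vectorHits, keywordHits]);
```

Documents that appear in both lists accumulate score from each, so they tend to rise above documents that only one retriever found. Because RRF uses ranks rather than raw scores, it avoids having to normalize cosine similarity against a keyword score on a different scale.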
Function Calling — Dynamic External Tool Integration
Tool Definitions and Schema Design
Function Calling allows your AI chatbot to dynamically interact with external APIs and databases. The key is writing clear, detailed tool definitions — the AI model reads these to decide which tools to use and when. The schema patterns from our Zod Schema-Driven Development Guide are directly applicable here.
```typescript
// src/lib/tools/definitions.ts
// Tool definitions for AI to use
import { z } from "zod";

export const toolDefinitions = {
  // Product search tool
  searchProducts: {
    description:
      "Search for products in the catalog by name, category, or price range. Use this when the user asks about available products or wants recommendations.",
    parameters: z.object({
      query: z.string().describe("Search query for product name or description"),
      category: z.string().optional().describe("Product category filter"),
      minPrice: z.number().optional().describe("Minimum price in USD"),
      maxPrice: z.number().optional().describe("Maximum price in USD"),
      limit: z.number().default(5).describe("Maximum number of results"),
    }),
  },

  // Inventory check tool
  checkInventory: {
    description:
      "Check the current inventory status for a specific product. Use this when the user asks if a product is in stock.",
    parameters: z.object({
      productId: z.string().describe("The product ID to check"),
    }),
  },

  // Order creation tool
  createOrder: {
    description:
      "Create a new order for the user. Only use this after confirming the product and quantity with the user.",
    parameters: z.object({
      productId: z.string().describe("The product ID to order"),
      quantity: z.number().min(1).describe("Quantity to order"),
      shippingAddress: z.string().describe("Delivery address"),
    }),
  },

  // Knowledge base search (integrates with RAG)
  searchKnowledgeBase: {
    description:
      "Search the internal knowledge base for answers to user questions about policies, procedures, or product documentation.",
    parameters: z.object({
      query: z.string().describe("The user's question to search for"),
    }),
  },
};
```
Tool Execution Engine
Beyond defining tools, you need a runtime that actually executes them. Structuring each tool's return value consistently helps the AI interpret results correctly.
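A minimal sketch of such a runtime, assuming a registry of async handlers and a uniform result envelope. The `executeTool(name, args, env)` call shape matches how the API route in this article invokes it, but `createToolExecutor`, the `ToolResult` type, and the handler registry are assumptions made for illustration.

```typescript
// Uniform envelope so the model always sees the same result shape.
// Hypothetical types: the article does not show its actual definitions.
type ToolResult =
  | { success: true; data: unknown }
  | { success: false; data: null; error: string };

type ToolHandler<Env> = (
  args: Record<string, unknown>,
  env: Env
) => Promise<unknown>;

function createToolExecutor<Env>(handlers: Record<string, ToolHandler<Env>>) {
  return async function executeTool(
    name: string,
    args: Record<string, unknown>,
    env: Env
  ): Promise<ToolResult> {
    // Own-property check so names like "toString" are not treated as tools
    if (!Object.prototype.hasOwnProperty.call(handlers, name)) {
      return { success: false, data: null, error: `Unknown tool: ${name}` };
    }
    try {
      const data = await handlers[name](args, env);
      return { success: true, data };
    } catch (err) {
      // Convert thrown errors into the envelope instead of crashing the stream
      const message = err instanceof Error ? err.message : String(err);
      return { success: false, data: null, error: message };
    }
  };
}
```

Returning errors inside the envelope, rather than throwing, lets the model read the failure and explain it to the user or try a different tool.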
Using the Vercel AI SDK (ai package), we implement an SSE-based streaming API. This complete API route integrates both Function Calling and RAG.
```typescript
// src/app/api/chat/route.ts
// Streaming chat API route
// API names here (`parameters`, `maxSteps`, `toDataStreamResponse`) follow AI SDK v4
import { streamText } from "ai";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
import { z } from "zod";
// Assumes retriever.ts exports both retrieveContext and RetrievalResult
import {
  retrieveContext,
  type RetrievalResult,
} from "../../../lib/rag/retriever";
// Assumed path: the tool execution engine is not shown in this article
import { executeTool } from "../../../lib/tools/executor";

// NOTE: `env` below stands for the Cloudflare bindings (AI, VECTORIZE, ...).
// How you obtain it depends on your deployment adapter; it is not defined in this excerpt.

const google = createGoogleGenerativeAI({
  apiKey: process.env.GOOGLE_AI_API_KEY,
});

export async function POST(req: Request) {
  const { messages } = await req.json();

  // RAG context retrieval using the latest user message
  const lastUserMessage = messages
    .filter((m: { role: string }) => m.role === "user")
    .pop();

  const ragContext = lastUserMessage
    ? await retrieveContext(lastUserMessage.content, env)
    : [];

  // Inject RAG context into system prompt
  const systemPrompt = buildSystemPrompt(ragContext);

  const result = streamText({
    model: google("gemini-2.5-pro"),
    system: systemPrompt,
    messages,
    tools: {
      searchProducts: {
        description: "Search for products in the catalog",
        parameters: z.object({
          query: z.string(),
          category: z.string().optional(),
          limit: z.number().default(5),
        }),
        execute: async (args) => {
          const toolResult = await executeTool("searchProducts", args, env);
          return toolResult.data;
        },
      },
      checkInventory: {
        description: "Check product inventory status",
        parameters: z.object({
          productId: z.string(),
        }),
        execute: async (args) => {
          const toolResult = await executeTool("checkInventory", args, env);
          return toolResult.data;
        },
      },
    },
    maxSteps: 5, // Limit on chained tool calls
    onFinish: async ({ usage }) => {
      // Log token usage for cost tracking
      console.log(
        `Tokens used: ${usage.promptTokens} prompt, ${usage.completionTokens} completion`
      );
    },
  });

  return result.toDataStreamResponse();
}

function buildSystemPrompt(ragContext: RetrievalResult[]): string {
  const basePrompt = `You are an AI assistant that answers questions about products.
Provide polite and accurate responses.`;

  if (ragContext.length === 0) return basePrompt;

  const contextSection = ragContext
    .map((r) => `[Source: ${r.source}]\n${r.content}`)
    .join("\n\n---\n\n");

  return `${basePrompt}

Below are relevant excerpts from our documentation.
Prioritize this information in your answers:

${contextSection}

Important: If the information isn't found in the documents, honestly say "I wasn't able to confirm that."`;
}
```
Frontend: Real-Time Chat UI
On the frontend, we use the useChat hook to render streaming responses in real time. Intermediate tool call states are also displayed, making the AI's thought process visible to users.
As conversations grow longer, managing the context window (token limit) becomes critical. Sending the entire conversation history with every request causes costs to spike and eventually hits the token limit.
Sliding Window + Summary Strategy
The most effective balance between cost and quality is the "sliding window + summary" approach: keep the most recent N messages verbatim while replacing older messages with a compressed summary.
```typescript
// src/lib/memory/conversation-manager.ts
// Conversation memory: sliding window + summary

interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

interface ManagedConversation {
  summary: string | null; // Summary of older messages
  recentMessages: Message[]; // Recent conversation history
  totalTokenEstimate: number;
}

const MEMORY_CONFIG = {
  maxRecentMessages: 20, // Number of recent messages to keep
  maxTokenBudget: 8000, // Token budget for context window
  summaryTriggerCount: 15, // Message count threshold for summarization
} as const;

export async function manageConversation(
  allMessages: Message[],
  env: { AI: Ai }
): Promise<ManagedConversation> {
  // If message count is below threshold, return everything
  if (allMessages.length <= MEMORY_CONFIG.maxRecentMessages) {
    return {
      summary: null,
      recentMessages: allMessages,
      totalTokenEstimate: estimateTokens(allMessages),
    };
  }

  // Convert older messages into a summary
  const oldMessages = allMessages.slice(
    0,
    allMessages.length - MEMORY_CONFIG.maxRecentMessages
  );
  const recentMessages = allMessages.slice(
    allMessages.length - MEMORY_CONFIG.maxRecentMessages
  );

  const summary = await generateSummary(oldMessages, env);

  return {
    summary,
    recentMessages,
    totalTokenEstimate:
      estimateTokens(recentMessages) +
      estimateTokens([{ role: "system", content: summary }]),
  };
}

async function generateSummary(
  messages: Message[],
  env: { AI: Ai }
): Promise<string> {
  const conversationText = messages
    .map((m) => `${m.role}: ${m.content}`)
    .join("\n");

  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      {
        role: "system",
        content:
          "Summarize the following conversation concisely. Preserve important details like user names, order contents, and question context. Keep it under 200 words.",
      },
      { role: "user", content: conversationText },
    ],
    max_tokens: 300,
  });

  return (result as { response: string }).response;
}

// Rough estimate: about 4 characters per token
function estimateTokens(messages: Message[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}
```
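`MEMORY_CONFIG` declares a `maxTokenBudget`, yet the window above is cut by message count. A sketch of trimming by estimated tokens instead, using the same rough heuristic of about 4 characters per token. `trimToTokenBudget` and `ChatMessage` are hypothetical names, not part of the module above.

```typescript
interface ChatMessage {
  role: "user" | "assistant" | "system";
  content: string;
}

// Rough estimate: about 4 characters per token
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Walks backwards from the newest message and keeps messages
// until adding one more would exceed the budget.
function trimToTokenBudget(
  messages: ChatMessage[],
  maxTokens: number
): { kept: ChatMessage[]; dropped: ChatMessage[] } {
  let used = 0;
  let cut = messages.length; // index of the oldest message that still fits

  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = approxTokens(messages[i].content);
    if (used + cost > maxTokens) break;
    used += cost;
    cut = i;
  }

  return { kept: messages.slice(cut), dropped: messages.slice(0, cut) };
}
```

The `dropped` messages are the natural input for a summarization step, so a few very long messages cannot blow past the budget the way a fixed message count allows.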
Production Design Patterns
Rate Limiting and Cost Management
In production, abuse prevention and cost management are essential. We implement distributed rate limiting using Cloudflare Workers KV with per-user token budgets.
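A minimal sketch of a fixed-window limiter keyed on both user ID and IP address. The `KVLike` interface mirrors only the `get`/`put` subset of a KV namespace and is defined here so the example is self-contained; the key format, window size, and limits are assumptions. Because Workers KV is eventually consistent, counters stored this way are approximate rather than exact.

```typescript
// Minimal subset of a KV namespace, declared locally for the example
interface KVLike {
  get(key: string): Promise<string | null>;
  put(
    key: string,
    value: string,
    options?: { expirationTtl?: number }
  ): Promise<void>;
}

interface RateLimitConfig {
  windowSeconds: number; // length of each counting window
  maxRequests: number; // allowed requests per key per window
}

// Counts one request against `key`; returns false once the window is full.
async function hit(
  kv: KVLike,
  key: string,
  cfg: RateLimitConfig,
  nowMs: number
): Promise<boolean> {
  const windowIndex = Math.floor(nowMs / 1000 / cfg.windowSeconds);
  const bucketKey = `rl:${key}:${windowIndex}`; // hypothetical key format
  const count = Number((await kv.get(bucketKey)) ?? "0");
  if (count >= cfg.maxRequests) return false;
  // Let old buckets expire on their own after two windows
  await kv.put(bucketKey, String(count + 1), {
    expirationTtl: cfg.windowSeconds * 2,
  });
  return true;
}

// A request passes only if both the user bucket and the IP bucket have room.
async function allowRequest(
  kv: KVLike,
  userId: string,
  ip: string,
  cfg: RateLimitConfig,
  nowMs: number = Date.now()
): Promise<boolean> {
  if (!(await hit(kv, `user:${userId}`, cfg, nowMs))) return false;
  return hit(kv, `ip:${ip}`, cfg, nowMs);
}
```

The user bucket is checked first, so a user who is already blocked does not consume the shared IP quota. Per-user token budgets could follow the same pattern by adding the token count of each request instead of 1.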
To handle AI model API outages and timeouts, we implement a multi-layer fallback strategy that automatically switches to backup models when the primary model is unresponsive.
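The fallback path could be sketched as a small wrapper that races each model call against a timeout and moves to the next candidate on failure. `withFallback`, `Candidate`, and the timeout values are hypothetical, and plain async functions stand in for the actual model clients.

```typescript
// A model call that can be cancelled through an AbortSignal
type ModelCall<T> = (signal: AbortSignal) => Promise<T>;

interface Candidate<T> {
  name: string;
  timeoutMs: number;
  call: ModelCall<T>;
}

// Rejects if the call does not settle within candidate.timeoutMs
async function callWithTimeout<T>(candidate: Candidate<T>): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // signal the underlying request to stop
      reject(
        new Error(`${candidate.name} timed out after ${candidate.timeoutMs}ms`)
      );
    }, candidate.timeoutMs);
  });

  try {
    return await Promise.race([candidate.call(controller.signal), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Tries candidates in order; reports every fallback through onFallback
async function withFallback<T>(
  candidates: Candidate<T>[],
  onFallback?: (failedModel: string, error: unknown) => void
): Promise<{ model: string; value: T }> {
  let lastError: unknown = new Error("No candidates provided");

  for (const candidate of candidates) {
    try {
      const value = await callWithTimeout(candidate);
      return { model: candidate.name, value };
    } catch (error) {
      lastError = error;
      if (onFallback) onFallback(candidate.name, error);
    }
  }

  throw lastError;
}
```

Ordering the candidates from the primary model to a lighter one, with a short timeout on the primary, keeps a congested primary from stalling the whole request. The `onFallback` hook is a convenient place to log each trigger.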
Finally, here's the deployment configuration for your chatbot on Cloudflare Workers. For a broader overview of building AI apps on Cloudflare Workers, see our Antigravity × Cloudflare Workers AI Edge App Guide. The wrangler.toml sets up bindings for Vectorize, D1, and KV.
What the docs don't tell you — lessons from running this in production
Everything above is design and implementation. But once a chatbot runs in production for a while, you hit judgment calls that no documentation covers. Over a few months of embedding a small RAG chatbot into the support flow of my wallpaper and wellness apps, here are the lessons that mattered most.
A chunk size of around 512 tokens turned out to be the practical sweet spot
Official samples often use larger chunks (1,000–1,500 tokens), but for an app dominated by short FAQ-style questions, that actually hurt accuracy. A large chunk packs several topics into one embedding vector, blurring the search target.
Measuring top-3 recall across chunk sizes on my FAQ dataset (~1,200 entries) showed roughly this pattern:
1,200 tokens: ~72% recall
768 tokens: ~81% recall
512 tokens (overlap 64): ~88% recall
256 tokens: ~83% recall (context gets cut, so it drops again)
So "smaller is better" isn't the rule — the sweet spot is the smallest chunk that doesn't sever context. For my use case, 512 tokens with an overlap of 64 was consistently strong. This shifts with your content, so always measure once on your own data.
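Top-k recall in this sense can be computed as the share of test questions whose expected entry appears among the first k retrieved results. A sketch, assuming each evaluation case records a single expected document ID; `EvalCase` and `recallAtK` are illustrative names, and the sample data is made up purely to exercise the function.

```typescript
interface EvalCase {
  question: string;
  expectedId: string; // the entry that should be retrieved
  retrievedIds: string[]; // retriever output, best match first
}

// Fraction of cases whose expected ID appears in the top k results
function recallAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) =>
    c.retrievedIds.slice(0, k).some((id) => id === c.expectedId)
  ).length;
  return hits / cases.length;
}

// Made-up sample: three of the four expected IDs land in the top 3
const sampleCases: EvalCase[] = [
  { question: "q1", expectedId: "faq-1", retrievedIds: ["faq-1", "faq-9", "faq-4"] },
  { question: "q2", expectedId: "faq-2", retrievedIds: ["faq-7", "faq-2", "faq-3"] },
  { question: "q3", expectedId: "faq-3", retrievedIds: ["faq-8", "faq-6", "faq-3"] },
  { question: "q4", expectedId: "faq-4", retrievedIds: ["faq-5", "faq-6", "faq-7", "faq-4"] },
];
const top3Recall = recallAtK(sampleCases, 3);
```

Re-indexing the same documents at each chunk size and running one fixed question set through this function gives directly comparable numbers.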
Multi-model fallback wasn't insurance — it was a daily event
While building, I treated fallback as insurance against rare outages. But aggregating production logs, the share of requests that dropped to fallback from primary timeouts, rate limits, or transient 5xx reached 3–5% during peak hours. Even at tens of thousands of requests per month, that's not negligible.
What helped was routing the fallback to a faster, lighter model. When the high-end primary is congested, everything tends to be congested, so swapping to another same-tier model just stalls again. Escaping immediately to a flash-class lightweight model kept perceived latency far steadier.
Seven things to settle before going to production
Here is a checklist of items I wish I had handled from day one.
Make embedding regeneration explicit (regenerate only on document update, not every time — this changes cost by 10x)
Use two-stage streaming timeouts (time-to-first-token and time-to-completion)
Rate-limit by both user and IP (just one is trivially bypassed)
Always log fallback triggers and review the rate weekly
Manage the conversation memory cap by token count, not message count
Prepare user-facing failure copy (a plain "we're busy right now" reduced bounce)
Break down monthly cost into embedding, inference, and vector search
Item 1 had the biggest impact: switching from regenerating embeddings on every request to regenerating only on update cut embedding API cost by nearly an order of magnitude. Balancing observability and cost maps directly onto the years I've spent weighing ad revenue against operating cost in my app business — get this wrong and a small solo project tips into the red fast.
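One way to enforce "regenerate only on update" is to store a content hash next to each chunk and skip the embedding call when the hash is unchanged. The sketch below assumes an in-memory hash store and uses a non-cryptographic FNV-1a style hash to stay dependency-free; `selectChunksToEmbed`, `markEmbedded`, and `StoredHashes` are hypothetical, and a production version would likely use a cryptographic hash and persistent storage.

```typescript
interface HashableChunk {
  id: string;
  content: string;
}

// chunk id -> hash of the content that was last embedded
type StoredHashes = Map<string, string>;

// FNV-1a style 32-bit hash over UTF-16 code units (illustration only)
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

// Returns only the chunks whose content differs from what was last embedded
function selectChunksToEmbed(
  chunks: HashableChunk[],
  stored: StoredHashes
): HashableChunk[] {
  return chunks.filter((chunk) => stored.get(chunk.id) !== fnv1a(chunk.content));
}

// Call after the embedding request succeeds, so failures are retried next run
function markEmbedded(chunks: HashableChunk[], stored: StoredHashes): void {
  for (const chunk of chunks) {
    stored.set(chunk.id, fnv1a(chunk.content));
  }
}
```

Recording the hash only after a successful embedding call means a failed batch is picked up again on the next run instead of being silently skipped.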
Summary
If you build just one thing next, start the RAG pipeline with a 512-token chunk size and an explicit "regenerate on update only" rule — that single decision shapes both your accuracy and your monthly cost more than any other choice here.
The key takeaways are: semantic chunking for RAG accuracy, Zod schema-based type-safe Function Calling, sliding window conversation memory management, and multi-model fallback for high availability. Combining these patterns gives you a practical, robust AI assistant ready for real-world use.