AI Tools / 2026-03-30 / Advanced

Antigravity × Custom AI Chatbot Pipeline — Building Production-Grade Assistants with RAG, Function Calling, and Streaming UI

Learn how to build a production-grade AI chatbot by integrating RAG, Function Calling, and Streaming UI with Antigravity — from architecture design to Cloudflare Workers deployment.



I'm an indie developer running the four Dolice Labs sites in parallel, so let me get straight to it. I've shipped apps solo since 2014 and crossed 50M cumulative downloads, and what stands out to me about this stack is that observability and AdMob revenue both have to hold up at the same time.

Setup and context — Why Custom AI Chatbots Matter in 2026

In 2026, AI chatbots have evolved beyond simple question-answering tools into intelligent assistants deeply integrated into business workflows. Generic solutions like ChatGPT or Gemini can't always handle domain-specific knowledge, connect to proprietary systems, or stream responses in real time within custom applications. The demand for tailored AI assistants that combine these capabilities is growing rapidly.

This guide walks you through building a production-grade AI chatbot using Antigravity's AI agent capabilities, integrating three core technologies:

RAG (Retrieval-Augmented Generation): Searches your own documents and databases to improve answer accuracy and reduce hallucinations
Function Calling: Dynamically connects to external APIs and databases to fetch real-time information or perform actions
Streaming UI: Displays token-by-token responses in real time, dramatically improving perceived response speed

This article is aimed at engineers with experience building AI applications, assuming familiarity with TypeScript, Next.js, and vector databases. If you'd like to learn RAG fundamentals first, check out our Antigravity RAG Pipeline Guide.

Architecture Overview — A Three-Layer Design

The chatbot architecture is organized into three distinct layers, each handling a specific concern.

Presentation Layer (Streaming UI)

This is the frontend layer responsible for user interactions. Using the Vercel AI SDK's useChat hook, it implements Server-Sent Events (SSE) based streaming. As tokens are generated, they're reflected in the UI in real time, giving users an impression of near-instant responses.

Orchestration Layer (Function Calling Router)

This middleware layer mediates between the AI model and external tools. It analyzes user intent, selects the appropriate tool (function), and executes it. By combining multiple tools — weather lookups, database queries, external API calls — you can dramatically extend the AI's capabilities.

Knowledge Layer (RAG Pipeline)

This layer enhances answer accuracy through a knowledge base. Documents are split into chunks, converted to vector embeddings, and stored in a vector database. When a user asks a question, semantically similar documents are retrieved and passed as context to the LLM, significantly reducing hallucinations.

// Conceptual architecture structure
// Presentation Layer → Orchestration Layer → Knowledge Layer
 
interface ChatbotArchitecture {
  // Streaming UI Layer
  presentation: {
    framework: "Next.js App Router";
    streaming: "Vercel AI SDK useChat";
    transport: "Server-Sent Events (SSE)";
  };
  // Function Calling Router
  orchestration: {
    model: "Gemini 2.5 Pro" | "Claude 4 Sonnet";
    tools: ToolDefinition[];
    router: "AI-driven tool selection";
  };
  // RAG Pipeline
  knowledge: {
    embedding: "text-embedding-004"; // the Workers AI code later in this article uses @cf/baai/bge-large-en-v1.5 instead
    vectorDB: "Cloudflare Vectorize" | "Pinecone";
    chunking: "semantic-splitting";
  };
}

Building the RAG Pipeline

Document Preprocessing and Chunk Splitting

The most critical factor in RAG pipeline quality is the document chunking strategy. Rather than simple character-count splitting, we adopt "semantic chunking" that splits at meaningful boundaries.

Describe your project structure to Antigravity's agent, then have it generate a chunking module along these lines:

// src/lib/rag/chunker.ts
// Semantic chunking for document preprocessing
 
interface DocumentChunk {
  id: string;
  content: string;
  metadata: {
    source: string;
    section: string;
    position: number;
    tokenCount: number;
  };
  embedding?: number[];
}
 
const CHUNK_CONFIG = {
  maxTokens: 512,        // Maximum tokens per chunk
  overlapTokens: 64,     // Overlap between chunks
  minTokens: 100,        // Minimum tokens (merging short chunks is not implemented in this snippet)
} as const;
 
export function splitDocumentSemantically(
  text: string,
  source: string
): DocumentChunk[] {
  // Primary split at section boundaries (headings)
  const sections = text.split(/(?=^#{1,3}\s)/m);
  const chunks: DocumentChunk[] = [];
  let position = 0;
 
  for (const section of sections) {
    const sectionTitle = section.match(/^#{1,3}\s(.+)/)?.[1] ?? "untitled";
    const paragraphs = section.split(/\n\n+/);
 
    let currentChunk = "";
    for (const para of paragraphs) {
      const combined = currentChunk ? `${currentChunk}\n\n${para}` : para;
      const tokenEstimate = Math.ceil(combined.length / 4);
 
      if (tokenEstimate > CHUNK_CONFIG.maxTokens && currentChunk) {
        chunks.push({
          id: `${source}-${position}`,
          content: currentChunk.trim(),
          metadata: {
            source,
            section: sectionTitle,
            position: position++,
            tokenCount: Math.ceil(currentChunk.length / 4),
          },
        });
        // Overlap: include tail of previous chunk at start of next
        const overlapText = currentChunk.slice(
          -(CHUNK_CONFIG.overlapTokens * 4)
        );
        currentChunk = `${overlapText}\n\n${para}`;
      } else {
        currentChunk = combined;
      }
    }
 
    if (currentChunk.trim()) {
      chunks.push({
        id: `${source}-${position}`,
        content: currentChunk.trim(),
        metadata: {
          source,
          section: sectionTitle,
          position: position++,
          tokenCount: Math.ceil(currentChunk.length / 4),
        },
      });
    }
  }
 
  return chunks;
}
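
One gap worth noting in the chunker above: a single paragraph longer than maxTokens is never split, so it is emitted as one oversized chunk. Here is a minimal sketch of one way to handle that, assuming sentence boundaries are acceptable cut points. The helper name splitOversizedParagraph and the APPROX_CHARS_PER_TOKEN constant are illustrative and not part of the original module.

```typescript
// Hypothetical helper, not part of the original chunker.ts.
// Splits one oversized paragraph into sentence-aligned pieces.
const APPROX_CHARS_PER_TOKEN = 4; // same rough estimate the chunker uses

function splitOversizedParagraph(para: string, maxTokens: number): string[] {
  const maxChars = maxTokens * APPROX_CHARS_PER_TOKEN;
  if (para.length <= maxChars) return [para];

  // Split after sentence-ending punctuation that is followed by whitespace.
  const sentences = para.split(/(?<=[.!?])\s+/);
  const pieces: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length > maxChars && current) {
      pieces.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) pieces.push(current);

  // A single sentence can still exceed the limit; hard-cut it as a last resort.
  return pieces.flatMap((piece) => {
    if (piece.length <= maxChars) return [piece];
    const cuts: string[] = [];
    for (let i = 0; i < piece.length; i += maxChars) {
      cuts.push(piece.slice(i, i + maxChars));
    }
    return cuts;
  });
}
```

One option is to run each paragraph through this helper before the accumulation loop, so the loop only ever sees pieces that fit the limit.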

Vector Embeddings and Index Building

Once documents are chunked, we convert them into vector embeddings and store them in a vector database. This example uses Cloudflare Vectorize, but the same pattern applies to Pinecone or Weaviate.

// src/lib/rag/embedder.ts
// Vector embedding generation and index storage
 
interface EmbeddingResult {
  chunkId: string;
  vector: number[];
  dimensions: number;
}
 
export async function generateEmbeddings(
  chunks: DocumentChunk[],
  env: { AI: Ai }
): Promise<EmbeddingResult[]> {
  // Use Cloudflare Workers AI embedding model
  const batchSize = 50; // Match API batch limits
  const results: EmbeddingResult[] = [];
 
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const texts = batch.map((c) => c.content);
 
    // @cf/baai/bge-large-en-v1.5 returns 1024-dimensional embeddings
    // (the 768-dimensional variant is @cf/baai/bge-base-en-v1.5)
    const response = await env.AI.run(
      "@cf/baai/bge-large-en-v1.5",
      { text: texts }
    );
 
    // Expected output: { data: Array<number[]> }
    for (let j = 0; j < batch.length; j++) {
      results.push({
        chunkId: batch[j].id,
        vector: response.data[j],
        dimensions: response.data[j].length, // read from the vector instead of hard-coding
      });
    }
  }
 
  return results;
}
 
export async function indexToVectorize(
  chunks: DocumentChunk[],
  embeddings: EmbeddingResult[],
  env: { VECTORIZE: VectorizeIndex }
): Promise<void> {
  const vectors = chunks.map((chunk, i) => ({
    id: chunk.id,
    values: embeddings[i].vector,
    metadata: {
      content: chunk.content,
      source: chunk.metadata.source,
      section: chunk.metadata.section,
    },
  }));
 
  // Vectorize batch limit is 1000 items
  const insertBatchSize = 1000;
  for (let i = 0; i < vectors.length; i += insertBatchSize) {
    await env.VECTORIZE.upsert(vectors.slice(i, i + insertBatchSize));
  }
}

Query Optimization — Hybrid Search

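Simply passing the user's raw question to vector search doesn't always produce accurate results. To improve retrieval quality, we combine query rewriting with hybrid search (vector search + keyword search). The retriever below implements the query-rewriting and vector-search stages; the keyword stage depends on your data store and is not shown.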

// src/lib/rag/retriever.ts
// Hybrid search for context retrieval
 
interface RetrievalResult {
  content: string;
  score: number;
  source: string;
  section: string;
}
 
export async function retrieveContext(
  query: string,
  env: { AI: Ai; VECTORIZE: VectorizeIndex },
  options: { topK?: number; scoreThreshold?: number } = {}
): Promise<RetrievalResult[]> {
  const { topK = 5, scoreThreshold = 0.7 } = options;
 
  // Step 1: Rewrite query for optimal search performance
  const rewrittenQuery = await rewriteQuery(query, env);
 
  // Step 2: Vector search
  const queryEmbedding = await env.AI.run(
    "@cf/baai/bge-large-en-v1.5",
    { text: [rewrittenQuery] }
  );
 
  const vectorResults = await env.VECTORIZE.query(
    queryEmbedding.data[0],
    {
      topK: topK * 2, // Fetch extra before score filtering
      returnMetadata: "all",
    }
  );
 
  // Step 3: Filter by score threshold
  const filtered = vectorResults.matches
    .filter((m) => m.score >= scoreThreshold)
    .slice(0, topK)
    .map((m) => ({
      content: m.metadata?.content as string,
      score: m.score,
      source: m.metadata?.source as string,
      section: m.metadata?.section as string,
    }));
 
  return filtered;
}
 
async function rewriteQuery(
  originalQuery: string,
  env: { AI: Ai }
): Promise<string> {
  // Use LLM to optimize the query for search
  const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      {
        role: "system",
        content:
          "Rewrite the user question as a search query optimized for semantic search. Return only the rewritten query, nothing else.",
      },
      { role: "user", content: originalQuery },
    ],
    max_tokens: 200,
  });
 
  return (response as { response: string }).response || originalQuery;
}
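
The retriever above covers query rewriting and vector search only. If you also run a keyword search, the two ranked lists need to be merged. The sketch below uses reciprocal rank fusion (RRF), a common rank-merging technique that I am swapping in for illustration; the article does not say which fusion method the author uses. The function name, the RankedHit shape, and the idea that the keyword list comes from a separate full-text query are all assumptions.

```typescript
interface RankedHit {
  id: string; // chunk id, e.g. the `${source}-${position}` ids produced by the chunker
}

// Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank),
// where rank starts at 1. k = 60 is a commonly used default.
function reciprocalRankFusion(
  rankedLists: RankedHit[][],
  k = 60
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();

  for (const list of rankedLists) {
    list.forEach((hit, index) => {
      const rank = index + 1;
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + 1 / (k + rank));
    });
  }

  // Highest fused score first.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Example: a chunk that appears in both lists rises to the top.
const vectorHits: RankedHit[] = [{ id: "faq-0" }, { id: "faq-1" }, { id: "faq-2" }];
const keywordHits: RankedHit[] = [{ id: "faq-1" }, { id: "faq-3" }];
const fused = reciprocalRankFusion([vectorHits, keywordHits]);
```

RRF works on ranks rather than raw scores, so it sidesteps the problem that vector similarity and keyword relevance scores are not on the same scale.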

Function Calling — Dynamic External Tool Integration

Tool Definitions and Schema Design

Function Calling allows your AI chatbot to dynamically interact with external APIs and databases. The key is writing clear, detailed tool definitions — the AI model reads these to decide which tools to use and when. The schema patterns from our Zod Schema-Driven Development Guide are directly applicable here.

// src/lib/tools/definitions.ts
// Tool definitions for AI to use
 
import { z } from "zod";
 
export const toolDefinitions = {
  // Product search tool
  searchProducts: {
    description:
      "Search for products in the catalog by name, category, or price range. Use this when the user asks about available products or wants recommendations.",
    parameters: z.object({
      query: z.string().describe("Search query for product name or description"),
      category: z.string().optional().describe("Product category filter"),
      minPrice: z.number().optional().describe("Minimum price in USD"),
      maxPrice: z.number().optional().describe("Maximum price in USD"),
      limit: z.number().default(5).describe("Maximum number of results"),
    }),
  },
 
  // Inventory check tool
  checkInventory: {
    description:
      "Check the current inventory status for a specific product. Use this when the user asks if a product is in stock.",
    parameters: z.object({
      productId: z.string().describe("The product ID to check"),
    }),
  },
 
  // Order creation tool
  createOrder: {
    description:
      "Create a new order for the user. Only use this after confirming the product and quantity with the user.",
    parameters: z.object({
      productId: z.string().describe("The product ID to order"),
      quantity: z.number().min(1).describe("Quantity to order"),
      shippingAddress: z.string().describe("Delivery address"),
    }),
  },
 
  // Knowledge base search (integrates with RAG)
  searchKnowledgeBase: {
    description:
      "Search the internal knowledge base for answers to user questions about policies, procedures, or product documentation.",
    parameters: z.object({
      query: z.string().describe("The user's question to search for"),
    }),
  },
};

Tool Execution Engine

Beyond defining tools, you need a runtime that actually executes them. Structuring each tool's return value consistently helps the AI interpret results correctly.

// src/lib/tools/executor.ts
// Tool execution engine
 
import type { toolDefinitions } from "./definitions";
import { retrieveContext } from "../rag/retriever";
 
type ToolName = keyof typeof toolDefinitions;
 
interface ToolResult {
  success: boolean;
  data: unknown;
  error?: string;
}
 
export async function executeTool(
  toolName: ToolName,
  args: Record<string, unknown>,
  env: { DB: D1Database; VECTORIZE: VectorizeIndex; AI: Ai }
): Promise<ToolResult> {
  try {
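    // handleCheckInventory and handleCreateOrder follow the same pattern as
    // handleSearchProducts and are omitted from this listing.
    switch (toolName) {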
      case "searchProducts":
        return await handleSearchProducts(args, env);
      case "checkInventory":
        return await handleCheckInventory(args, env);
      case "createOrder":
        return await handleCreateOrder(args, env);
      case "searchKnowledgeBase":
        return await handleSearchKnowledgeBase(args, env);
      default:
        return {
          success: false,
          data: null,
          error: `Unknown tool: ${toolName}`,
        };
    }
  } catch (error) {
    return {
      success: false,
      data: null,
      error: error instanceof Error ? error.message : "Unknown error",
    };
  }
}
 
async function handleSearchProducts(
  args: Record<string, unknown>,
  env: { DB: D1Database }
): Promise<ToolResult> {
  const { query, category, minPrice, maxPrice, limit = 5 } = args as {
    query: string;
    category?: string;
    minPrice?: number;
    maxPrice?: number;
    limit?: number;
  };
 
  let sql = "SELECT * FROM products WHERE name LIKE ?1";
  const params: unknown[] = [`%${query}%`];
  let paramIndex = 2;
 
  if (category) {
    sql += ` AND category = ?${paramIndex}`;
    params.push(category);
    paramIndex++;
  }
  if (minPrice !== undefined) {
    sql += ` AND price >= ?${paramIndex}`;
    params.push(minPrice);
    paramIndex++;
  }
  if (maxPrice !== undefined) {
    sql += ` AND price <= ?${paramIndex}`;
    params.push(maxPrice);
    paramIndex++;
  }
 
  sql += ` LIMIT ?${paramIndex}`;
  params.push(limit);
 
  const result = await env.DB.prepare(sql).bind(...params).all();
 
  // Expected output: { results: Array<Product>, count: number }
  return { success: true, data: { results: result.results, count: result.results.length } };
}
 
async function handleSearchKnowledgeBase(
  args: Record<string, unknown>,
  env: { VECTORIZE: VectorizeIndex; AI: Ai }
): Promise<ToolResult> {
  const { query } = args as { query: string };
  const results = await retrieveContext(query, env, { topK: 3 });
 
  return {
    success: true,
    data: {
      results: results.map((r) => ({
        content: r.content,
        source: r.source,
        relevanceScore: r.score,
      })),
    },
  };
}
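
Two details in handleSearchProducts are easy to miss: % and _ inside the user's query act as LIKE wildcards, and limit comes from the model with no upper bound. Below is a minimal sketch of one way to tighten both, assuming a cap of 20 rows is reasonable for your catalog. The names escapeLike, clampLimit, buildProductSearch, and MAX_LIMIT are hypothetical, and the category and price filters are left out for brevity.

```typescript
const MAX_LIMIT = 20; // assumed cap, tune for your catalog

// Escape LIKE wildcards (and the escape character itself) so user text matches literally.
function escapeLike(input: string): string {
  return input.replace(/[\\%_]/g, (ch) => `\\${ch}`);
}

// Coerce a model-supplied limit into the range 1..MAX_LIMIT.
function clampLimit(limit: unknown, fallback = 5): number {
  const n =
    typeof limit === "number" && Number.isFinite(limit)
      ? Math.floor(limit)
      : fallback;
  return Math.min(Math.max(n, 1), MAX_LIMIT);
}

// Builds the SQL text and bound parameters; pass them to prepare(sql).bind(...params).
function buildProductSearch(
  query: string,
  limit?: unknown
): { sql: string; params: unknown[] } {
  return {
    // ESCAPE '\' tells SQLite which character introduces a literal % or _.
    sql: "SELECT * FROM products WHERE name LIKE ?1 ESCAPE '\\' LIMIT ?2",
    params: [`%${escapeLike(query)}%`, clampLimit(limit)],
  };
}
```

Values are still passed as bound parameters, as in the original handler, so this adds literal matching and a row cap without changing how the query is executed.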

Streaming UI — Real-Time Response Implementation

Backend: Streaming API Route

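Using the Vercel AI SDK (ai package), we implement an SSE-based streaming API. This API route integrates both Function Calling and RAG. The code uses the AI SDK 4.x API (parameters, maxSteps, toDataStreamResponse); AI SDK 5 renamed these, so pin your version or adapt the names.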

// src/app/api/chat/route.ts
// Streaming chat API route
 
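import { streamText } from "ai";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
import { z } from "zod";
import { retrieveContext } from "../../../lib/rag/retriever";
import { executeTool } from "../../../lib/tools/executor";
 
const google = createGoogleGenerativeAI({
  apiKey: process.env.GOOGLE_AI_API_KEY,
});
 
// NOTE: `env` below stands for the Cloudflare bindings (AI, VECTORIZE, DB).
// This snippet does not define it; obtain it from whichever Cloudflare
// adapter your Next.js deployment uses.
export async function POST(req: Request) {
  const { messages } = await req.json();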
 
  // RAG context retrieval using the latest user message
  const lastUserMessage = messages
    .filter((m: { role: string }) => m.role === "user")
    .pop();
  const ragContext = lastUserMessage
    ? await retrieveContext(lastUserMessage.content, env)
    : [];
 
  // Inject RAG context into system prompt
  const systemPrompt = buildSystemPrompt(ragContext);
 
  const result = streamText({
    model: google("gemini-2.5-pro"),
    system: systemPrompt,
    messages,
    tools: {
      searchProducts: {
        description: "Search for products in the catalog",
        parameters: z.object({
          query: z.string(),
          category: z.string().optional(),
          limit: z.number().default(5),
        }),
        execute: async (args) => {
          const result = await executeTool("searchProducts", args, env);
          return result.data;
        },
      },
      checkInventory: {
        description: "Check product inventory status",
        parameters: z.object({
          productId: z.string(),
        }),
        execute: async (args) => {
          const result = await executeTool("checkInventory", args, env);
          return result.data;
        },
      },
    },
    maxSteps: 5, // Limit on chained tool calls
    onFinish: async ({ usage }) => {
      // Log token usage for cost tracking
      console.log(
        `Tokens used: ${usage.promptTokens} prompt, ${usage.completionTokens} completion`
      );
    },
  });
 
  return result.toDataStreamResponse();
}
 
function buildSystemPrompt(ragContext: RetrievalResult[]): string {
  const basePrompt = `You are an AI assistant that answers questions about products.
Provide polite and accurate responses.`;
 
  if (ragContext.length === 0) return basePrompt;
 
  const contextSection = ragContext
    .map((r) => `[Source: ${r.source}]\n${r.content}`)
    .join("\n\n---\n\n");
 
  return `${basePrompt}
 
Below are relevant excerpts from our documentation. Prioritize this information in your answers:
 
${contextSection}
 
Important: If the information isn't found in the documents, honestly say "I wasn't able to confirm that."`;
}

Frontend: Real-Time Chat UI

On the frontend, we use the useChat hook to render streaming responses in real time. Intermediate tool call states are also displayed, so users can see which tools the assistant is calling and when each one finishes.

// src/components/ChatInterface.tsx
// Streaming chat UI component
 
"use client";
 
import { useChat } from "@ai-sdk/react";
import { useRef, useEffect } from "react";
 
export function ChatInterface() {
  const {
    messages,
    input,
    handleInputChange,
    handleSubmit,
    isLoading,
    error,
  } = useChat({
    api: "/api/chat",
    onError: (err) => {
      console.error("Chat error:", err);
    },
  });
 
  const messagesEndRef = useRef<HTMLDivElement>(null);
 
  // Auto-scroll when new messages are added
  useEffect(() => {
    messagesEndRef.current?.scrollIntoView({ behavior: "smooth" });
  }, [messages]);
 
  return (
    <div className="flex flex-col h-screen max-w-3xl mx-auto">
      {/* Message list */}
      <div className="flex-1 overflow-y-auto p-4 space-y-4">
        {messages.map((message) => (
          <div
            key={message.id}
            className={`flex ${
              message.role === "user" ? "justify-end" : "justify-start"
            }`}
          >
            <div
              className={`max-w-[80%] rounded-2xl px-4 py-3 ${
                message.role === "user"
                  ? "bg-blue-600 text-white"
                  : "bg-gray-100 dark:bg-gray-800 text-gray-900 dark:text-gray-100"
              }`}
            >
              {/* Display intermediate tool call states */}
              {message.toolInvocations?.map((tool, i) => (
                <div
                  key={i}
                  className="text-sm opacity-75 mb-2 border-l-2 border-blue-400 pl-2"
                >
                  <span className="font-mono">
                    {tool.toolName}
                  </span>
                  {tool.state === "result" && (
                    <span className="ml-2 text-green-600">Done</span>
                  )}
                </div>
              ))}
              <div className="whitespace-pre-wrap">{message.content}</div>
            </div>
          </div>
        ))}
 
        {/* Loading indicator */}
        {isLoading && (
          <div className="flex justify-start">
            <div className="bg-gray-100 dark:bg-gray-800 rounded-2xl px-4 py-3">
              <div className="flex space-x-1">
                <div className="w-2 h-2 bg-gray-400 rounded-full animate-bounce" />
                <div className="w-2 h-2 bg-gray-400 rounded-full animate-bounce delay-100" />
                <div className="w-2 h-2 bg-gray-400 rounded-full animate-bounce delay-200" />
              </div>
            </div>
          </div>
        )}
 
        <div ref={messagesEndRef} />
      </div>
 
      {/* Error display */}
      {error && (
        <div className="mx-4 p-3 bg-red-50 dark:bg-red-900/20 text-red-600 rounded-lg text-sm">
          An error occurred: {error.message}
        </div>
      )}
 
      {/* Input form */}
      <form
        onSubmit={handleSubmit}
        className="border-t border-gray-200 dark:border-gray-700 p-4"
      >
        <div className="flex gap-2">
          <input
            value={input}
            onChange={handleInputChange}
            placeholder="Type your message..."
            className="flex-1 rounded-xl border border-gray-300 dark:border-gray-600 px-4 py-3 focus:outline-none focus:ring-2 focus:ring-blue-500 dark:bg-gray-800"
            disabled={isLoading}
          />
          <button
            type="submit"
            disabled={isLoading || !input.trim()}
            className="bg-blue-600 text-white rounded-xl px-6 py-3 font-medium hover:bg-blue-700 disabled:opacity-50 disabled:cursor-not-allowed transition-colors"
          >
            Send
          </button>
        </div>
      </form>
    </div>
  );
}

Conversation Memory Management — Context Window Optimization

As conversations grow longer, managing the context window (token limit) becomes critical. Sending the entire conversation history with every request causes costs to spike and eventually hits the token limit.

Sliding Window + Summary Strategy

The most effective balance between cost and quality is the "sliding window + summary" approach: keep the most recent N messages verbatim while replacing older messages with a compressed summary.

// src/lib/memory/conversation-manager.ts
// Conversation memory — sliding window + summary
 
interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}
 
interface ManagedConversation {
  summary: string | null;    // Summary of older messages
  recentMessages: Message[]; // Recent conversation history
  totalTokenEstimate: number;
}
 
const MEMORY_CONFIG = {
  maxRecentMessages: 20,      // Number of recent messages to keep
  maxTokenBudget: 8000,       // Token budget for context window (not enforced in this snippet)
  summaryTriggerCount: 15,    // Message count threshold for summarization (not used in this snippet)
} as const;
 
export async function manageConversation(
  allMessages: Message[],
  env: { AI: Ai }
): Promise<ManagedConversation> {
  // If message count is below threshold, return everything
  if (allMessages.length <= MEMORY_CONFIG.maxRecentMessages) {
    return {
      summary: null,
      recentMessages: allMessages,
      totalTokenEstimate: estimateTokens(allMessages),
    };
  }
 
  // Convert older messages into a summary
  const oldMessages = allMessages.slice(
    0,
    allMessages.length - MEMORY_CONFIG.maxRecentMessages
  );
  const recentMessages = allMessages.slice(
    allMessages.length - MEMORY_CONFIG.maxRecentMessages
  );
 
  const summary = await generateSummary(oldMessages, env);
 
  return {
    summary,
    recentMessages,
    totalTokenEstimate: estimateTokens(recentMessages) + estimateTokens([
      { role: "system", content: summary },
    ]),
  };
}
 
async function generateSummary(
  messages: Message[],
  env: { AI: Ai }
): Promise<string> {
  const conversationText = messages
    .map((m) => `${m.role}: ${m.content}`)
    .join("\n");
 
  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      {
        role: "system",
        content:
          "Summarize the following conversation concisely. Preserve important details like user names, order contents, and question context. Keep it under 200 words.",
      },
      { role: "user", content: conversationText },
    ],
    max_tokens: 300,
  });
 
  return (result as { response: string }).response;
}
 
function estimateTokens(messages: Message[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}
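
The manager above cuts the window by message count and never reads maxTokenBudget. A minimal sketch of a token-based cut, assuming the same 4-characters-per-token estimate: walk backwards from the newest message and stop once the budget is spent. The names ChatMessage, approxTokens, and selectRecentByTokenBudget are illustrative and not from the original module.

```typescript
interface ChatMessage {
  role: "user" | "assistant" | "system";
  content: string;
}

// Same rough heuristic as estimateTokens: about 4 characters per token.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the newest messages that fit in the budget; everything older is
// returned as `dropped` so it can be summarized instead.
function selectRecentByTokenBudget(
  messages: ChatMessage[],
  maxTokenBudget: number
): { kept: ChatMessage[]; dropped: ChatMessage[] } {
  let used = 0;
  let start = messages.length;

  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = approxTokens(messages[i].content);
    // Always keep the newest message, even if it alone exceeds the budget.
    if (used + cost > maxTokenBudget && start < messages.length) break;
    used += cost;
    start = i;
  }

  return { kept: messages.slice(start), dropped: messages.slice(0, start) };
}
```

The dropped messages could then be passed to generateSummary in place of the fixed slice, so one very long message cannot push the request over the budget.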

Production Design Patterns

Rate Limiting and Cost Management

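In production, abuse prevention and cost management are essential. We implement rate limiting with Cloudflare Workers KV and per-user token budgets. KV is eventually consistent and the read-then-write below is not atomic, so treat these limits as approximate; if you need strict counters, Durable Objects are a better fit.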

// src/lib/rate-limiter.ts
// Token-based rate limiting
 
interface RateLimitConfig {
  maxRequestsPerMinute: number;
  maxTokensPerDay: number;
}
 
const RATE_LIMITS: Record<string, RateLimitConfig> = {
  free: { maxRequestsPerMinute: 5, maxTokensPerDay: 10000 },
  pro: { maxRequestsPerMinute: 30, maxTokensPerDay: 100000 },
  enterprise: { maxRequestsPerMinute: 100, maxTokensPerDay: 1000000 },
};
 
export async function checkRateLimit(
  userId: string,
  tier: keyof typeof RATE_LIMITS,
  env: { RATE_LIMIT_KV: KVNamespace }
): Promise<{ allowed: boolean; remaining: number; resetAt: number }> {
  const config = RATE_LIMITS[tier];
  const minuteKey = `rate:${userId}:${Math.floor(Date.now() / 60000)}`;
  const dayKey = `tokens:${userId}:${new Date().toISOString().slice(0, 10)}`;
 
  // Check per-minute request count
  const currentCount = parseInt(
    (await env.RATE_LIMIT_KV.get(minuteKey)) || "0"
  );
  if (currentCount >= config.maxRequestsPerMinute) {
    return {
      allowed: false,
      remaining: 0,
      resetAt: (Math.floor(Date.now() / 60000) + 1) * 60000,
    };
  }
 
  // Check daily token usage
  const dailyTokens = parseInt(
    (await env.RATE_LIMIT_KV.get(dayKey)) || "0"
  );
  if (dailyTokens >= config.maxTokensPerDay) {
    return {
      allowed: false,
      remaining: 0,
      resetAt: new Date(new Date().toISOString().slice(0, 10) + "T00:00:00Z")
        .getTime() + 86400000,
    };
  }
 
  // Update counter
  await env.RATE_LIMIT_KV.put(minuteKey, String(currentCount + 1), {
    expirationTtl: 120,
  });
 
  return {
    allowed: true,
    remaining: config.maxRequestsPerMinute - currentCount - 1,
    resetAt: (Math.floor(Date.now() / 60000) + 1) * 60000,
  };
}
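
checkRateLimit reads the daily token counter, but nothing in the listing writes to it. Here is a minimal sketch of the missing half, assuming you call it once per finished request (for example from onFinish with usage.promptTokens + usage.completionTokens). The names KVLike, dailyTokenKey, and recordTokenUsage are hypothetical; KVLike is a hand-written subset of the KV binding so the sketch stands on its own.

```typescript
// Minimal subset of the Workers KV API used here.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(
    key: string,
    value: string,
    options?: { expirationTtl?: number }
  ): Promise<void>;
}

// Same key format that checkRateLimit reads: tokens:<userId>:<YYYY-MM-DD> (UTC).
function dailyTokenKey(userId: string, now: Date = new Date()): string {
  return `tokens:${userId}:${now.toISOString().slice(0, 10)}`;
}

// Adds one finished request's token count to the user's daily total.
async function recordTokenUsage(
  userId: string,
  tokensUsed: number,
  kv: KVLike,
  now: Date = new Date()
): Promise<number> {
  const key = dailyTokenKey(userId, now);
  const current = parseInt((await kv.get(key)) ?? "0", 10);
  const updated = current + Math.max(0, Math.floor(tokensUsed));
  // Keep the key for two days so it outlives the UTC day it belongs to.
  await kv.put(key, String(updated), { expirationTtl: 172800 });
  return updated;
}
```

Like the minute counter, this is a read-modify-write, so concurrent requests from the same user can undercount slightly.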

Error Handling and Fallback Strategy

To handle AI model API outages and timeouts, we implement a multi-layer fallback strategy that automatically switches to backup models when the primary model is unresponsive.

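// src/lib/model-fallback.ts
// Multi-model fallback strategy
// (StreamParams, StreamResult, and getModel are project-specific and not shown)
 
import { streamText } from "ai";
 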
interface ModelConfig {
  provider: string;
  model: string;
  timeout: number;
  priority: number;
}
 
const MODEL_CHAIN: ModelConfig[] = [
  { provider: "google", model: "gemini-2.5-pro", timeout: 30000, priority: 1 },
  { provider: "google", model: "gemini-2.5-flash", timeout: 15000, priority: 2 },
  { provider: "anthropic", model: "claude-4-sonnet", timeout: 30000, priority: 3 },
];
 
export async function streamWithFallback(
  params: StreamParams
): Promise<StreamResult> {
  let lastError: Error | null = null;
 
  for (const config of MODEL_CHAIN) {
    try {
      const controller = new AbortController();
      const timeoutId = setTimeout(
        () => controller.abort(),
        config.timeout
      );
 
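      // Caveat: in AI SDK 4.x, streamText returns immediately and reports
      // provider errors while the stream is being read (see its onError
      // option), so this try/catch only sees errors thrown at call time.
      const result = await streamText({
        model: getModel(config.provider, config.model),
        ...params,
        abortSignal: controller.signal,
      });
 
      // Cleared right after the call, so the timer does not cover the time
      // spent waiting for the first token or for the stream to finish.
      clearTimeout(timeoutId);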
 
      // Log successful model response
      console.log(`Model ${config.model} responded successfully`);
      return result;
    } catch (error) {
      lastError = error instanceof Error ? error : new Error(String(error));
      console.warn(
        `Model ${config.model} failed, trying next: ${lastError.message}`
      );
      continue;
    }
  }
 
  throw new Error(
    `All models failed. Last error: ${lastError?.message}`
  );
}
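
With streaming, a stalled or failing provider often only shows up once you start reading the stream. One option is to wait for the first chunk before committing to a model, and fall back only while nothing has been sent to the client. The sketch below is SDK-independent: each candidate is just a function returning an AsyncIterable of text chunks. All names here (ChunkSource, Candidate, rejectAfter, streamWithFirstTokenFallback) are illustrative, not part of the AI SDK or the author's code.

```typescript
type ChunkSource = (signal: AbortSignal) => AsyncIterable<string>;

interface Candidate {
  name: string;
  firstTokenTimeoutMs: number;
  open: ChunkSource;
}

// A promise that rejects after `ms`, plus a way to cancel the timer.
function rejectAfter(ms: number, message: string) {
  let timer: ReturnType<typeof setTimeout>;
  const promise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(message)), ms);
  });
  return { promise, cancel: () => clearTimeout(timer) };
}

async function* streamWithFirstTokenFallback(
  candidates: Candidate[]
): AsyncGenerator<string> {
  let lastError: unknown = null;

  for (const candidate of candidates) {
    const controller = new AbortController();
    const iterator = candidate.open(controller.signal)[Symbol.asyncIterator]();
    const timeout = rejectAfter(
      candidate.firstTokenTimeoutMs,
      `${candidate.name}: first token timed out`
    );

    // Phase 1: race the first chunk against the timer. Nothing has been
    // emitted yet, so switching models here is invisible to the user.
    let first: IteratorResult<string> | null = null;
    try {
      first = await Promise.race([iterator.next(), timeout.promise]);
    } catch (error) {
      lastError = error;
      controller.abort();
    } finally {
      timeout.cancel();
    }
    if (first === null) continue;

    // Phase 2: committed to this model. Errors from here on propagate to
    // the caller instead of triggering a fallback mid-answer.
    let result: IteratorResult<string> = first;
    while (!result.done) {
      yield result.value;
      result = await iterator.next();
    }
    return;
  }

  const reason = lastError instanceof Error ? lastError.message : String(lastError);
  throw new Error(`All models failed. Last error: ${reason}`);
}
```

To use it with the AI SDK, a candidate's open function could return the text stream of a streamText call, with the signal passed as abortSignal. How errors surface on that stream varies by SDK version, so verify the behavior before relying on it. A separate time-to-completion limit would be a second timer around phase 2.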

Deploying to Cloudflare Workers

Finally, here's the deployment configuration for your chatbot on Cloudflare Workers. For a broader overview of building AI apps on Cloudflare Workers, see our Antigravity × Cloudflare Workers AI Edge App Guide. The wrangler.toml sets up bindings for Vectorize, D1, and KV.

# wrangler.toml — Cloudflare Workers deployment config
name = "ai-chatbot"
main = "src/worker.ts"
compatibility_date = "2026-03-01"
compatibility_flags = ["nodejs_compat"]
 
[ai]
binding = "AI"
 
[[vectorize]]
binding = "VECTORIZE"
index_name = "knowledge-base"
 
[[d1_databases]]
binding = "DB"
database_name = "chatbot-db"
database_id = "your-database-id"
 
[[kv_namespaces]]
binding = "RATE_LIMIT_KV"
id = "your-kv-id"

What the docs don't tell you — lessons from running this in production

Everything above is design and implementation. But once a chatbot runs in production for a while, you hit judgment calls that no documentation covers. Over a few months of embedding a small RAG chatbot into the support flow of my wallpaper and wellness apps, here are the lessons that mattered most.

A chunk size of around 512 tokens turned out to be the practical sweet spot

Official samples often use larger chunks (1,000–1,500 tokens), but for an app dominated by short FAQ-style questions, that actually hurt accuracy. A large chunk packs several topics into one embedding vector, blurring the search target.

Measuring top-3 recall across chunk sizes on my FAQ dataset (~1,200 entries) showed roughly this pattern:

1,200 tokens: ~72% recall
768 tokens: ~81% recall
512 tokens (overlap 64): ~88% recall
256 tokens: ~83% recall (context gets cut, so it drops again)

So "smaller is better" isn't the rule — the sweet spot is the smallest chunk that doesn't sever context. For my use case, 512 tokens with an overlap of 64 was consistently strong. This shifts with your content, so always measure once on your own data.
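
If you want to run the same kind of comparison on your own data, here is a minimal sketch of a recall@k harness. It is my reconstruction of the general idea, not the author's measurement script. It assumes each test question is labeled with the source document that should answer it, and that your retriever can return the sources of its top results. Matching on source rather than chunk id keeps the labels valid when the chunk size changes.

```typescript
interface EvalCase {
  question: string;
  expectedSource: string; // the document (or FAQ entry) that answers the question
}

// Returns the sources of the top results, best match first.
type SearchFn = (question: string, topK: number) => Promise<string[]>;

// Fraction of questions whose expected source appears in the top k results.
async function recallAtK(
  cases: EvalCase[],
  search: SearchFn,
  k = 3
): Promise<number> {
  if (cases.length === 0) return 0;

  let hits = 0;
  for (const c of cases) {
    const sources = await search(c.question, k);
    if (sources.slice(0, k).includes(c.expectedSource)) hits++;
  }
  return hits / cases.length;
}
```

Rebuild the index once per chunk size, run the same labeled questions through recallAtK each time, and compare the resulting fractions.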

Multi-model fallback wasn't insurance — it was a daily event

While building, I treated fallback as insurance against rare outages. But aggregating production logs, the share of requests that dropped to fallback from primary timeouts, rate limits, or transient 5xx reached 3–5% during peak hours. Even at tens of thousands of requests per month, that's not negligible.

What helped was routing the fallback to a faster, lighter model. When the high-end primary is congested, everything tends to be congested, so swapping to another same-tier model just stalls again. Escaping immediately to a flash-class lightweight model kept perceived latency far steadier.

Seven things to settle before going to production

Here is a checklist of items I wish I had handled from day one.

1. Make embedding regeneration explicit (regenerate only on document update, not every time; this changes cost by 10x)
2. Use two-stage streaming timeouts (time-to-first-token and time-to-completion)
3. Rate-limit by both user and IP (just one is trivially bypassed)
4. Always log fallback triggers and review the rate weekly
5. Manage the conversation memory cap by token count, not message count
6. Prepare user-facing failure copy (a plain "we're busy right now" reduced bounce)
7. Break down monthly cost into embedding, inference, and vector search

Item 1 had the biggest impact: switching from regenerating embeddings on every request to regenerating only on update cut embedding API cost by nearly an order of magnitude. Balancing observability and cost maps directly onto the years I've spent weighing ad revenue against operating cost in my app business — get this wrong and a small solo project tips into the red fast.
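
One way to enforce the "regenerate on update only" rule from item 1 is to store a content hash next to each document and skip anything whose hash has not changed. This is a minimal sketch under that assumption; the names SourceDoc, contentHash, and selectDocsToEmbed are hypothetical, and where you keep the stored hashes (D1, KV, or vector metadata) is up to you. It uses node:crypto, which Workers expose when the nodejs_compat flag is set.

```typescript
import { createHash } from "node:crypto";

interface SourceDoc {
  id: string;
  text: string;
}

// SHA-256 of the document text, hex-encoded.
function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Returns only the documents that are new or whose content changed
// since the hash recorded at the last embedding run.
function selectDocsToEmbed(
  docs: SourceDoc[],
  storedHashes: Map<string, string>
): SourceDoc[] {
  return docs.filter(
    (doc) => storedHashes.get(doc.id) !== contentHash(doc.text)
  );
}
```

After a successful upsert, write the new hashes back so the next run skips those documents.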

Summary

If you build just one thing next, start the RAG pipeline with a 512-token chunk size and an explicit "regenerate on update only" rule — that single decision shapes both your accuracy and your monthly cost more than any other choice here.

The key takeaways are: semantic chunking for RAG accuracy, Zod schema-based type-safe Function Calling, sliding window conversation memory management, and multi-model fallback for high availability. Combining these patterns gives you a practical, robust AI assistant ready for real-world use.
