Resilient AI Agents in Antigravity — Retry, Circuit Breakers, and Fallback Strategies for Production
Build fault-tolerant AI agents in Antigravity with production-grade retry strategies, circuit breakers, model fallback chains, and checkpoint recovery. Every pattern you need to keep agents running reliably.
Everything works fine when you're testing your Antigravity AI agents in development. Then you deploy to production, and things break in ways you never imagined. The LLM API returns a 429. Responses get cut off mid-stream. Token limits get hit and the agent halts in an inconsistent state.
I confronted this problem head-on three days after deploying a multi-agent pipeline to production. At 2 AM, Gemini API rate limits kicked in and five agents failed in a cascade. Recovery took four hours.
This guide shares every resilience pattern I built after that incident. Circuit breakers, retry strategies, model fallback, checkpoint design — everything you need to keep AI agents stable in production, with working code you can drop into your Antigravity project.
Designing Retry Strategies — Exponential Backoff and Jitter Done Right
AI agent API calls demand different retry strategies than traditional web services. LLM APIs take seconds to tens of seconds to respond, so naive retries cause wait times to explode. When multiple agents retry simultaneously, you get "retry storms" that make the API situation worse.
Why Fixed-Interval Retries Fail
Fixed-interval retries (say, 3 seconds every time) cause all agents to hit the API at the same instant during an outage. This is the classic "Thundering Herd" problem, and it actively delays API recovery.
Exponential Backoff with Jitter is the standard solution. You increase wait time exponentially while adding randomness to spread retries across time.
// Retry utility for Antigravity projects
// Wraps agent API calls with automatic retry logic
interface RetryConfig {
  maxRetries: number;        // Maximum retry attempts
  baseDelayMs: number;       // Initial wait time (milliseconds)
  maxDelayMs: number;        // Ceiling for the exponential delay (applied before jitter)
  jitterFactor: number;      // Jitter coefficient (0–1)
  retryableErrors: number[]; // HTTP status codes eligible for retry
}

const DEFAULT_CONFIG: RetryConfig = {
  maxRetries: 5,
  baseDelayMs: 1000,
  maxDelayMs: 60000,
  jitterFactor: 0.5,
  retryableErrors: [429, 500, 502, 503, 504],
};

async function withRetry<T>(
  operation: () => Promise<T>,
  config: Partial<RetryConfig> = {}
): Promise<T> {
  const cfg = { ...DEFAULT_CONFIG, ...config };
  let lastError: Error | null = null;

  for (let attempt = 0; attempt <= cfg.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error: unknown) {
      lastError = error instanceof Error ? error : new Error(String(error));

      // Non-retryable errors fail immediately.
      // Errors without a status (e.g. network failures) are retried.
      const status = (error as { status?: number } | null)?.status;
      if (status && !cfg.retryableErrors.includes(status)) {
        throw error;
      }
      if (attempt === cfg.maxRetries) {
        break; // Exhausted all retries
      }

      // Exponential backoff with jitter: the capped delay is spread by
      // ±(jitterFactor / 2), so jitterFactor 0.5 gives roughly ±25%
      const exponentialDelay = cfg.baseDelayMs * Math.pow(2, attempt);
      const cappedDelay = Math.min(exponentialDelay, cfg.maxDelayMs);
      const jitter = cappedDelay * cfg.jitterFactor * Math.random();
      const finalDelay = cappedDelay - (cappedDelay * cfg.jitterFactor / 2) + jitter;

      console.warn(
        `[Retry ${attempt + 1}/${cfg.maxRetries}] ` +
        `Waiting ${Math.round(finalDelay)}ms before retry. ` +
        `Error: ${lastError.message}`
      );
      await new Promise(resolve => setTimeout(resolve, finalDelay));
    }
  }

  throw new Error(
    `Operation failed after ${cfg.maxRetries} retries. ` +
    `Last error: ${lastError?.message}`
  );
}

// Usage: Apply retry logic to a Gemini API call
const prompt = 'Summarize the failing test output.'; // Placeholder prompt

const response = await withRetry(
  async () => {
    const res = await fetch(
      'https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent',
      {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'x-goog-api-key': process.env.GEMINI_API_KEY ?? '',
        },
        body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
      }
    );
    // fetch resolves even on HTTP errors, so throw with the status attached
    // to let withRetry decide whether the call is retryable
    if (!res.ok) {
      throw Object.assign(new Error(`Gemini API returned HTTP ${res.status}`), {
        status: res.status,
      });
    }
    return res;
  },
  { maxRetries: 3, baseDelayMs: 2000 }
);
// Expected: On 429, retries at ~2s → ~4s → ~8s intervals, returns response on success
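To see what the jitter formula in withRetry produces, you can compute the bounds for each attempt without any randomness. This is a minimal sketch: `delayBounds` and `BackoffParams` are hypothetical names of my own, and the helper assumes the same formula as above (capped exponential delay, spread by plus or minus half the jitter factor).

```typescript
// Hypothetical helper: min/max wait for a given retry attempt, assuming
// delay = capped * (1 - jitterFactor / 2) + capped * jitterFactor * random
interface BackoffParams {
  baseDelayMs: number;
  maxDelayMs: number;
  jitterFactor: number; // 0 to 1
}

function delayBounds(
  attempt: number,
  p: BackoffParams
): { minMs: number; maxMs: number } {
  const capped = Math.min(p.baseDelayMs * Math.pow(2, attempt), p.maxDelayMs);
  const spread = capped * p.jitterFactor;
  return { minMs: capped - spread / 2, maxMs: capped + spread / 2 };
}

const params: BackoffParams = { baseDelayMs: 2000, maxDelayMs: 60000, jitterFactor: 0.5 };
for (let attempt = 0; attempt < 3; attempt++) {
  const { minMs, maxMs } = delayBounds(attempt, params);
  console.log(`attempt ${attempt}: ${minMs}ms to ${maxMs}ms`);
}
// attempt 0: 1500ms to 2500ms
// attempt 1: 3000ms to 5000ms
// attempt 2: 6000ms to 10000ms
```

Note that the upper bound can exceed maxDelayMs, because the cap is applied before the jitter is added. If you need a hard ceiling, clamp the final value as well.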
Retries and Idempotency
Here's a subtlety that's easy to miss. Retrying means potentially executing the same operation multiple times. LLM queries themselves are safe to repeat, since they change no state even when the output varies. But when agents perform side effects (writing files, calling external APIs), retries can cause duplicate executions.
The solution is recording checkpoints before executing actions, then checking "was this already done?" on retry. We'll cover this in detail in a later section.
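As a rough illustration of the "was this already done?" check, here is a minimal sketch. `runOnce` and `completedActions` are hypothetical names, and the in-memory Map is only a stand-in for whatever durable store you use.

```typescript
// Hypothetical sketch: remember which side-effecting actions already ran,
// keyed by a caller-chosen idempotency key.
// The Map stands in for durable storage (file, KV store, database).
const completedActions = new Map<string, unknown>();

async function runOnce<T>(key: string, action: () => Promise<T>): Promise<T> {
  if (completedActions.has(key)) {
    // Already done on an earlier attempt: return the recorded result
    // and skip the side effect
    return completedActions.get(key) as T;
  }
  const result = await action();
  completedActions.set(key, result); // Recorded only after the action succeeded
  return result;
}

// Usage (inside a retried operation):
//   await runOnce(`write-report:${taskId}`, () => writeReport(taskId));
```

This sketch records completion after the action succeeds, so a crash between the action and the record can still produce one duplicate. Where that matters, pass the same key to the downstream system so it can deduplicate as well.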
Circuit Breakers
Retries alone keep hammering a service that's fundamentally down. A circuit breaker automatically blocks requests after a threshold of consecutive failures, giving the service time to recover.
Three States
A circuit breaker has three states:
Closed: Normal operation. Requests pass through. Failures are counted, and the breaker transitions to Open when the threshold is hit.
Open: Blocked. Requests are immediately rejected and fallback logic executes. After a timeout period, the breaker transitions to Half-Open.
Half-Open: A limited number of test requests are allowed through. If enough of them succeed in a row, the breaker returns to Closed. If any of them fails, it returns to Open.
// Circuit breaker for AI agent API and tool calls
// Automatically blocks requests to failing dependencies
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

interface CircuitBreakerConfig {
  failureThreshold: number;    // Failures needed to trip to Open
  resetTimeoutMs: number;      // Wait time before trying Half-Open
  halfOpenMaxAttempts: number; // Test requests allowed in Half-Open
  monitorWindowMs: number;     // Sliding window for failure counting
}

class CircuitBreaker {
  private state: CircuitState = 'CLOSED';
  private failures: number[] = []; // Timestamp array
  private lastOpenTime = 0;
  private halfOpenAttempts = 0;
  private consecutiveSuccesses = 0;

  constructor(
    private name: string,
    private config: CircuitBreakerConfig = {
      failureThreshold: 5,
      resetTimeoutMs: 30000,
      halfOpenMaxAttempts: 3,
      monitorWindowMs: 60000,
    }
  ) {}

  async execute<T>(
    operation: () => Promise<T>,
    fallback?: () => Promise<T>
  ): Promise<T> {
    // Evict stale failures outside the monitoring window
    const now = Date.now();
    this.failures = this.failures.filter(
      t => now - t < this.config.monitorWindowMs
    );

    if (this.state === 'OPEN') {
      if (now - this.lastOpenTime >= this.config.resetTimeoutMs) {
        this.state = 'HALF_OPEN';
        this.halfOpenAttempts = 0;
        this.consecutiveSuccesses = 0;
        console.info(`[CircuitBreaker:${this.name}] OPEN → HALF_OPEN`);
      } else {
        console.warn(`[CircuitBreaker:${this.name}] Circuit OPEN, executing fallback`);
        if (fallback) return fallback();
        throw new Error(`Circuit breaker "${this.name}" is OPEN`);
      }
    }

    if (this.state === 'HALF_OPEN' &&
        this.halfOpenAttempts >= this.config.halfOpenMaxAttempts) {
      this.transitionToOpen();
      if (fallback) return fallback();
      throw new Error(`Circuit breaker "${this.name}": half-open attempts exhausted`);
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      // Read the state through getState(): onFailure() may have changed it, and
      // TypeScript's narrowing of this.state would otherwise reject the comparison
      if (this.getState() === 'OPEN' && fallback) {
        return fallback();
      }
      throw error;
    }
  }

  private onSuccess(): void {
    if (this.state === 'HALF_OPEN') {
      this.consecutiveSuccesses++;
      this.halfOpenAttempts++;
      // Close after 2 consecutive successes, or after 1 when only one test
      // request is allowed (otherwise halfOpenMaxAttempts: 1 could never close)
      const successesNeeded = Math.min(2, this.config.halfOpenMaxAttempts);
      if (this.consecutiveSuccesses >= successesNeeded) {
        this.state = 'CLOSED';
        this.failures = [];
        console.info(`[CircuitBreaker:${this.name}] HALF_OPEN → CLOSED (recovered)`);
      }
    }
  }

  private onFailure(): void {
    this.failures.push(Date.now());
    if (this.state === 'HALF_OPEN') {
      this.transitionToOpen();
      return;
    }
    if (this.failures.length >= this.config.failureThreshold) {
      this.transitionToOpen();
    }
  }

  private transitionToOpen(): void {
    this.state = 'OPEN';
    this.lastOpenTime = Date.now();
    console.warn(
      `[CircuitBreaker:${this.name}] → OPEN ` +
      `(failures: ${this.failures.length}, timeout: ${this.config.resetTimeoutMs}ms)`
    );
  }

  getState(): CircuitState {
    return this.state;
  }
}

// Usage: Separate breakers for each dependency
const geminiBreaker = new CircuitBreaker('gemini-api', {
  failureThreshold: 3,
  resetTimeoutMs: 30000,
  halfOpenMaxAttempts: 2,
  monitorWindowMs: 60000,
});

const gemmaBreaker = new CircuitBreaker('gemma-local', {
  failureThreshold: 2,    // Local model failures are often severe
  resetTimeoutMs: 10000,  // Recovery is faster for local models
  halfOpenMaxAttempts: 1,
  monitorWindowMs: 30000,
});
// Expected: geminiBreaker opens after 3 failures within 60s,
// waits 30s, then tests with Half-Open. 2 consecutive successes → Closed.
Why Each Dependency Needs Its Own Circuit Breaker
In Antigravity's multi-agent architecture, each agent depends on different APIs and tools. A single shared circuit breaker means one API's outage takes down agents that don't even use that API. Isolate breakers by dependency — always.
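One way to keep breakers isolated is a small registry that hands out one instance per dependency name. This is a sketch under assumptions: `BreakerRegistry` and the `Breaker` interface are hypothetical, with `Breaker` standing in for the CircuitBreaker class above so the example is self-contained.

```typescript
// Minimal stand-in for the CircuitBreaker class (assumption for this sketch)
interface Breaker {
  getState(): string;
}

// Hypothetical registry: one breaker instance per dependency name
class BreakerRegistry<B extends Breaker> {
  private breakers = new Map<string, B>();

  constructor(private create: (name: string) => B) {}

  // Returns the existing breaker for a dependency, creating it on first use
  get(name: string): B {
    let breaker = this.breakers.get(name);
    if (!breaker) {
      breaker = this.create(name);
      this.breakers.set(name, breaker);
    }
    return breaker;
  }
}

// Usage with the class from this guide:
//   const registry = new BreakerRegistry((name) => new CircuitBreaker(name));
//   await registry.get('gemini-api').execute(callGemini, callFallback);
```

Because every caller goes through `get`, two agents that share a dependency also share its failure count, while agents on different dependencies never affect each other.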
Model Fallback — Having a "Plan B" for Your AI
The most effective defense against LLM API outages is falling back to a different model. Antigravity supports switching between multiple AI models, and this guide shows you how to exploit that capability to the fullest.
Designing a Fallback Chain
Model fallback introduces a quality-versus-availability tradeoff. You need to decide in advance how much quality you're willing to sacrifice when your primary model is unavailable.
If your team doesn't know fallbacks are happening, quality silently degrades without anyone noticing. Record fallback events as metrics and surface them on your dashboard — this is non-negotiable for production systems.
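Here is a minimal sketch of a fallback chain that records every fallback as an event. Everything named here is an assumption for illustration: `generateWithFallback`, the `ModelCall` signature, and the `fallbackEvents` array (a stand-in for your metrics backend) are not Antigravity APIs.

```typescript
// Hypothetical signature for "call this model with this prompt"
type ModelCall = (model: string, prompt: string) => Promise<string>;

interface FallbackEvent {
  from: string;
  to: string;
  reason: string;
  at: string;
}

// Stand-in for a metrics backend; in production, emit these to your dashboard
const fallbackEvents: FallbackEvent[] = [];

async function generateWithFallback(
  models: string[], // Ordered from preferred to last resort
  prompt: string,
  callModel: ModelCall
): Promise<{ model: string; text: string }> {
  let lastError: unknown = new Error('No models configured');

  for (let i = 0; i < models.length; i++) {
    try {
      const text = await callModel(models[i], prompt);
      return { model: models[i], text };
    } catch (error) {
      lastError = error;
      if (i + 1 < models.length) {
        // Record the switch so quality changes are visible, not silent
        fallbackEvents.push({
          from: models[i],
          to: models[i + 1],
          reason: error instanceof Error ? error.message : String(error),
          at: new Date().toISOString(),
        });
      }
    }
  }
  throw lastError; // Every model in the chain failed
}
```

Returning the model name alongside the text lets callers tag downstream output with the model that actually produced it, which makes quality regressions easier to trace.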
Checkpoint Design — Recording "How Far Did We Get?"
When an AI agent is executing a long-running task and fails midway, starting over from scratch wastes time and resources. Checkpoints let you resume from the last successful step.
Implementation Pattern: Step-Based Checkpoints
// Checkpoint manager for agent tasks
// Persists completion state for each step, enabling resume after failure
import { readFile, writeFile, mkdir } from 'fs/promises';
import { join } from 'path';

interface CheckpointData {
  taskId: string;
  currentStep: number;
  totalSteps: number;
  completedSteps: Record<string, {
    result: unknown;
    completedAt: string;
    durationMs: number;
  }>;
  metadata: Record<string, unknown>;
  lastUpdated: string;
}

class CheckpointManager {
  private checkpointDir: string;

  constructor(baseDir: string = '.antigravity/checkpoints') {
    this.checkpointDir = baseDir;
  }

  private getPath(taskId: string): string {
    return join(this.checkpointDir, `${taskId}.json`);
  }

  async load(taskId: string): Promise<CheckpointData | null> {
    try {
      const raw = await readFile(this.getPath(taskId), 'utf-8');
      return JSON.parse(raw) as CheckpointData;
    } catch {
      return null; // No checkpoint: start from the beginning
    }
  }

  async save(data: CheckpointData): Promise<void> {
    await mkdir(this.checkpointDir, { recursive: true });
    data.lastUpdated = new Date().toISOString();
    await writeFile(this.getPath(data.taskId), JSON.stringify(data, null, 2));
  }

  async markStepComplete(
    taskId: string,
    stepName: string,
    result: unknown,
    durationMs: number
  ): Promise<void> {
    const data = await this.load(taskId);
    if (!data) throw new Error(`No checkpoint found for task: ${taskId}`);
    data.completedSteps[stepName] = {
      result,
      completedAt: new Date().toISOString(),
      durationMs,
    };
    data.currentStep++;
    await this.save(data);
  }

  isStepCompleted(checkpoint: CheckpointData, stepName: string): boolean {
    return stepName in checkpoint.completedSteps;
  }
}

// Usage: Multi-step code analysis agent with checkpoint recovery
async function runCodeAnalysisAgent(repoPath: string): Promise<void> {
  const checkpoints = new CheckpointManager();
  const taskId = `code-analysis-${repoPath.replace(/\//g, '-')}`;

  let checkpoint = await checkpoints.load(taskId);
  if (checkpoint) {
    console.log(
      `Resuming from step ${checkpoint.currentStep}/${checkpoint.totalSteps}`
    );
  } else {
    checkpoint = {
      taskId,
      currentStep: 0,
      totalSteps: 4,
      completedSteps: {},
      metadata: { repoPath },
      lastUpdated: new Date().toISOString(),
    };
    await checkpoints.save(checkpoint);
  }

  const steps = [
    { name: 'scan-files', fn: () => scanRepository(repoPath) },
    { name: 'analyze-deps', fn: () => analyzeDependencies(repoPath) },
    { name: 'security-audit', fn: () => runSecurityAudit(repoPath) },
    { name: 'generate-report', fn: () => generateReport(repoPath) },
  ];

  for (const step of steps) {
    if (checkpoints.isStepCompleted(checkpoint, step.name)) {
      console.log(`Skipping completed step: ${step.name}`);
      continue;
    }
    const start = Date.now();
    try {
      const result = await step.fn();
      await checkpoints.markStepComplete(
        taskId, step.name, result, Date.now() - start
      );
      console.log(`✅ ${step.name} completed in ${Date.now() - start}ms`);
    } catch (error) {
      console.error(`❌ ${step.name} failed: ${(error as Error).message}`);
      console.log('Checkpoint saved. Re-run to resume from this step.');
      throw error;
    }
  }
}

// Placeholder functions (replace with actual implementations)
async function scanRepository(path: string) { return { files: 150 }; }
async function analyzeDependencies(path: string) { return { deps: 42 }; }
async function runSecurityAudit(path: string) { return { issues: 3 }; }
async function generateReport(path: string) { return { reportPath: '/tmp/report.md' }; }

// Expected (second run, step 1 already completed):
// Resuming from step 1/4
// Skipping completed step: scan-files
// ✅ analyze-deps completed in 2341ms
// ✅ security-audit completed in 5102ms
// ✅ generate-report completed in 1203ms
The Checkpoint Staleness Trap
Checkpoints have a staleness problem. If external data changes after a checkpoint is saved, resuming from an old checkpoint creates inconsistencies between early and late steps. Set a TTL (time-to-live) on checkpoints and discard any that exceed it — restart from scratch when in doubt.
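A TTL check can be as small as comparing the checkpoint's lastUpdated timestamp against the current time. The sketch below assumes the ISO-8601 `lastUpdated` string from CheckpointData; `isCheckpointFresh` and the one-hour TTL in the usage comment are hypothetical choices.

```typescript
// Hypothetical TTL check for the lastUpdated timestamp stored in a checkpoint.
// `now` is a parameter so the check can be tested deterministically.
function isCheckpointFresh(
  lastUpdated: string,
  ttlMs: number,
  now: number = Date.now()
): boolean {
  const savedAt = Date.parse(lastUpdated);
  if (Number.isNaN(savedAt)) return false; // Unparseable timestamp: treat as stale
  return now - savedAt <= ttlMs;
}

// Usage inside the agent, before deciding to resume:
//   const ONE_HOUR_MS = 60 * 60 * 1000; // Example TTL, tune per task
//   let checkpoint = await checkpoints.load(taskId);
//   if (checkpoint && !isCheckpointFresh(checkpoint.lastUpdated, ONE_HOUR_MS)) {
//     checkpoint = null; // Stale: restart from scratch
//   }
```

Treating an unparseable timestamp as stale follows the same rule as the text: when in doubt, restart.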
Graceful Degradation — Choose Reduced Service Over Total Failure
Even when an agent can't deliver 100% functionality, returning some value to users beats returning nothing. This is the principle of graceful degradation.
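One way to express this is a ladder of tiers, tried in order, where the result says whether it came from a degraded tier. This is a sketch only: `withDegradation`, `Tier`, and the tier names in the usage comment are hypothetical.

```typescript
// Hypothetical degradation ladder: the first tier is the full-quality path,
// later tiers are progressively cheaper substitutes.
interface Tier<T> {
  name: string;
  run: () => Promise<T>;
}

interface DegradedResult<T> {
  value: T;
  tier: string;      // Which tier produced the value
  degraded: boolean; // True when anything but the first tier answered
}

async function withDegradation<T>(tiers: Tier<T>[]): Promise<DegradedResult<T>> {
  const errors: string[] = [];
  for (let i = 0; i < tiers.length; i++) {
    try {
      const value = await tiers[i].run();
      return { value, tier: tiers[i].name, degraded: i > 0 };
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      errors.push(`${tiers[i].name}: ${message}`);
    }
  }
  throw new Error(`All tiers failed (${errors.join('; ')})`);
}

// Usage (illustrative tiers):
//   const result = await withDegradation([
//     { name: 'full-analysis', run: () => analyzeWithLlm(input) },
//     { name: 'cached-summary', run: () => readCachedSummary(input) },
//     { name: 'static-notice', run: async () => 'Analysis is temporarily limited.' },
//   ]);
//   if (result.degraded) showBanner(`Reduced mode: ${result.tier}`);
```

Surfacing the `degraded` flag matters as much as the fallback itself: users and dashboards should be able to tell a reduced answer from a full one.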
Pitfall 1: Thinking More Retries Will Fix Everything
Setting retries to 10 or 20 accomplishes nothing when the API is fundamentally down. Excessive retries actually delay API recovery by piling on load. Cap retries at around 5 and combine them with circuit breakers.
Pitfall 2: Not Distinguishing Error Types
Treating all errors with the same retry strategy is dangerous. A 429 (rate limit) will resolve with waiting, but a 401 (auth failure) or 400 (bad request) will never succeed no matter how many times you retry. Separate "retryable" from "fail immediately" based on error codes.
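The split can be written as a small classifier. This is a minimal sketch using the status codes named in this guide; `classifyStatus` is a hypothetical helper, and treating a missing status as retryable is my assumption (it usually indicates a network-level failure such as a timeout).

```typescript
type RetryDecision = 'retry' | 'fail';

// Transient statuses: rate limiting and server-side errors
const RETRYABLE_STATUS = new Set<number>([429, 500, 502, 503, 504]);

function classifyStatus(status: number | undefined): RetryDecision {
  // No HTTP status at all: assume a transient network failure (assumption)
  if (status === undefined) return 'retry';
  // Everything else, including 400 and 401, fails immediately
  return RETRYABLE_STATUS.has(status) ? 'retry' : 'fail';
}

console.log(classifyStatus(429)); // retry
console.log(classifyStatus(401)); // fail
console.log(classifyStatus(400)); // fail
```

An allowlist of retryable codes is the safer default: an unknown status fails fast instead of being retried by accident.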
Pitfall 3: Ignoring Quality Differences Between Fallback Models
When you fall back from Gemini Pro to Gemini Flash, output quality can change significantly. Adapt your prompts during fallback — add more detailed instructions, enforce stricter output formats — to mitigate the quality drop.
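As a rough sketch of prompt adaptation, the helper below appends stricter instructions only when a fallback model handles the request. `buildPrompt` is hypothetical, and the specific rules and JSON keys are placeholders to replace with the constraints your own output format needs.

```typescript
// Hypothetical: tighten the prompt when a fallback model handles the request.
// The rules below are examples, not a recommended universal set.
function buildPrompt(task: string, isFallback: boolean): string {
  if (!isFallback) return task;
  return [
    task,
    '',
    'Follow these rules strictly:',
    '1. Respond with valid JSON only, with no surrounding prose.',
    '2. Use exactly these keys: "summary", "risks".',
    '3. If information is missing, use null instead of guessing.',
  ].join('\n');
}

console.log(buildPrompt('Review this diff for security issues.', true));
```

Keeping the primary prompt untouched means the extra constraints cost nothing in the normal path and only apply when quality is most at risk.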
Pitfall 4: Not Considering Checkpoint Storage
If you're writing checkpoints to the filesystem, consider disk capacity and I/O performance. Many agents writing checkpoints simultaneously can create I/O bottlenecks. For production, consider Redis or a KV store.
Pitfall 5: Never Simulating Failures in Test Environments
Implementing resilience patterns without testing them under actual failure conditions is pointless. Adopt Chaos Engineering practices — deliberately inject API errors and timeouts in your test suite.
// Fault injection test: verify circuit breaker transitions
// Runs in Antigravity's test runner
async function testCircuitBreakerTransition(): Promise<void> {
  const breaker = new CircuitBreaker('test-breaker', {
    failureThreshold: 3,
    resetTimeoutMs: 1000,
    halfOpenMaxAttempts: 2,
    monitorWindowMs: 5000,
  });

  let callCount = 0;
  const failingOperation = async () => {
    callCount++;
    throw new Error('Simulated API failure');
  };

  // Should transition to Open after 3 failures
  for (let i = 0; i < 3; i++) {
    try {
      await breaker.execute(failingOperation);
    } catch { /* expected */ }
  }
  console.assert(
    breaker.getState() === 'OPEN',
    `Expected OPEN, got ${breaker.getState()}`
  );
  console.log(`✅ Circuit opened after ${callCount} failures`);

  // During Open state, fallback should be called
  const fallbackResult = await breaker.execute(
    failingOperation,
    async () => 'fallback-value'
  );
  console.assert(
    fallbackResult === 'fallback-value',
    'Fallback should be returned when circuit is OPEN'
  );
  console.log('✅ Fallback executed during OPEN state');

  // After the 1-second reset timeout, the breaker is eligible for Half-Open,
  // but the state only changes on the next execute() call
  await new Promise(r => setTimeout(r, 1100));
  console.assert(
    breaker.getState() === 'OPEN', // Still OPEN until next execute() call
    'Should still be OPEN until next execute()'
  );
  console.log('✅ All circuit breaker transitions verified');
}

testCircuitBreakerTransition();

// Expected:
// ✅ Circuit opened after 3 failures
// ✅ Fallback executed during OPEN state
// ✅ All circuit breaker transitions verified
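For injecting failures into real call paths rather than a hand-written failing function, a wrapper like the following can help. It is a sketch under assumptions: `withFaultInjection` is a hypothetical helper, and the injected error carries a `status` property so it looks like the HTTP errors the retry logic in this guide inspects.

```typescript
// Hypothetical fault injector: wraps an operation and fails a fraction of calls.
// `random` is injectable so tests can be made deterministic.
function withFaultInjection<T>(
  operation: () => Promise<T>,
  failureRate: number, // 0 = never fail, 1 = always fail
  status: number = 503,
  random: () => number = Math.random
): () => Promise<T> {
  return async () => {
    if (random() < failureRate) {
      // Attach a status so retry and breaker logic treat it like an API error
      throw Object.assign(new Error(`Injected failure (${status})`), { status });
    }
    return operation();
  };
}

// Usage in a test suite (illustrative):
//   const flakyCall = withFaultInjection(() => callModel(prompt), 0.3, 429);
//   await withRetry(flakyCall, { maxRetries: 5, baseDelayMs: 10 });
```

Keep the injector out of production code paths, or guard it behind an explicit test-only flag, so a stray configuration value cannot cause real outages.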
Monitoring and Alerting — Minimizing Time from Detection to Recovery
Even with resilience patterns in place, you can't improve what you can't see. Continuous monitoring of agent health with early anomaly detection is essential.
Key Metrics for AI Agent Monitoring
AI agent monitoring demands different metrics than traditional web services.
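As a rough sketch of what per-call metrics for an agent might look like, consider the record and aggregation below. Apart from `estimatedCostUsd`, which this guide refers to later, every field name and the `summarize` helper are assumptions for illustration.

```typescript
// Hypothetical per-call metric record for an agent's LLM call
interface AgentCallMetric {
  agent: string;
  model: string;
  success: boolean;
  latencyMs: number;
  totalTokens: number;
  retries: number;
  usedFallback: boolean;
  estimatedCostUsd: number;
}

interface MetricsSummary {
  calls: number;
  errorRate: number;    // Fraction of failed calls
  fallbackRate: number; // Fraction of calls served by a fallback model
  avgLatencyMs: number;
  totalCostUsd: number;
}

function summarize(metrics: AgentCallMetric[]): MetricsSummary {
  const calls = metrics.length;
  if (calls === 0) {
    return { calls: 0, errorRate: 0, fallbackRate: 0, avgLatencyMs: 0, totalCostUsd: 0 };
  }
  const failures = metrics.filter(m => !m.success).length;
  const fallbacks = metrics.filter(m => m.usedFallback).length;
  const latencySum = metrics.reduce((sum, m) => sum + m.latencyMs, 0);
  const totalCostUsd = metrics.reduce((sum, m) => sum + m.estimatedCostUsd, 0);
  return {
    calls,
    errorRate: failures / calls,
    fallbackRate: fallbacks / calls,
    avgLatencyMs: latencySum / calls,
    totalCostUsd,
  };
}
```

Token counts, fallback rate, and cost are the parts that have no direct equivalent in typical web-service dashboards, which is why they are tracked per call here.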
Before deploying these patterns to production, there are several practical considerations that documentation rarely covers.
Tuning Circuit Breaker Thresholds
The default thresholds in this guide are starting points, not production values. Your optimal failureThreshold depends on your traffic volume and acceptable error rate. A high-traffic agent processing thousands of requests per hour might set the threshold at 10-15 failures within a 30-second window. A low-traffic agent handling a few requests per minute might trip at just 2-3 failures.
The resetTimeoutMs should reflect how long your dependencies typically take to recover. For cloud LLM APIs like Gemini, 30-60 seconds is usually appropriate — rate limit windows typically reset in that range. For local models running on Ollama, recovery is often faster (restart takes 5-10 seconds), so a shorter timeout works.
Cost Implications of Fallback Chains
Model fallback has a direct cost impact that is easy to overlook. If your primary model (Gemini Pro) costs $7 per million tokens and your fallback (Gemini Flash) costs $0.30 per million tokens, a sustained fallback period actually saves money. But if your fallback is a premium model from another provider, costs can spike unexpectedly.
Track estimatedCostUsd in your metrics and set budget alerts. I have seen teams hit their monthly API budget in a single week because a circuit breaker misconfiguration kept routing traffic to an expensive fallback model.
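The arithmetic behind `estimatedCostUsd` is simple enough to sketch. The prices below are only the example figures from this section, not current list prices, and the model keys, `estimateCostUsd`, `isOverBudget`, and the 80% alert threshold are all hypothetical.

```typescript
// Example prices per million tokens, taken from the figures in this section.
// Replace with your provider's actual pricing.
const PRICE_PER_MILLION_USD: Record<string, number> = {
  'gemini-pro': 7,
  'gemini-flash': 0.3,
};

function estimateCostUsd(model: string, tokens: number): number {
  const price = PRICE_PER_MILLION_USD[model];
  if (price === undefined) {
    throw new Error(`No price configured for model: ${model}`);
  }
  return (tokens / 1_000_000) * price;
}

// Alert once spending reaches a fraction of the monthly budget (default 80%)
function isOverBudget(
  spentUsd: number,
  monthlyBudgetUsd: number,
  alertFraction: number = 0.8
): boolean {
  return spentUsd >= monthlyBudgetUsd * alertFraction;
}

console.log(estimateCostUsd('gemini-pro', 2_000_000)); // 14
```

Throwing on an unknown model is deliberate: a fallback model with no configured price is exactly the case where costs tend to go unnoticed.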
Checkpoint Storage in Serverless Environments
If you are deploying agents on Cloudflare Workers, Vercel Edge Functions, or similar serverless platforms, filesystem-based checkpoints will not work since there is no persistent local storage. Use Cloudflare KV, Upstash Redis, or DynamoDB as your checkpoint backend instead. The CheckpointManager class is designed with this swap in mind — replace the readFile/writeFile calls with your storage client of choice.
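One way to make that swap explicit is to put a two-method storage interface between the checkpoint logic and the backend. This is a sketch, not the shape of any particular client: `CheckpointStore`, `MemoryCheckpointStore`, and the `checkpoint:` key prefix are hypothetical, and a KV or Redis adapter would wrap its own client behind the same two methods.

```typescript
// Hypothetical storage interface so checkpoint logic does not depend on the filesystem
interface CheckpointStore {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// In-memory implementation for tests and local runs
class MemoryCheckpointStore implements CheckpointStore {
  private data = new Map<string, string>();

  async get(key: string): Promise<string | null> {
    return this.data.get(key) ?? null;
  }

  async put(key: string, value: string): Promise<void> {
    this.data.set(key, value);
  }
}

async function loadCheckpoint<T>(store: CheckpointStore, taskId: string): Promise<T | null> {
  const raw = await store.get(`checkpoint:${taskId}`);
  return raw === null ? null : (JSON.parse(raw) as T);
}

async function saveCheckpoint<T>(store: CheckpointStore, taskId: string, data: T): Promise<void> {
  await store.put(`checkpoint:${taskId}`, JSON.stringify(data));
}
```

With this split, the same agent code can run against the in-memory store in tests and a remote store in production by changing only the object passed in.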
Your Next Step
You could copy all the code from this guide into your project right now, but I'd recommend starting with the single highest-impact pattern. Most production agent failures come from calling LLM APIs without any retry logic at all. Add the withRetry function to your project and wrap every LLM API call with it. That alone will noticeably reduce your middle-of-the-night incident count.
Circuit breakers and fallback chains make sense when you move to multi-agent architectures. For a single-agent setup, the added complexity isn't justified. Add these patterns incrementally, as the need arises — that's the practical approach.