Agents & Manager / 2026-04-13 / Advanced

Resilient AI Agents in Antigravity — Retry, Circuit Breakers, and Fallback Strategies for Production

Build fault-tolerant AI agents in Antigravity with production-grade retry strategies, circuit breakers, model fallback chains, and checkpoint recovery. Every pattern you need to keep agents running reliably.

Why AI Agents Break in Production

Everything works fine when you're testing your Antigravity AI agents in development. Then you deploy to production, and things break in ways you never imagined. The LLM API returns a 429. Responses get cut off mid-stream. Token limits get hit and the agent halts in an inconsistent state.

I confronted this problem head-on three days after deploying a multi-agent pipeline to production. At 2 AM, Gemini API rate limits kicked in and five agents failed in a cascade. Recovery took four hours.

This guide shares every resilience pattern I built after that incident. Circuit breakers, retry strategies, model fallback, checkpoint design — everything you need to keep AI agents stable in production, with working code you can drop into your Antigravity project.

Designing Retry Strategies — Exponential Backoff and Jitter Done Right

AI agent API calls demand different retry strategies than traditional web services. LLM APIs take seconds to tens of seconds to respond, so naive retries cause wait times to explode. When multiple agents retry simultaneously, you get "retry storms" that make the API situation worse.

Why Fixed-Interval Retries Fail

Fixed-interval retries (say, 3 seconds every time) cause all agents to hit the API at the same instant during an outage. This is the classic "Thundering Herd" problem, and it actively delays API recovery.

Exponential Backoff with Jitter is the standard solution. You increase wait time exponentially while adding randomness to spread retries across time.

// Retry utility for Antigravity projects
// Wraps agent API calls with automatic retry logic
 
interface RetryConfig {
  maxRetries: number;        // Maximum retry attempts
  baseDelayMs: number;       // Initial wait time (milliseconds)
  maxDelayMs: number;        // Wait time ceiling
  jitterFactor: number;      // Jitter coefficient (0–1)
  retryableErrors: number[]; // HTTP status codes eligible for retry
}
 
const DEFAULT_CONFIG: RetryConfig = {
  maxRetries: 5,
  baseDelayMs: 1000,
  maxDelayMs: 60000,
  jitterFactor: 0.5,
  retryableErrors: [429, 500, 502, 503, 504],
};
 
async function withRetry<T>(
  operation: () => Promise<T>,
  config: Partial<RetryConfig> = {}
): Promise<T> {
  const cfg = { ...DEFAULT_CONFIG, ...config };
  let lastError: Error | null = null;
 
  for (let attempt = 0; attempt <= cfg.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error: unknown) {
      lastError = error instanceof Error ? error : new Error(String(error));
 
      // Non-retryable errors fail immediately
      const status = (error as { status?: number }).status;
      if (status && !cfg.retryableErrors.includes(status)) {
        throw error;
      }
 
      if (attempt === cfg.maxRetries) {
        break; // Exhausted all retries
      }
 
      // Exponential backoff with symmetric jitter: ±(jitterFactor / 2) around the capped delay
      const exponentialDelay = cfg.baseDelayMs * Math.pow(2, attempt);
      const cappedDelay = Math.min(exponentialDelay, cfg.maxDelayMs);
      const jitter = cappedDelay * cfg.jitterFactor * Math.random();
      const finalDelay = cappedDelay - (cappedDelay * cfg.jitterFactor / 2) + jitter;
 
      console.warn(
        `[Retry ${attempt + 1}/${cfg.maxRetries}] ` +
        `Waiting ${Math.round(finalDelay)}ms before retry. ` +
        `Error: ${lastError.message}`
      );
 
      await new Promise(resolve => setTimeout(resolve, finalDelay));
    }
  }
 
  throw new Error(
    `Operation failed after ${cfg.maxRetries} retries. Last error: ${lastError?.message}`
  );
}
 
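// Usage: Apply retry logic to a Gemini API call.
// fetch() resolves (does not throw) on HTTP errors such as 429, so convert
// non-OK responses into errors that carry a status for withRetry to inspect.
const response = await withRetry(
  async () => {
    const res = await fetch('https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'x-goog-api-key': process.env.GEMINI_API_KEY ?? '',
      },
      body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
    });
    if (!res.ok) {
      throw Object.assign(new Error(`API error: ${res.status}`), { status: res.status });
    }
    return res;
  },
  { maxRetries: 3, baseDelayMs: 2000 }
);
// Expected: On 429, retries at ~2s → ~4s → ~8s intervals (±25% jitter), returns response on success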
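
To make the schedule concrete, the sketch below computes the range of waits that the jitter formula in withRetry can produce for each attempt. `backoffBounds` and `BackoffParams` are hypothetical names introduced for this illustration; they assume the same `baseDelayMs`, `maxDelayMs`, and `jitterFactor` semantics as the config above.

```typescript
// Hypothetical helper: bounds of the jittered delay used by withRetry above.
// Assumes finalDelay = capped * (1 - jitterFactor / 2) + capped * jitterFactor * random,
// with random in [0, 1).
interface BackoffParams {
  baseDelayMs: number;
  maxDelayMs: number;
  jitterFactor: number;
}

function backoffBounds(attempt: number, p: BackoffParams): { minMs: number; maxMs: number } {
  const capped = Math.min(p.baseDelayMs * Math.pow(2, attempt), p.maxDelayMs);
  const minMs = capped * (1 - p.jitterFactor / 2); // random = 0
  const maxMs = capped * (1 + p.jitterFactor / 2); // random approaching 1 (exclusive bound)
  return { minMs, maxMs };
}

const params: BackoffParams = { baseDelayMs: 2000, maxDelayMs: 60000, jitterFactor: 0.5 };
const schedule = [0, 1, 2].map(attempt => backoffBounds(attempt, params));
// schedule:
//   attempt 0 → 1500 to 2500 ms
//   attempt 1 → 3000 to 5000 ms
//   attempt 2 → 6000 to 10000 ms
```

Note that the upper bound can exceed `maxDelayMs` by up to `jitterFactor / 2`; clamp the final value if a hard ceiling matters.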

Retries and Idempotency

Here's a subtlety that's easy to miss. Retrying means potentially executing the same operation multiple times. LLM queries themselves are safe to repeat: the output may differ between calls, but nothing outside the model changes. When agents perform side effects such as writing files or calling external APIs, however, retries can cause duplicate executions.

The solution is recording checkpoints before executing actions, then checking "was this already done?" on retry. We'll cover this in detail in a later section.
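
A minimal sketch of that "was this already done?" check follows. It assumes each side effect can be given a stable key; `runOnce` and the in-memory `Map` are illustrative stand-ins for a persistent checkpoint store, not part of Antigravity.

```typescript
// Hypothetical idempotency guard: run a side-effecting action at most once per key.
// The Map is an in-memory stand-in; a real agent would persist completed keys.
const completedActions = new Map<string, unknown>();

async function runOnce<T>(key: string, action: () => Promise<T>): Promise<T> {
  if (completedActions.has(key)) {
    // Already done on a previous attempt: return the recorded result, skip the side effect
    return completedActions.get(key) as T;
  }
  const result = await action();
  completedActions.set(key, result); // record only after the action succeeded
  return result;
}

// Usage: a retried step performs the write once, even if the step itself runs twice
let writes = 0;
const writeReport = async (): Promise<string> => {
  writes++; // stands in for a real file write
  return 'report.md';
};

async function demo(): Promise<number> {
  await runOnce('task-42:write-report', writeReport);
  await runOnce('task-42:write-report', writeReport); // retry: skipped
  return writes; // 1
}
```

Recording after success still leaves a small window: a crash between the action and the record causes one duplicate on resume. Closing that window requires the downstream system itself to deduplicate, for example by accepting an idempotency key.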

Circuit Breakers — Stopping Cascading Failures

Retries alone keep hammering a service that's fundamentally down. A circuit breaker automatically blocks requests once the number of failures within a monitoring window reaches a threshold, giving the service time to recover.

Three States

A circuit breaker has three states:

Closed: Normal operation. Requests pass through. Failures are counted, and the breaker transitions to Open when the threshold is hit.

Open: Blocked. Requests are immediately rejected and fallback logic executes. After a timeout period, the breaker transitions to Half-Open.

Half-Open: A limited number of test requests are allowed through. If they succeed, the breaker returns to Closed (the implementation below requires two consecutive successes). If any test request fails, it returns to Open.
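
The transitions above can be read as a small state machine. The sketch below writes them as a pure function, simplified to a single probe in Half-Open; the event names are hypothetical labels chosen for this illustration.

```typescript
// Sketch of the three-state machine as a pure transition function.
// Event names are hypothetical labels, not part of any library.
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';
type BreakerEvent =
  | 'threshold_reached'
  | 'reset_timeout_elapsed'
  | 'probe_succeeded'
  | 'probe_failed';

function nextState(state: BreakerState, event: BreakerEvent): BreakerState {
  switch (state) {
    case 'CLOSED':
      return event === 'threshold_reached' ? 'OPEN' : 'CLOSED';
    case 'OPEN':
      return event === 'reset_timeout_elapsed' ? 'HALF_OPEN' : 'OPEN';
    case 'HALF_OPEN':
      if (event === 'probe_succeeded') return 'CLOSED';
      if (event === 'probe_failed') return 'OPEN';
      return 'HALF_OPEN';
  }
}

// A full outage-and-recovery cycle
const events: BreakerEvent[] = ['threshold_reached', 'reset_timeout_elapsed', 'probe_succeeded'];
let current: BreakerState = 'CLOSED';
const path: BreakerState[] = [current];
for (const event of events) {
  current = nextState(current, event);
  path.push(current);
}
// path: CLOSED → OPEN → HALF_OPEN → CLOSED
```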

// Circuit breaker for AI agent API and tool calls
// Automatically blocks requests to failing dependencies
 
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';
 
interface CircuitBreakerConfig {
  failureThreshold: number;   // Failures needed to trip to Open
  resetTimeoutMs: number;     // Wait time before trying Half-Open
  halfOpenMaxAttempts: number; // Test requests allowed in Half-Open
  monitorWindowMs: number;    // Sliding window for failure counting
}
 
class CircuitBreaker {
  private state: CircuitState = 'CLOSED';
  private failures: number[] = []; // Timestamp array
  private lastOpenTime = 0;
  private halfOpenAttempts = 0;
  private consecutiveSuccesses = 0;
 
  constructor(
    private name: string,
    private config: CircuitBreakerConfig = {
      failureThreshold: 5,
      resetTimeoutMs: 30000,
      halfOpenMaxAttempts: 3,
      monitorWindowMs: 60000,
    }
  ) {}
 
  async execute<T>(
    operation: () => Promise<T>,
    fallback?: () => Promise<T>
  ): Promise<T> {
    // Evict stale failures outside the monitoring window
    const now = Date.now();
    this.failures = this.failures.filter(
      t => now - t < this.config.monitorWindowMs
    );
 
    if (this.state === 'OPEN') {
      if (now - this.lastOpenTime >= this.config.resetTimeoutMs) {
        this.state = 'HALF_OPEN';
        this.halfOpenAttempts = 0;
        this.consecutiveSuccesses = 0;
        console.info(`[CircuitBreaker:${this.name}] OPEN → HALF_OPEN`);
      } else {
        console.warn(`[CircuitBreaker:${this.name}] Circuit OPEN — executing fallback`);
        if (fallback) return fallback();
        throw new Error(`Circuit breaker "${this.name}" is OPEN`);
      }
    }
 
    if (this.state === 'HALF_OPEN' && this.halfOpenAttempts >= this.config.halfOpenMaxAttempts) {
      this.transitionToOpen();
      if (fallback) return fallback();
      throw new Error(`Circuit breaker "${this.name}" — half-open attempts exhausted`);
    }
 
    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      if (this.state === 'OPEN' && fallback) {
        return fallback();
      }
      throw error;
    }
  }
 
  private onSuccess(): void {
    if (this.state === 'HALF_OPEN') {
      this.consecutiveSuccesses++;
      this.halfOpenAttempts++;
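      // Close after 2 consecutive successes, or fewer if halfOpenMaxAttempts allows fewer probes
      if (this.consecutiveSuccesses >= Math.min(2, this.config.halfOpenMaxAttempts)) {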
        this.state = 'CLOSED';
        this.failures = [];
        console.info(`[CircuitBreaker:${this.name}] HALF_OPEN → CLOSED (recovered)`);
      }
    }
  }
 
  private onFailure(): void {
    this.failures.push(Date.now());
 
    if (this.state === 'HALF_OPEN') {
      this.transitionToOpen();
      return;
    }
 
    if (this.failures.length >= this.config.failureThreshold) {
      this.transitionToOpen();
    }
  }
 
  private transitionToOpen(): void {
    this.state = 'OPEN';
    this.lastOpenTime = Date.now();
    console.warn(
      `[CircuitBreaker:${this.name}] → OPEN ` +
      `(failures: ${this.failures.length}, timeout: ${this.config.resetTimeoutMs}ms)`
    );
  }
 
  getState(): CircuitState {
    return this.state;
  }
}
 
// Usage: Separate breakers for each dependency
const geminiBreaker = new CircuitBreaker('gemini-api', {
  failureThreshold: 3,
  resetTimeoutMs: 30000,
  halfOpenMaxAttempts: 2,
  monitorWindowMs: 60000,
});
 
const gemmaBreaker = new CircuitBreaker('gemma-local', {
  failureThreshold: 2,        // Local model failures are often severe
  resetTimeoutMs: 10000,      // Recovery is faster for local models
  halfOpenMaxAttempts: 1,
  monitorWindowMs: 30000,
});
// Expected: geminiBreaker opens after 3 failures within 60s,
// waits 30s, then tests with Half-Open. 2 consecutive successes → Closed.

Why Each Dependency Needs Its Own Circuit Breaker

In Antigravity's multi-agent architecture, each agent depends on different APIs and tools. A single shared circuit breaker means one API's outage takes down agents that don't even use that API. Isolate breakers by dependency — always.

Model Fallback — Having a "Plan B" for Your AI

The most effective defense against LLM API outages is falling back to a different model. Antigravity supports switching between multiple AI models, and this guide shows you how to exploit that capability to the fullest.

Designing a Fallback Chain

Model fallback introduces a quality-versus-availability tradeoff. You need to decide in advance how much quality you're willing to sacrifice when your primary model is unavailable.

// Model fallback chain
// Tries models from highest to lowest quality, returns first successful response
 
interface ModelConfig {
  id: string;
  endpoint: string;
  apiKeyEnv: string;
  maxTokens: number;
  qualityTier: 'premium' | 'standard' | 'fallback';
  circuitBreaker: CircuitBreaker;
}
 
interface FallbackResult<T> {
  data: T;
  modelUsed: string;
  qualityTier: string;
  wasFallback: boolean;
  attemptedModels: string[];
}
 
class ModelFallbackChain {
  private models: ModelConfig[];
 
  constructor(models: ModelConfig[]) {
    const tierOrder = { premium: 0, standard: 1, fallback: 2 };
    this.models = [...models].sort(
      (a, b) => tierOrder[a.qualityTier] - tierOrder[b.qualityTier]
    );
  }
 
  async generate(prompt: string): Promise<FallbackResult<string>> {
    const attemptedModels: string[] = [];
 
    for (const model of this.models) {
      attemptedModels.push(model.id);
 
      try {
        const result = await model.circuitBreaker.execute(
          async () => {
            const apiKey = process.env[model.apiKeyEnv];
            if (!apiKey) throw new Error(`Missing API key: ${model.apiKeyEnv}`);
 
            // Throw inside the retried operation: fetch() resolves on HTTP errors,
            // so the status check must run on every attempt for retries to trigger.
            const response = await withRetry(
              async () => {
                const res = await fetch(model.endpoint, {
                  method: 'POST',
                  headers: {
                    'Content-Type': 'application/json',
                    'Authorization': `Bearer ${apiKey}`,
                  },
                  body: JSON.stringify({
                    model: model.id,
                    messages: [{ role: 'user', content: prompt }],
                    max_tokens: model.maxTokens,
                  }),
                });
                if (!res.ok) {
                  const err = new Error(`API error: ${res.status}`);
                  (err as Error & { status: number }).status = res.status;
                  throw err;
                }
                return res;
              },
              { maxRetries: 2, baseDelayMs: 1000 }
            );
 
            const data = await response.json();
            return data.choices?.[0]?.message?.content ?? data.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
          },
          undefined
        );
 
        return {
          data: result,
          modelUsed: model.id,
          qualityTier: model.qualityTier,
          wasFallback: attemptedModels.length > 1,
          attemptedModels,
        };
      } catch (error) {
        console.warn(
          `[Fallback] ${model.id} failed (${model.qualityTier}): ${(error as Error).message}`
        );
        continue;
      }
    }
 
    throw new Error(
      `All models failed. Attempted: ${attemptedModels.join(' → ')}`
    );
  }
}
 
// Usage: Gemini Pro → Gemma 4 local → Gemini Flash fallback chain
// The chain sends OpenAI-style chat-completions requests with a Bearer token,
// so each endpoint must be OpenAI-compatible.
const chain = new ModelFallbackChain([
  {
    id: 'gemini-2.5-pro',
    endpoint: 'https://generativelanguage.googleapis.com/v1beta/openai/chat/completions',
    apiKeyEnv: 'GEMINI_API_KEY',
    maxTokens: 8192,
    qualityTier: 'premium',
    circuitBreaker: geminiBreaker,
  },
  {
    id: 'gemma-4-local', // Must match the model name your local server actually serves
    endpoint: 'http://localhost:11434/v1/chat/completions',
    apiKeyEnv: 'OLLAMA_DUMMY_KEY', // Set to any non-empty value; the chain rejects a missing key
    maxTokens: 4096,
    qualityTier: 'standard',
    circuitBreaker: gemmaBreaker,
  },
  {
    id: 'gemini-2.0-flash',
    endpoint: 'https://generativelanguage.googleapis.com/v1beta/openai/chat/completions',
    apiKeyEnv: 'GEMINI_API_KEY',
    maxTokens: 4096,
    qualityTier: 'fallback',
    circuitBreaker: new CircuitBreaker('gemini-flash'),
  },
]);
 
const result = await chain.generate('Explain the circuit breaker pattern');
console.log(`Model used: ${result.modelUsed} (${result.qualityTier})`);
console.log(`Was fallback: ${result.wasFallback}`);
// Expected (when Gemini Pro returns 429 on every attempt):
// [Fallback] gemini-2.5-pro failed (premium): <error message ending in "API error: 429">
// Model used: gemma-4-local (standard)
// Was fallback: true

Making Fallback Quality Degradation Visible

If your team doesn't know fallbacks are happening, quality silently degrades without anyone noticing. Record fallback events as metrics and surface them on your dashboard — this is non-negotiable for production systems.
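
One way to make fallbacks visible is to count which model served each request. The sketch below is a hypothetical in-memory tracker fed from the `modelUsed` and `wasFallback` fields of `FallbackResult`; the class and method names are assumptions for illustration, and a real setup would export these counts to your metrics backend.

```typescript
// Hypothetical fallback tracker: counts which model served each request.
// Names and the rate calculation are illustrative, not an Antigravity API.
class FallbackTracker {
  private total = 0;
  private fallbacks = 0;
  private byModel = new Map<string, number>();

  record(modelUsed: string, wasFallback: boolean): void {
    this.total++;
    if (wasFallback) this.fallbacks++;
    this.byModel.set(modelUsed, (this.byModel.get(modelUsed) ?? 0) + 1);
  }

  // Share of requests served by a non-primary model (0 when nothing is recorded)
  fallbackRate(): number {
    return this.total === 0 ? 0 : this.fallbacks / this.total;
  }

  requestsFor(modelUsed: string): number {
    return this.byModel.get(modelUsed) ?? 0;
  }
}

const tracker = new FallbackTracker();
tracker.record('gemini-2.5-pro', false);
tracker.record('gemini-2.5-pro', false);
tracker.record('gemini-2.5-pro', false);
tracker.record('gemma-4-local', true);
const fallbackRate = tracker.fallbackRate(); // 0.25
```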

Checkpoint Design — Recording "How Far Did We Get?"

When an AI agent is executing a long-running task and fails midway, starting over from scratch wastes time and resources. Checkpoints let you resume from the last successful step.

Implementation Pattern: Step-Based Checkpoints

// Checkpoint manager for agent tasks
// Persists completion state for each step, enabling resume after failure
 
import { readFile, writeFile, mkdir } from 'fs/promises';
import { join } from 'path';
 
interface CheckpointData {
  taskId: string;
  currentStep: number;
  totalSteps: number;
  completedSteps: Record<string, {
    result: unknown;
    completedAt: string;
    durationMs: number;
  }>;
  metadata: Record<string, unknown>;
  lastUpdated: string;
}
 
class CheckpointManager {
  private checkpointDir: string;
 
  constructor(baseDir: string = '.antigravity/checkpoints') {
    this.checkpointDir = baseDir;
  }
 
  private getPath(taskId: string): string {
    return join(this.checkpointDir, `${taskId}.json`);
  }
 
  async load(taskId: string): Promise<CheckpointData | null> {
    try {
      const raw = await readFile(this.getPath(taskId), 'utf-8');
      return JSON.parse(raw) as CheckpointData;
    } catch {
      return null; // No checkpoint — start from the beginning
    }
  }
 
  async save(data: CheckpointData): Promise<void> {
    await mkdir(this.checkpointDir, { recursive: true });
    data.lastUpdated = new Date().toISOString();
    await writeFile(this.getPath(data.taskId), JSON.stringify(data, null, 2));
  }
 
  async markStepComplete(
    taskId: string,
    stepName: string,
    result: unknown,
    durationMs: number
  ): Promise<void> {
    const data = await this.load(taskId);
    if (!data) throw new Error(`No checkpoint found for task: ${taskId}`);
 
    data.completedSteps[stepName] = {
      result,
      completedAt: new Date().toISOString(),
      durationMs,
    };
    data.currentStep++;
    await this.save(data);
  }
 
  isStepCompleted(checkpoint: CheckpointData, stepName: string): boolean {
    return stepName in checkpoint.completedSteps;
  }
}
 
// Usage: Multi-step code analysis agent with checkpoint recovery
async function runCodeAnalysisAgent(repoPath: string): Promise<void> {
  const checkpoints = new CheckpointManager();
  const taskId = `code-analysis-${repoPath.replace(/\//g, '-')}`;
 
  let checkpoint = await checkpoints.load(taskId);
  if (checkpoint) {
    console.log(
      `Resuming from step ${checkpoint.currentStep}/${checkpoint.totalSteps}`
    );
  } else {
    checkpoint = {
      taskId,
      currentStep: 0,
      totalSteps: 4,
      completedSteps: {},
      metadata: { repoPath },
      lastUpdated: new Date().toISOString(),
    };
    await checkpoints.save(checkpoint);
  }
 
  const steps = [
    { name: 'scan-files', fn: () => scanRepository(repoPath) },
    { name: 'analyze-deps', fn: () => analyzeDependencies(repoPath) },
    { name: 'security-audit', fn: () => runSecurityAudit(repoPath) },
    { name: 'generate-report', fn: () => generateReport(repoPath) },
  ];
 
  for (const step of steps) {
    if (checkpoints.isStepCompleted(checkpoint, step.name)) {
      console.log(`Skipping completed step: ${step.name}`);
      continue;
    }
 
    const start = Date.now();
    try {
      const result = await step.fn();
      await checkpoints.markStepComplete(
        taskId, step.name, result, Date.now() - start
      );
      console.log(`✅ ${step.name} completed in ${Date.now() - start}ms`);
    } catch (error) {
      console.error(`❌ ${step.name} failed: ${(error as Error).message}`);
      console.log('Checkpoint saved. Re-run to resume from this step.');
      throw error;
    }
  }
}
 
// Placeholder functions (replace with actual implementations)
async function scanRepository(path: string) { return { files: 150 }; }
async function analyzeDependencies(path: string) { return { deps: 42 }; }
async function runSecurityAudit(path: string) { return { issues: 3 }; }
async function generateReport(path: string) { return { reportPath: '/tmp/report.md' }; }
// Expected (second run, step 1 already completed):
// Resuming from step 1/4
// Skipping completed step: scan-files
// ✅ analyze-deps completed in 2341ms
// ✅ security-audit completed in 5102ms
// ✅ generate-report completed in 1203ms

The Checkpoint Staleness Trap

Checkpoints have a staleness problem. If external data changes after a checkpoint is saved, resuming from an old checkpoint creates inconsistencies between early and late steps. Set a TTL (time-to-live) on checkpoints and discard any that exceed it — restart from scratch when in doubt.
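
A TTL check can be as small as the sketch below. `isCheckpointFresh` is a hypothetical helper that reads the `lastUpdated` field written by `CheckpointManager.save`; the one-hour TTL is an arbitrary example value.

```typescript
// Hypothetical staleness check for a checkpoint's lastUpdated timestamp (ISO 8601).
// The 1-hour TTL is an example value; tune it to how fast your inputs change.
const CHECKPOINT_TTL_MS = 60 * 60 * 1000;

function isCheckpointFresh(
  lastUpdated: string,
  nowMs: number = Date.now(),
  ttlMs: number = CHECKPOINT_TTL_MS
): boolean {
  const savedAt = Date.parse(lastUpdated);
  if (Number.isNaN(savedAt)) return false; // unreadable timestamp: treat as stale
  return nowMs - savedAt <= ttlMs;
}

const referenceNow = Date.parse('2026-04-13T12:00:00.000Z');
const fresh = isCheckpointFresh('2026-04-13T11:30:00.000Z', referenceNow); // true  (30 minutes old)
const stale = isCheckpointFresh('2026-04-13T09:00:00.000Z', referenceNow); // false (3 hours old)
```

In `runCodeAnalysisAgent`, a check like this would run right after `checkpoints.load(taskId)`, with a stale checkpoint treated the same as a missing one.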

Graceful Degradation — Choose Reduced Service Over Total Failure

Even when an agent can't deliver 100% functionality, returning some value to users beats returning nothing. This is the principle of graceful degradation.

Three-Level Degradation Strategy

// Graceful degradation manager
// Automatically adjusts feature availability based on system health
 
type ServiceLevel = 'full' | 'degraded' | 'minimal';
 
interface DegradationPolicy {
  level: ServiceLevel;
  enabledFeatures: string[];
  disabledFeatures: string[];
  userMessage: string;
}
 
class GracefulDegradation {
  private policies: Record<ServiceLevel, DegradationPolicy> = {
    full: {
      level: 'full',
      enabledFeatures: ['ai-generation', 'code-analysis', 'multi-agent', 'streaming'],
      disabledFeatures: [],
      userMessage: '',
    },
    degraded: {
      level: 'degraded',
      enabledFeatures: ['ai-generation', 'code-analysis'],
      disabledFeatures: ['multi-agent', 'streaming'],
      userMessage: 'Running with reduced features. Basic AI generation and code analysis are available.',
    },
    minimal: {
      level: 'minimal',
      enabledFeatures: ['code-analysis'],
      disabledFeatures: ['ai-generation', 'multi-agent', 'streaming'],
      userMessage: 'Cannot reach AI APIs. Only local analysis is available.',
    },
  };
 
  private currentLevel: ServiceLevel = 'full';
 
  evaluate(circuitBreakers: Map<string, CircuitBreaker>): DegradationPolicy {
    const openBreakers = Array.from(circuitBreakers.entries())
      .filter(([_, cb]) => cb.getState() === 'OPEN')
      .map(([name]) => name);
 
    if (openBreakers.length === 0) {
      this.currentLevel = 'full';
    } else if (openBreakers.length === circuitBreakers.size) {
      // Every registered dependency is down, whatever names the breakers are keyed by
      this.currentLevel = 'minimal';
    } else {
      this.currentLevel = 'degraded';
    }
 
    const policy = this.policies[this.currentLevel];
    if (this.currentLevel !== 'full') {
      console.warn(
        `[Degradation] Level: ${this.currentLevel} | ` +
        `Open breakers: ${openBreakers.join(', ')} | ` +
        `Disabled: ${policy.disabledFeatures.join(', ')}`
      );
    }
    return policy;
  }
 
  isFeatureEnabled(feature: string): boolean {
    return this.policies[this.currentLevel].enabledFeatures.includes(feature);
  }
}
 
// Usage
const degradation = new GracefulDegradation();
const breakers = new Map<string, CircuitBreaker>([
  ['gemini-api', geminiBreaker],
  ['gemma-local', gemmaBreaker],
]);
 
const policy = degradation.evaluate(breakers);
if (!degradation.isFeatureEnabled('multi-agent')) {
  console.log('Multi-agent mode temporarily disabled');
  // Fall back to single-agent mode
}
// Expected (when geminiBreaker is OPEN):
// [Degradation] Level: degraded | Open breakers: gemini-api | Disabled: multi-agent, streaming
// Multi-agent mode temporarily disabled

Common Mistakes and Pitfalls

Pitfall 1: Thinking More Retries Will Fix Everything

Setting retries to 10 or 20 accomplishes nothing when the API is fundamentally down. Excessive retries actually delay API recovery by piling on load. Cap retries at around 5 and combine them with circuit breakers.

Pitfall 2: Not Distinguishing Error Types

Treating all errors with the same retry strategy is dangerous. A 429 (rate limit) will resolve with waiting, but a 401 (auth failure) or 400 (bad request) will never succeed no matter how many times you retry. Separate "retryable" from "fail immediately" based on error codes.
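
The split can be made explicit in code. The sketch below classifies a status using the same list as `retryableErrors`, and also parses a `Retry-After` header in its delay-seconds form. Whether your provider sends that header on 429 or 503 responses is something to verify; the function and type names here are hypothetical.

```typescript
// Sketch: decide what to do with an HTTP status before retrying.
// The status list mirrors retryableErrors above; the category names are hypothetical.
type ErrorAction = 'retry' | 'fail';

const RETRYABLE_STATUSES = new Set([429, 500, 502, 503, 504]);

function classifyStatus(status: number): ErrorAction {
  return RETRYABLE_STATUSES.has(status) ? 'retry' : 'fail';
}

// Parse a Retry-After header given in delay-seconds form (e.g. "30").
// Returns null when the header is absent or not a non-negative integer,
// in which case the caller falls back to its own backoff schedule.
function retryAfterMs(headerValue: string | null): number | null {
  if (headerValue === null) return null;
  const trimmed = headerValue.trim();
  if (!/^\d+$/.test(trimmed)) return null; // HTTP-date form is not handled in this sketch
  return Number(trimmed) * 1000;
}

const rateLimited = classifyStatus(429);  // 'retry'
const unauthorized = classifyStatus(401); // 'fail'
const serverHint = retryAfterMs('30');    // 30000
```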

Pitfall 3: Ignoring Quality Differences Between Fallback Models

When you fall back from Gemini Pro to Gemini Flash, output quality can change significantly. Adapt your prompts during fallback — add more detailed instructions, enforce stricter output formats — to mitigate the quality drop.
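
As a rough sketch of prompt adaptation, the helper below appends stricter instructions as the quality tier drops. The tier names match `ModelConfig.qualityTier`; `adaptPrompt` and the instruction wording are assumptions for illustration and would need tuning against your own models.

```typescript
// Hypothetical prompt adapter: tighten instructions for lower quality tiers.
// The added wording is only an example, not a tested recipe.
type QualityTier = 'premium' | 'standard' | 'fallback';

const TIER_INSTRUCTIONS: Record<QualityTier, string> = {
  premium: '',
  standard: 'Work through the task step by step before giving the final answer.',
  fallback:
    'Work through the task step by step. ' +
    'Return only the final answer in the exact format requested, with no extra commentary.',
};

function adaptPrompt(prompt: string, tier: QualityTier): string {
  const extra = TIER_INSTRUCTIONS[tier];
  return extra === '' ? prompt : `${prompt}\n\n${extra}`;
}

const basePrompt = 'Summarize the diff as a bullet list.';
const primaryPrompt = adaptPrompt(basePrompt, 'premium');
const fallbackPrompt = adaptPrompt(basePrompt, 'fallback');
// primaryPrompt is unchanged; fallbackPrompt has the stricter instructions appended
```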

Pitfall 4: Not Considering Checkpoint Storage

If you're writing checkpoints to the filesystem, consider disk capacity and I/O performance. Many agents writing checkpoints simultaneously can create I/O bottlenecks. For production, consider Redis or a KV store.

Pitfall 5: Never Simulating Failures in Test Environments

Implementing resilience patterns without testing them under actual failure conditions is pointless. Adopt Chaos Engineering practices — deliberately inject API errors and timeouts in your test suite.

// Fault injection test — verify circuit breaker transitions
// Runs in Antigravity's test runner
 
async function testCircuitBreakerTransition(): Promise<void> {
  const breaker = new CircuitBreaker('test-breaker', {
    failureThreshold: 3,
    resetTimeoutMs: 1000,
    halfOpenMaxAttempts: 2,
    monitorWindowMs: 5000,
  });
 
  let callCount = 0;
  const failingOperation = async () => {
    callCount++;
    throw new Error('Simulated API failure');
  };
 
  // Should transition to Open after 3 failures
  for (let i = 0; i < 3; i++) {
    try {
      await breaker.execute(failingOperation);
    } catch { /* expected */ }
  }
 
  console.assert(
    breaker.getState() === 'OPEN',
    `Expected OPEN, got ${breaker.getState()}`
  );
  console.log(`✅ Circuit opened after ${callCount} failures`);
 
  // During Open state, fallback should be called
  const fallbackResult = await breaker.execute(
    failingOperation,
    async () => 'fallback-value'
  );
  console.assert(
    fallbackResult === 'fallback-value',
    'Fallback should be returned when circuit is OPEN'
  );
  console.log('✅ Fallback executed during OPEN state');
 
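  // After the reset timeout elapses, the breaker stays OPEN until the next execute() call
  await new Promise(r => setTimeout(r, 1100));
  console.assert(
    breaker.getState() === 'OPEN',
    'Should still be OPEN until next execute()'
  );
 
  // First successful probe: OPEN → HALF_OPEN. Second consecutive success: HALF_OPEN → CLOSED.
  await breaker.execute(async () => 'ok');
  console.assert(breaker.getState() === 'HALF_OPEN', `Expected HALF_OPEN, got ${breaker.getState()}`);
  await breaker.execute(async () => 'ok');
  console.assert(breaker.getState() === 'CLOSED', `Expected CLOSED, got ${breaker.getState()}`);
  console.log('✅ All circuit breaker transitions verified');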
}
 
testCircuitBreakerTransition();
// Expected:
// ✅ Circuit opened after 3 failures
// ✅ Fallback executed during OPEN state
// ✅ All circuit breaker transitions verified
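
For broader fault injection, one option is to wrap any async operation so that a configurable share of calls fail. The sketch below is a hypothetical helper, not an Antigravity feature; the `random` parameter is injectable so a test can script the failures deterministically.

```typescript
// Hypothetical fault injector: wraps an async operation and fails a share of calls.
// `random` is injectable so tests can be deterministic.
function withFaults<T>(
  operation: () => Promise<T>,
  failureRate: number,
  random: () => number = Math.random
): () => Promise<T> {
  return async () => {
    if (random() < failureRate) {
      // Attach a status so retry logic treats this like an HTTP 503
      throw Object.assign(new Error('Injected fault: 503'), { status: 503 });
    }
    return operation();
  };
}

// Deterministic demo: a scripted "random" source fails the first two calls
const rolls = [0.1, 0.2, 0.9];
let rollIndex = 0;
const flaky = withFaults(async () => 'ok', 0.5, () => rolls[rollIndex++] ?? 1);

async function runFaultDemo(): Promise<string[]> {
  const outcomes: string[] = [];
  for (let n = 0; n < 3; n++) {
    try {
      outcomes.push(await flaky());
    } catch (e) {
      outcomes.push((e as Error).message);
    }
  }
  return outcomes; // ['Injected fault: 503', 'Injected fault: 503', 'ok']
}
```

Passing `flaky` to `withRetry` or `breaker.execute` exercises the retry and breaker paths without touching a real API.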

Monitoring and Alerting — Minimizing Time from Detection to Recovery

Even with resilience patterns in place, you can't improve what you can't see. Continuous monitoring of agent health with early anomaly detection is essential.

Key Metrics for AI Agent Monitoring

AI agent monitoring demands different metrics than traditional web services.

// AI agent monitoring metrics collection
// In-memory metrics recording; forward the snapshot to OpenTelemetry or another backend as needed
 
interface AgentMetrics {
  // Request level
  requestCount: number;
  errorCount: number;
  retryCount: number;
  fallbackCount: number;
 
  // Latency
  p50LatencyMs: number;
  p95LatencyMs: number;
  p99LatencyMs: number;
 
  // Token consumption
  inputTokensTotal: number;
  outputTokensTotal: number;
  estimatedCostUsd: number;
 
  // Circuit breakers
  circuitOpenEvents: number;
  circuitRecoveryEvents: number;
  meanTimeToRecoveryMs: number;
 
  // Checkpoints
  checkpointSaveCount: number;
  checkpointResumeCount: number;
}
 
class MetricsCollector {
  private latencies: number[] = [];
  private metrics: AgentMetrics = {
    requestCount: 0,
    errorCount: 0,
    retryCount: 0,
    fallbackCount: 0,
    p50LatencyMs: 0,
    p95LatencyMs: 0,
    p99LatencyMs: 0,
    inputTokensTotal: 0,
    outputTokensTotal: 0,
    estimatedCostUsd: 0,
    circuitOpenEvents: 0,
    circuitRecoveryEvents: 0,
    meanTimeToRecoveryMs: 0,
    checkpointSaveCount: 0,
    checkpointResumeCount: 0,
  };
 
  recordRequest(latencyMs: number, success: boolean): void {
    this.metrics.requestCount++;
    this.latencies.push(latencyMs);
    if (!success) this.metrics.errorCount++;
    this.updatePercentiles();
  }
 
  recordRetry(): void { this.metrics.retryCount++; }
  recordFallback(): void { this.metrics.fallbackCount++; }
 
  recordTokenUsage(input: number, output: number, costPerMillionTokens: number): void {
    this.metrics.inputTokensTotal += input;
    this.metrics.outputTokensTotal += output;
    this.metrics.estimatedCostUsd +=
      ((input + output) / 1_000_000) * costPerMillionTokens;
  }
 
  private updatePercentiles(): void {
    const sorted = [...this.latencies].sort((a, b) => a - b);
    const len = sorted.length;
    if (len === 0) return;
    this.metrics.p50LatencyMs = sorted[Math.floor(len * 0.5)] ?? 0;
    this.metrics.p95LatencyMs = sorted[Math.floor(len * 0.95)] ?? 0;
    this.metrics.p99LatencyMs = sorted[Math.floor(len * 0.99)] ?? 0;
  }
 
  getSnapshot(): AgentMetrics {
    return { ...this.metrics };
  }
 
  evaluateAlerts(): string[] {
    const alerts: string[] = [];
    const errorRate = this.metrics.requestCount > 0
      ? this.metrics.errorCount / this.metrics.requestCount
      : 0;
 
    if (errorRate > 0.1) {
      alerts.push(`🔴 Error rate at ${(errorRate * 100).toFixed(1)}% (threshold: 10%)`);
    }
    if (this.metrics.p95LatencyMs > 30000) {
      alerts.push(`🟡 P95 latency at ${this.metrics.p95LatencyMs}ms (threshold: 30,000ms)`);
    }
    if (this.metrics.fallbackCount > this.metrics.requestCount * 0.2) {
      alerts.push(`🟠 Fallback rate exceeds 20% — check primary model health`);
    }
    return alerts;
  }
}
// Expected: Aggregates metrics and fires alerts when error rate > 10%,
// P95 latency > 30s, or fallback rate > 20%

Integrating All Patterns — A Production-Ready Agent Wrapper

Here's everything from this guide combined into a single agent wrapper you can use directly in your Antigravity project.

// Production-ready AI agent wrapper
// Integrates retry, circuit breaker, fallback, checkpoint, and monitoring
 
class ResilientAgent {
  private fallbackChain: ModelFallbackChain;
  private checkpoints: CheckpointManager;
  private degradation: GracefulDegradation;
  private metricsCollector: MetricsCollector;
  private circuitBreakers: Map<string, CircuitBreaker>;
 
  constructor(config: {
    models: ModelConfig[];
    checkpointDir?: string;
  }) {
    this.fallbackChain = new ModelFallbackChain(config.models);
    this.checkpoints = new CheckpointManager(config.checkpointDir);
    this.degradation = new GracefulDegradation();
    this.metricsCollector = new MetricsCollector();
    this.circuitBreakers = new Map(
      config.models.map(m => [m.id, m.circuitBreaker])
    );
  }
 
  async executeTask(taskId: string, steps: Array<{
    name: string;
    requiresAI: boolean;
    execute: (aiGenerate: (prompt: string) => Promise<string>) => Promise<unknown>;
  }>): Promise<{ success: boolean; results: Record<string, unknown> }> {
    const policy = this.degradation.evaluate(this.circuitBreakers);
    if (policy.level !== 'full') {
      console.warn(`[Agent] Running in ${policy.level} mode: ${policy.userMessage}`);
    }
 
    let checkpoint = await this.checkpoints.load(taskId);
    if (!checkpoint) {
      checkpoint = {
        taskId,
        currentStep: 0,
        totalSteps: steps.length,
        completedSteps: {},
        metadata: {},
        lastUpdated: new Date().toISOString(),
      };
      await this.checkpoints.save(checkpoint);
    }
 
    const results: Record<string, unknown> = {};
 
    for (const step of steps) {
      if (this.checkpoints.isStepCompleted(checkpoint, step.name)) {
        results[step.name] = checkpoint.completedSteps[step.name]?.result;
        continue;
      }
 
      if (step.requiresAI && !this.degradation.isFeatureEnabled('ai-generation')) {
        console.warn(`[Agent] Skipping AI step "${step.name}" — AI unavailable`);
        continue;
      }
 
      const start = Date.now();
      try {
        const aiGenerate = async (prompt: string): Promise<string> => {
          const callStart = Date.now(); // Time each AI call, not the whole step
          const result = await this.fallbackChain.generate(prompt);
          if (result.wasFallback) this.metricsCollector.recordFallback();
          this.metricsCollector.recordRequest(Date.now() - callStart, true);
          return result.data;
        };
 
        const result = await step.execute(aiGenerate);
        const duration = Date.now() - start;
 
        await this.checkpoints.markStepComplete(taskId, step.name, result, duration);
        results[step.name] = result;
        console.log(`✅ [${step.name}] ${duration}ms`);
      } catch (error) {
        this.metricsCollector.recordRequest(Date.now() - start, false);
        const alerts = this.metricsCollector.evaluateAlerts();
        if (alerts.length > 0) {
          console.error('[Alerts]', alerts.join('\n'));
        }
        throw error;
      }
    }
 
    return { success: true, results };
  }
}
 
// Usage
const agent = new ResilientAgent({
  models: [
    /* ModelConfig array from earlier */
  ],
  checkpointDir: '.antigravity/checkpoints',
});
// This wrapper lets you define resilient agent tasks
// without worrying about individual pattern implementations

Real-World Deployment Considerations

Before deploying these patterns to production, there are several practical considerations that documentation rarely covers.

Tuning Circuit Breaker Thresholds

The default thresholds in this guide are starting points, not production values. Your optimal failureThreshold depends on your traffic volume and acceptable error rate. A high-traffic agent processing thousands of requests per hour might set the threshold at 10-15 failures within a 30-second window. A low-traffic agent handling a few requests per minute might trip at just 2-3 failures.

The resetTimeoutMs should reflect how long your dependencies typically take to recover. For cloud LLM APIs like Gemini, 30-60 seconds is usually appropriate — rate limit windows typically reset in that range. For local models running on Ollama, recovery is often faster (restart takes 5-10 seconds), so a shorter timeout works.

Cost Implications of Fallback Chains

Model fallback has a direct cost impact that is easy to overlook. If your primary model (Gemini Pro) costs $7 per million tokens and your fallback (Gemini Flash) costs $0.30 per million tokens, a sustained fallback period actually saves money. But if your fallback is a premium model from another provider, costs can spike unexpectedly.

Track estimatedCostUsd in your metrics and set budget alerts. I have seen teams hit their monthly API budget in a single week because a circuit breaker misconfiguration kept routing traffic to an expensive fallback model.
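
A quick worked example with the illustrative prices above shows the scale of the difference. The monthly token volume and the 80% alert ratio are assumed values, and `estimateCostUsd` and `overBudget` are hypothetical helpers; real pricing usually differs for input and output tokens, so check your provider's price list.

```typescript
// Worked example using the illustrative prices from the text ($ per million tokens).
// The token volume and the alert ratio are assumptions for the example.
function estimateCostUsd(tokens: number, pricePerMillionUsd: number): number {
  return (tokens / 1_000_000) * pricePerMillionUsd;
}

const monthlyTokens = 50_000_000;
const primaryCost = estimateCostUsd(monthlyTokens, 7);    // 350
const fallbackCost = estimateCostUsd(monthlyTokens, 0.3); // about 15

// Hypothetical budget check: flag when spend crosses a share of the budget
function overBudget(spentUsd: number, budgetUsd: number, alertRatio: number = 0.8): boolean {
  return spentUsd >= budgetUsd * alertRatio;
}

const primaryAlert = overBudget(primaryCost, 400);   // true:  350 >= 320
const fallbackAlert = overBudget(fallbackCost, 400); // false: 15 < 320
```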

Checkpoint Storage in Serverless Environments

If you are deploying agents on Cloudflare Workers, Vercel Edge Functions, or similar serverless platforms, filesystem-based checkpoints will not work since there is no persistent local storage. Use Cloudflare KV, Upstash Redis, or DynamoDB as your checkpoint backend instead. The CheckpointManager class confines its file I/O to load and save, so the swap is contained: replace the readFile, writeFile, and mkdir calls with your storage client of choice.
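
One way to prepare for that swap is to put a small storage interface between the checkpoint logic and the backend. The sketch below is an assumption about how that could look, not the author's implementation: `CheckpointStore`, `MemoryCheckpointStore`, `loadCheckpoint`, and `saveCheckpoint` are hypothetical names, and a KV or Redis client would be wrapped to implement the same two methods.

```typescript
// Hypothetical storage abstraction for checkpoints. The interface and class names
// are illustrative; a KV or Redis client would implement the same two methods.
interface CheckpointStore {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// In-memory implementation, useful for tests and as a template for a real backend
class MemoryCheckpointStore implements CheckpointStore {
  private data = new Map<string, string>();

  async get(key: string): Promise<string | null> {
    return this.data.get(key) ?? null;
  }

  async put(key: string, value: string): Promise<void> {
    this.data.set(key, value);
  }
}

// load/save written against the interface instead of fs/promises
async function loadCheckpoint<T>(store: CheckpointStore, taskId: string): Promise<T | null> {
  const raw = await store.get(`checkpoint:${taskId}`);
  return raw === null ? null : (JSON.parse(raw) as T);
}

async function saveCheckpoint<T>(store: CheckpointStore, taskId: string, data: T): Promise<void> {
  await store.put(`checkpoint:${taskId}`, JSON.stringify(data));
}
```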

Your Next Step

You could copy all the code from this guide into your project right now, but I'd recommend starting with the single highest-impact pattern. In my experience, many production agent failures come from calling LLM APIs without any retry logic at all. Add the withRetry function to your project and wrap every LLM API call with it. That alone should noticeably reduce your middle-of-the-night incident count.

Circuit breakers and fallback chains make sense when you move to multi-agent architectures. For a single-agent setup, the added complexity isn't justified. Add these patterns incrementally, as the need arises — that's the practical approach.

For more on production AI agent design patterns, check out Common Pitfalls in Multi-Agent Design and How to Fix Them and The Complete Guide to Production-Ready AI Agent Design.
