Antigravity Lab
Agents & Manager · 2026-03-30 · Advanced

Antigravity × Durable Execution: Designing Fault-Tolerant Long-Running AI Tasks

An implementation guide for putting Antigravity agents to work on long-running jobs using Durable Execution — covering checkpointing, idempotency, and automatic retries, plus three real incidents from a 50M-download indie app and seven production pitfalls the official docs never spell out.

Tags: antigravity, durable-execution, ai-agent, typescript, fault-tolerance, workflow


One morning the pipeline that aggregates yesterday's AdMob revenue died ten minutes from the finish line — an hour and a half of work, gone. I gave up on the spot and re-kicked the same job from my phone on the train into the office. I've been shipping indie apps since 2014, and the family of wallpaper and ambient apps I run has now passed 50 million downloads. Every morning that revenue gets reconciled between AdMob, App Store Connect, and Google Play. Before I introduced Durable Execution, the "almost made it" failure mode was just brutal: AdMob rate limits, Cloudflare Workers timeouts, the occasional Supabase blip — each of them meant starting from scratch.

I'm Masaki Hirokawa (@dolice). Alongside my artist practice, I've been pushing Antigravity's agent features into long-running jobs in my indie developer workflow, and Durable Execution has been the design pattern that pays back hardest. This guide walks through the implementation code, three real incidents from running a 50M-download portfolio, and seven production pitfalls the official docs never spell out — the kind of material I think actually earns being behind the membership wall.

The Three Principles That Make Durable Execution Work

Before any code, the three principles. Everything else falls out of these — and they map cleanly onto Antigravity agent design too.

Checkpointing for State Persistence

Every time a workflow step completes successfully, its result is saved to durable storage: a database, a queue, Cloudflare KV, anything that survives a restart. If the process crashes, the next run resumes from the most recent checkpoint instead of replaying every step. For my AdMob aggregation pipeline, this single change shrank average recovery time from 47 minutes to 4 minutes, roughly a 12x improvement.

Idempotency by Design

Operations must produce the same result whether they're called once or ten times. Without this, retries lead to duplicate writes, duplicate emails, double-charged customers. Missing idempotency in payment or notification paths is the kind of bug you only catch by getting publicly embarrassed.
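
A minimal sketch of that property, assuming a hypothetical `IdempotencyGuard` class and an in-memory `Set` standing in for durable storage. None of these names come from Antigravity or AdMob. In production the seen-keys set would more likely be a database table with a unique constraint.

```typescript
// Hypothetical idempotency guard: names are illustrative, not a real API.
class IdempotencyGuard {
  // In-memory stand-in for durable storage (assumption for this sketch)
  private seen = new Set<string>();

  // Runs `effect` only the first time `key` is seen; returns whether it ran
  runOnce(key: string, effect: () => void): boolean {
    if (this.seen.has(key)) return false;
    effect();
    // Record the key only after the effect succeeded, so a failed
    // attempt can still be retried
    this.seen.add(key);
    return true;
  }
}

// A retry storm: the same logical notification is attempted three times
const guard = new IdempotencyGuard();
const sent: string[] = [];
// The key is derived from the job itself, never from the attempt number
const key = "daily-report:2026-03-29";

for (let attempt = 1; attempt <= 3; attempt++) {
  guard.runOnce(key, () => sent.push(`report for ${key}`));
}

console.log(sent.length); // 1
```

The important design choice is the key: it has to identify the logical operation (this report, this charge), so every retry of the same operation maps to the same key.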

Automatic Retry with Backoff

Transient failures (network timeouts, rate limits) should be retried with exponential backoff. Permanent failures should bubble up. The distinction matters: for AdMob, UNAVAILABLE and RESOURCE_EXHAUSTED are retryable, but PERMISSION_DENIED should fail fast — don't burn quota on something that will never succeed.
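
The retryable-versus-permanent split can be sketched as a small classifier in front of the backoff loop. This is an illustration under assumptions: the `ApiError` shape, `isRetryable`, and `withRetry` are hypothetical names, and only the three status codes come from the paragraph above. The `sleep` parameter is injectable so the delay can be stubbed in tests.

```typescript
// Hypothetical error shape carrying a gRPC-style status code (assumption)
interface ApiError {
  code: string;
}

function hasCode(error: unknown): error is ApiError {
  return (
    typeof error === "object" &&
    error !== null &&
    typeof (error as { code?: unknown }).code === "string"
  );
}

// Codes treated as transient; anything else fails fast
const RETRYABLE_CODES = ["UNAVAILABLE", "RESOURCE_EXHAUSTED"];

function isRetryable(error: unknown): boolean {
  return hasCode(error) && RETRYABLE_CODES.includes(error.code);
}

async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Permanent failure, or out of attempts: surface it immediately
      if (!isRetryable(error) || attempt >= maxAttempts) throw error;
      // Exponential backoff: baseMs, 2x, 4x, ... capped at 30 seconds
      await sleep(Math.min(baseMs * 2 ** (attempt - 1), 30000));
    }
  }
}

// Usage: fails twice with a transient code, then succeeds
let demoCalls = 0;
withRetry(
  async () => {
    demoCalls++;
    if (demoCalls < 3) throw { code: "UNAVAILABLE" };
    return "report fetched";
  },
  5,
  1 // 1 ms base delay keeps the demo fast
).then((result) => console.log(result, `after ${demoCalls} calls`));
```

A `PERMISSION_DENIED` error thrown from `fn` would be rethrown on the first attempt without sleeping, which is the fail-fast behavior described above.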

Implementing the Minimal Durable Workflow in Antigravity

Let's build a TypeScript Durable Execution skeleton that fetches data from an external API, transforms it, and persists the result.

// durable-workflow.ts
// Minimal implementation of the Durable Execution pattern
import * as fs from "node:fs";
import * as path from "node:path";
 
interface WorkflowState {
  currentStep: string;
  completedSteps: string[];
  data: Record<string, unknown>;
  retryCount: number;
  lastCheckpoint: string;
}
 
class DurableWorkflow {
  private state: WorkflowState;
  private storePath: string;
 
  constructor(workflowId: string) {
    this.storePath = `./workflow-state/${workflowId}.json`;
    this.state = this.loadState();
  }
 
  // Restore state from a checkpoint, or start fresh if none exists
  private loadState(): WorkflowState {
    try {
      if (fs.existsSync(this.storePath)) {
        const saved = JSON.parse(
          fs.readFileSync(this.storePath, "utf-8")
        ) as WorkflowState;
        console.log(`Resuming from checkpoint: ${saved.lastCheckpoint}`);
        return saved;
      }
    } catch (e) {
      // Reached only when the file exists but cannot be read or parsed
      console.warn("Checkpoint unreadable, starting fresh:", e);
    }
    return {
      currentStep: "init",
      completedSteps: [],
      data: {},
      retryCount: 0,
      lastCheckpoint: "",
    };
  }
 
  // Persist a checkpoint
  private saveCheckpoint(stepName: string): void {
    fs.mkdirSync(path.dirname(this.storePath), { recursive: true });
    this.state.lastCheckpoint = stepName;
    this.state.completedSteps.push(stepName);
    // Write to a temp file, then rename, so a crash mid-write
    // cannot leave a truncated checkpoint behind
    const tmpPath = `${this.storePath}.tmp`;
    fs.writeFileSync(tmpPath, JSON.stringify(this.state, null, 2));
    fs.renameSync(tmpPath, this.storePath);
    console.log(`Checkpoint saved: ${stepName}`);
  }
 
  // Run a step idempotently
  async executeStep<T>(
    stepName: string,
    fn: () => Promise<T>,
    maxRetries = 3
  ): Promise<T> {
    // Skip completed steps
    if (this.state.completedSteps.includes(stepName)) {
      console.log(`Skipping completed step: ${stepName}`);
      return this.state.data[stepName] as T;
    }
 
    this.state.currentStep = stepName;
 
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      try {
        const result = await fn();
        this.state.data[stepName] = result;
        this.saveCheckpoint(stepName);
        return result;
      } catch (error) {
        // Note: this minimal version retries every error, transient or not
        console.error(
          `Step "${stepName}" failed (attempt ${attempt}/${maxRetries}):`,
          error
        );
        if (attempt === maxRetries) throw error;
        // Exponential backoff: 2s, 4s, 8s, ... capped at 30s
        const waitMs = Math.min(1000 * Math.pow(2, attempt), 30000);
        console.log(`Retrying in ${waitMs}ms...`);
        await new Promise((r) => setTimeout(r, waitMs));
      }
    }
    throw new Error(`Step "${stepName}" exhausted all retries`);
  }
}
 
// Example: a multi-step data transformation pipeline
async function runDataPipeline() {
  const workflow = new DurableWorkflow("data-pipeline-001");
 
  // Step 1: fetch (resumes from checkpoint on failure)
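  const rawData = await workflow.executeStep("fetch-data", async () => {
    const res = await fetch("https://api.example.com/large-dataset");
    // fetch() resolves on HTTP error statuses too, so throw to trigger a retry
    if (!res.ok) throw new Error(`Fetch failed with HTTP ${res.status}`);
    return res.json();
    // Expected output: { records: [...], total: 10000 }
  });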
 
  // Step 2: transform
  const transformed = await workflow.executeStep("transform", async () => {
    return rawData.records.map((r: any) => ({
      id: r.id,
      normalized: r.value.toLowerCase().trim(),
      processedAt: new Date().toISOString(),
    }));
    // Expected output: [{ id: "1", normalized: "...", processedAt: "..." }, ...]
  });
 
  // Step 3: batch save
  const result = await workflow.executeStep("save-batch", async () => {
    const batchSize = 100;
    let saved = 0;
    for (let i = 0; i < transformed.length; i += batchSize) {
      const batch = transformed.slice(i, i + batchSize);
      await saveBatch(batch); // idempotent upsert
      saved += batch.length;
    }
    return { totalSaved: saved };
    // Expected output: { totalSaved: 10000 }
  });
 
  console.log("Pipeline complete:", result);
}
 
async function saveBatch(batch: any[]) {
  // Idempotent upsert keyed by record ID
  console.log(`Saving batch of ${batch.length} records`);
}
 
runDataPipeline().catch(console.error);

Each step saves a checkpoint to the filesystem on success. When the process restarts, completed steps are skipped automatically and execution resumes from where it stopped. In a serverless environment local FS is ephemeral, so swap the storage backend for KV or a relational DB before going to production.
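
One way to make that swap less painful is to hide storage behind a small interface. The sketch below assumes a hypothetical `CheckpointStore` interface and a `runStep` helper; neither is part of the listing above. The in-memory class is only a stand-in, and no real Cloudflare KV or database client API is shown. A production backend would implement the same two methods.

```typescript
// Hypothetical storage abstraction (assumption, not from the listing above)
interface CheckpointStore {
  load(workflowId: string): Promise<string | null>;
  save(workflowId: string, serializedState: string): Promise<void>;
}

// In-memory stand-in, handy for tests; swap for a KV- or DB-backed class
class MemoryStore implements CheckpointStore {
  private entries = new Map<string, string>();

  async load(workflowId: string): Promise<string | null> {
    return this.entries.get(workflowId) ?? null;
  }

  async save(workflowId: string, serializedState: string): Promise<void> {
    this.entries.set(workflowId, serializedState);
  }
}

// Runs `fn` unless the store already holds a result for this step
async function runStep<T>(
  store: CheckpointStore,
  workflowId: string,
  stepName: string,
  fn: () => Promise<T>
): Promise<T> {
  const raw = await store.load(workflowId);
  const results: Record<string, unknown> = raw ? JSON.parse(raw) : {};
  // A stored result means the step already completed in an earlier run
  if (stepName in results) return results[stepName] as T;
  const result = await fn();
  results[stepName] = result;
  await store.save(workflowId, JSON.stringify(results));
  return result;
}
```

Because the workflow logic only sees `load` and `save`, the same step code can run against the filesystem locally and against a durable backend in production. Results must be JSON-serializable for this sketch to round-trip them.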

WHAT YOU'LL LEARN
Three real AdMob pipeline incidents from a 50M-download indie app — recovery time cut from 47 minutes to 4 minutes
Seven production pitfalls the official docs never spell out (checkpoint bloat, double notifications, rate-limit blast radius, partial recovery, time drift, cold start, monitoring blind spots)
A four-step prompt template for getting Antigravity to generate durable workflows that actually survive real failure modes