Agents & Manager · 2026-03-30 · Advanced

Antigravity × Durable Execution: Designing Fault-Tolerant Long-Running AI Tasks

An implementation guide for putting Antigravity agents to work on long-running jobs using Durable Execution — covering checkpointing, idempotency, and automatic retries, plus three real incidents from a 50M-download indie app and seven production pitfalls the official docs never spell out.

One morning the pipeline that aggregates yesterday's AdMob revenue died ten minutes from the finish line — an hour and a half of work, gone. I gave up on the spot and re-kicked the same job from my phone on the train into the office. I've been shipping indie apps since 2014, and the family of wallpaper and ambient apps I run has now passed 50 million downloads. Every morning that revenue gets reconciled between AdMob, App Store Connect, and Google Play. Before I introduced Durable Execution, the "almost made it" failure mode was just brutal: AdMob rate limits, Cloudflare Workers timeouts, the occasional Supabase blip — each of them meant starting from scratch.

I'm Masaki Hirokawa (@dolice). Alongside my artist practice, I've been pushing Antigravity's agent features into long-running jobs in my indie developer workflow, and Durable Execution has been the design pattern that pays back hardest. This guide walks through the implementation code, three real incidents from running a 50M-download portfolio, and seven production pitfalls the official docs never spell out — the kind of material I think actually earns being behind the membership wall.

The Three Principles That Make Durable Execution Work

Before any code, the three principles. Everything else falls out of these — and they map cleanly onto Antigravity agent design too.

Checkpointing for State Persistence

Every time a workflow step completes successfully, its result is saved to durable storage: a database, a queue, Cloudflare KV, anything that survives a restart. If the process crashes, the next run resumes from the most recent checkpoint instead of replaying every step. For my AdMob aggregation pipeline, this single change shrank average recovery time from 47 minutes to 4 minutes, roughly a 12x improvement.

Idempotency by Design

Operations must produce the same result whether they're called once or ten times. Without this, retries lead to duplicate writes, duplicate emails, double-charged customers. Missing idempotency in payment or notification paths is the kind of bug you only catch by getting publicly embarrassed.

Automatic Retry with Backoff

Transient failures (network timeouts, rate limits) should be retried with exponential backoff. Permanent failures should bubble up. The distinction matters: for AdMob, UNAVAILABLE and RESOURCE_EXHAUSTED are retryable, but PERMISSION_DENIED should fail fast — don't burn quota on something that will never succeed.
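
The retryable-versus-permanent split can be sketched as a small wrapper. This is a minimal sketch, not AdMob client code: the status field on the error and the names isRetryable, backoffMs, and withRetry are assumptions made for illustration.

```typescript
// Hypothetical error shape: assumes the API client attaches a gRPC-style
// status string to the errors it throws.
type ApiError = Error & { status?: string };

const RETRYABLE_STATUSES = new Set(["UNAVAILABLE", "RESOURCE_EXHAUSTED"]);

function isRetryable(err: unknown): boolean {
  const status = (err as ApiError | undefined)?.status;
  return typeof status === "string" && RETRYABLE_STATUSES.has(status);
}

// Exponential backoff: baseMs, 2x baseMs, 4x baseMs, ... capped at capMs.
function backoffMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(baseMs * Math.pow(2, attempt - 1), capMs);
}

async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  // Injectable so tests can skip real waiting
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Permanent errors and exhausted budgets bubble up immediately
      if (!isRetryable(err) || attempt >= maxAttempts) throw err;
      await sleep(backoffMs(attempt));
    }
  }
}
```

With this shape, a PERMISSION_DENIED error costs exactly one call, while UNAVAILABLE gets up to maxAttempts tries with growing waits.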

Implementing the Minimal Durable Workflow in Antigravity

Let's build a TypeScript Durable Execution skeleton that fetches data from an external API, transforms it, and persists the result.

// durable-workflow.ts
// Minimal implementation of the Durable Execution pattern

import * as fs from "fs";
import * as path from "path";

interface WorkflowState {
  currentStep: string;
  completedSteps: string[];
  data: Record<string, unknown>;
  retryCount: number;
  lastCheckpoint: string;
}

class DurableWorkflow {
  private state: WorkflowState;
  private storePath: string;

  constructor(workflowId: string) {
    this.storePath = `./workflow-state/${workflowId}.json`;
    this.state = this.loadState();
  }

  // Restore state from a checkpoint, or start fresh if none exists
  private loadState(): WorkflowState {
    try {
      if (fs.existsSync(this.storePath)) {
        const saved = JSON.parse(fs.readFileSync(this.storePath, "utf-8"));
        console.log(`Resuming from checkpoint: ${saved.lastCheckpoint}`);
        return saved;
      }
    } catch (e) {
      // Only reached when the file exists but cannot be read or parsed
      console.warn("Checkpoint unreadable, starting fresh:", e);
    }
    return {
      currentStep: "init",
      completedSteps: [],
      data: {},
      retryCount: 0,
      lastCheckpoint: "",
    };
  }

  // Persist a checkpoint
  private saveCheckpoint(stepName: string): void {
    fs.mkdirSync(path.dirname(this.storePath), { recursive: true });
    this.state.lastCheckpoint = stepName;
    this.state.completedSteps.push(stepName);
    // Write to a temp file, then rename, so a crash mid-write cannot
    // leave a truncated checkpoint behind
    const tmpPath = `${this.storePath}.tmp`;
    fs.writeFileSync(tmpPath, JSON.stringify(this.state, null, 2));
    fs.renameSync(tmpPath, this.storePath);
    console.log(`Checkpoint saved: ${stepName}`);
  }
 
  // Run a step idempotently
  async executeStep<T>(
    stepName: string,
    fn: () => Promise<T>,
    maxRetries = 3
  ): Promise<T> {
    // Skip completed steps
    if (this.state.completedSteps.includes(stepName)) {
      console.log(`Skipping completed step: ${stepName}`);
      return this.state.data[stepName] as T;
    }
 
    this.state.currentStep = stepName;
 
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      try {
        const result = await fn();
        this.state.data[stepName] = result;
        this.saveCheckpoint(stepName);
        return result;
      } catch (error) {
        const waitMs = Math.min(1000 * Math.pow(2, attempt), 30000);
        console.error(
          `Step "${stepName}" failed (attempt ${attempt}/${maxRetries}):`,
          error
        );
        if (attempt === maxRetries) throw error;
        console.log(`Retrying in ${waitMs}ms...`);
        await new Promise((r) => setTimeout(r, waitMs));
      }
    }
    throw new Error(`Step "${stepName}" exhausted all retries`);
  }
}
 
// Example: a multi-step data transformation pipeline
async function runDataPipeline() {
  const workflow = new DurableWorkflow("data-pipeline-001");
 
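  // Step 1: fetch (retried on failure; skipped on resume once checkpointed)
  const rawData = await workflow.executeStep("fetch-data", async () => {
    const res = await fetch("https://api.example.com/large-dataset");
    // fetch() only rejects on network errors, so surface HTTP errors explicitly
    if (!res.ok) throw new Error(`Fetch failed: HTTP ${res.status}`);
    return res.json();
    // Expected output: { records: [...], total: 10000 }
  });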
 
  // Step 2: transform
  const transformed = await workflow.executeStep("transform", async () => {
    return rawData.records.map((r: any) => ({
      id: r.id,
      normalized: r.value.toLowerCase().trim(),
      processedAt: new Date().toISOString(),
    }));
    // Expected output: [{ id: "1", normalized: "...", processedAt: "..." }, ...]
  });
 
  // Step 3: batch save
  const result = await workflow.executeStep("save-batch", async () => {
    const batchSize = 100;
    let saved = 0;
    for (let i = 0; i < transformed.length; i += batchSize) {
      const batch = transformed.slice(i, i + batchSize);
      await saveBatch(batch); // idempotent upsert
      saved += batch.length;
    }
    return { totalSaved: saved };
    // Expected output: { totalSaved: 10000 }
  });
 
  console.log("Pipeline complete:", result);
}
 
async function saveBatch(batch: any[]) {
  // Idempotent upsert keyed by record ID
  console.log(`Saving batch of ${batch.length} records`);
}
 
runDataPipeline().catch(console.error);

Each step saves a checkpoint to the filesystem on success. When the process restarts, completed steps are skipped automatically and execution resumes from where it stopped. In a serverless environment local FS is ephemeral, so swap the storage backend for KV or a relational DB before going to production.
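
One way to make that swap painless is to put the storage behind a two-method interface. The sketch below is an assumption about how you might structure it, not part of any SDK: CheckpointStore, MemoryCheckpointStore, and runStoredStep are illustrative names, and the in-memory class only stands in for a real KV or database adapter.

```typescript
// Hypothetical storage abstraction; these names are illustrative, not from any SDK.
interface CheckpointStore {
  load(workflowId: string): Promise<string | null>;
  save(workflowId: string, serialized: string): Promise<void>;
}

// In-memory stand-in for local runs and tests. A production implementation
// would wrap a KV namespace or a database table behind the same two methods.
class MemoryCheckpointStore implements CheckpointStore {
  private rows = new Map<string, string>();

  async load(workflowId: string): Promise<string | null> {
    return this.rows.get(workflowId) ?? null;
  }

  async save(workflowId: string, serialized: string): Promise<void> {
    this.rows.set(workflowId, serialized);
  }
}

interface StoredState {
  completedSteps: string[];
  data: Record<string, unknown>;
}

// Same skip-if-completed logic as executeStep, minus the retry loop.
async function runStoredStep<T>(
  store: CheckpointStore,
  workflowId: string,
  stepName: string,
  fn: () => Promise<T>
): Promise<T> {
  const raw = await store.load(workflowId);
  const state: StoredState = raw
    ? (JSON.parse(raw) as StoredState)
    : { completedSteps: [], data: {} };

  if (state.completedSteps.includes(stepName)) {
    return state.data[stepName] as T;
  }

  const result = await fn();
  state.data[stepName] = result;
  state.completedSteps.push(stepName);
  await store.save(workflowId, JSON.stringify(state));
  return result;
}
```

Because the workflow only sees load and save, moving from local files to KV becomes a change in one adapter class rather than in every step.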

Three Real Incidents the AdMob Pipeline Survived Thanks to Durable Execution

Time for the actual war stories. The revenue analytics pipeline for my wallpaper apps (50M+ downloads cumulative) kicks off every morning at 04:30 JST and runs the following:

1. Fetch country-level eCPM, impressions, and revenue from the AdMob Reporting API (~11 minutes)
2. Fetch install counts from App Store Connect and Google Play Console (~6 minutes)
3. Upsert into the Supabase daily_revenue table (~2 minutes)
4. Run anomaly detection (flag any country that moved more than ±30% week-over-week)
5. Post the daily report to Slack

Under the old architecture, a failure in any single step meant a full rerun of 47 minutes. With Durable Execution, only the failed step and the steps downstream of it re-run, and average recovery time drops to around 4 minutes. Three incidents I remember vividly:

Incident 1: AdMob `RESOURCE_EXHAUSTED` at the 36-Minute Mark

The AdMob Reporting API returns RESOURCE_EXHAUSTED when the daily quota gets hit. We were 36 minutes in, Step 1 almost done. Under the old design this would have been a 70-plus-minute disaster. With Durable Execution, the checkpoint was sliced per country chunk, so I just re-fetched the 12 missing countries in the next quota window — total downtime 8 minutes.
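
Per-chunk checkpointing of this kind can be sketched as follows. It is a simplified illustration, not the production pipeline: the done set stands in for the persisted checkpoint, and fetchChunk is a hypothetical callback wrapping the real reporting call.

```typescript
// Split a list into fixed-size chunks.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Fetch countries chunk by chunk, recording each finished chunk in `done`.
// `done` stands in for the persisted checkpoint; `fetchChunk` is a
// hypothetical callback wrapping the real reporting call.
async function fetchByChunk(
  countries: string[],
  chunkSize: number,
  done: Set<string>,
  fetchChunk: (codes: string[]) => Promise<void>
): Promise<void> {
  for (const codes of chunk(countries, chunkSize)) {
    const key = codes.join(",");
    if (done.has(key)) continue; // already checkpointed on an earlier run
    await fetchChunk(codes); // a quota error here leaves `done` intact
    done.add(key);
  }
}
```

When the quota error lands mid-run, the next invocation with the same done set only touches the chunks that never finished.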

Incident 2: App Store Connect JWT Expiry

The JWT for App Store Connect has a 20-minute lifetime. I forgot to refresh it and Step 2 died with a 401 partway through. Because checkpoints were keyed per app ID, I just regenerated the token and re-fetched the remaining apps: a 16-minute interruption, not 90 minutes. Lesson learned: token generation is now its own first executeStep, and the token gets stored in state.data so the rest of the workflow shares it.
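
One caveat with caching a short-lived token in a checkpoint: a run resumed more than 20 minutes later would read back an expired token. A minimal sketch of an expiry-aware cache follows. CachedToken, needsRefresh, and getToken are illustrative names, and mint is a hypothetical callback that signs a fresh JWT.

```typescript
interface CachedToken {
  value: string;
  expiresAtMs: number; // absolute expiry, epoch milliseconds
}

// True when there is no token or it expires within the safety margin.
function needsRefresh(
  token: CachedToken | undefined,
  nowMs: number,
  marginMs = 60000
): boolean {
  return !token || token.expiresAtMs - nowMs <= marginMs;
}

// `mint` is a hypothetical callback that signs a fresh JWT.
// `cache` stands in for the slot in state.data that holds the token.
async function getToken(
  cache: { token?: CachedToken },
  mint: (nowMs: number) => Promise<CachedToken>,
  nowMs: number = Date.now()
): Promise<string> {
  let token = cache.token;
  if (!token || needsRefresh(token, nowMs)) {
    token = await mint(nowMs);
    cache.token = token;
  }
  return token.value;
}
```

Calling getToken before each request means a resumed workflow mints a new token on its own instead of failing with a 401.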

Incident 3: Supabase Connection Pool Exhaustion

I'd cranked up parallelism on the anomaly detection job (Step 4) and exhausted the Supabase connection pool. Old world: rerun from Step 1. With Durable Execution, Steps 1–3 stayed cached, I dropped Step 4's parallelism from 12 to 4, and the entire recovery took 3 minutes.
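
Capping parallelism like this does not need a library. Below is a minimal sketch of a bounded-concurrency map; the name mapWithConcurrency is my own, and the real job would pass its per-country query as fn.

```typescript
// Run fn over items with at most `limit` calls in flight at once.
// Results keep the order of the input array.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker pulls the next unclaimed index until the list is drained.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Keeping limit in configuration rather than in code makes a change like 12 to 4 a one-line fix during an incident.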

None of these are the kind of failures you'll find in the SDK README. They're the failures you only meet after you go live. How you carve up checkpoint granularity ahead of time will change your recovery time by a factor of 10.

Trigger.dev for Production-Grade Durable Execution

In production, rolling your own checkpoint manager is a tax you'll regret. Trigger.dev is a TypeScript-native Durable Execution platform that pairs beautifully with Antigravity. For my own setup, I/O-heavy jobs like AdMob aggregation live on Trigger.dev, while lightweight anomaly batches run on Cloudflare Workers + KV. Picking the right tool per job matters more than picking the "best" one overall.

// trigger/ai-processing.ts
// Durable Execution workflow with Trigger.dev v3
 
import { task, wait, retry } from "@trigger.dev/sdk/v3";
 
export const processLargeCodebase = task({
  id: "process-large-codebase",
  // Max execution: 2 hours
  maxDuration: 7200,
  retry: {
    maxAttempts: 5,
    factor: 2,
    minTimeoutInMs: 1000,
    maxTimeoutInMs: 60000,
  },
  run: async (payload: { repoUrl: string; branch: string }) => {
    // Step 1: clone
    const cloneResult = await cloneRepo(payload.repoUrl, payload.branch);
    // Note: as I understand the v3 model, plain function calls inside run()
    // are not memoized. A retried attempt starts run() again from the top.

    // Step 2: per-file analysis (batched)
    const files = await listSourceFiles(cloneResult.path);
    const analysisResults = [];

    for (const file of files) {
      // retry.onThrow retries just this call in place, without failing the run
      const analysis = await retry.onThrow(
        async () => {
          return await analyzeFileWithAI(file);
        },
        { maxAttempts: 3, randomize: true }
      );
      analysisResults.push(analysis);
    }
 
    // Step 3: report generation
    const report = generateReport(analysisResults);
 
    // Step 4: notification
    await sendNotification({
      type: "analysis-complete",
      summary: report.summary,
      issuesFound: report.issues.length,
    });
 
    return report;
  },
});
 
async function cloneRepo(url: string, branch: string) {
  return { path: `/tmp/repo-${Date.now()}` };
}
 
async function listSourceFiles(repoPath: string) {
  return ["src/index.ts", "src/utils.ts", "src/api/routes.ts"];
}
 
async function analyzeFileWithAI(filePath: string) {
  return { file: filePath, issues: [], score: 85 };
}
 
function generateReport(results: any[]) {
  return {
    summary: `Analyzed ${results.length} files`,
    issues: results.flatMap((r) => r.issues),
    averageScore: results.reduce((s, r) => s + r.score, 0) / results.length,
  };
}
 
async function sendNotification(data: any) {
  console.log("Notification sent:", data);
}

With Trigger.dev you don't hand-roll a checkpoint manager. One caveat, as far as I understand the v3 model: a run is checkpointed while it waits (for example on wait.for, or on a child task called with triggerAndWait), and a retried attempt starts the run function again from the top. For step-level resume, split the steps into child tasks or keep each step idempotent. As a rule of thumb from running this stack as an indie developer: once a job crosses 30 minutes of runtime or 5+ distinct external API calls, switching from a custom loop to Trigger.dev / Inngest / Temporal pays back in development time almost immediately.

Idempotency Patterns That Actually Hold Up

The single most overlooked piece of Durable Execution is idempotency. Without it, retries silently double everything.

// idempotency-patterns.ts
// Practical idempotency patterns

// Pattern 1: dedupe via idempotency key
// Note: this Set lives in process memory, so it is lost on restart.
// Back it with a DB table or KV store for anything that must survive a crash.
class IdempotentExecutor {
  private processedKeys = new Set<string>();

  async execute(
    idempotencyKey: string,
    operation: () => Promise<void>
  ): Promise<void> {
    if (this.processedKeys.has(idempotencyKey)) {
      console.log(`Already processed: ${idempotencyKey}`);
      return;
    }

    await operation();
    // Recorded only after success, so a failed operation can be retried
    this.processedKeys.add(idempotencyKey);
  }
}
 
// Pattern 2: idempotent DB writes via upsert
async function upsertRecord(db: any, record: { id: string; data: any }) {
  await db.query(
    `INSERT INTO records (id, data, updated_at)
     VALUES ($1, $2, NOW())
     ON CONFLICT (id) DO UPDATE SET data = $2, updated_at = NOW()`,
    [record.id, JSON.stringify(record.data)]
  );
  // Expected: one row inserted or updated, never duplicated
}
 
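// Pattern 3: idempotent payments via transaction ID
// Stand-in for the real Stripe client so this file compiles on its own
const stripe = { paymentIntents: { create: async (...args: any[]) => ({}) } };

async function processPayment(orderId: string, amount: number) {
  const transactionId = `txn_${orderId}_${amount}`;
  const payment = await stripe.paymentIntents.create(
    {
      amount: amount,
      currency: "jpy",
      metadata: { orderId },
    },
    {
      idempotencyKey: transactionId,
    }
  );
  return payment;
  // Repeating the call with the same key returns the original result instead
  // of creating a second PaymentIntent. Stripe only guarantees to keep keys
  // for a limited time (at least 24 hours), so this covers retries, not
  // reruns days later.
}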

In practice: I write AdMob revenue into Supabase keyed by (date, country, app_id) so reruns just upsert the same rows, and the Slack daily report is gated by a successful INSERT into report_sent_log keyed by (date). With those two patterns in place, double-notification incidents went to zero.

Integration Patterns with Antigravity Agents

Combining Antigravity's multi-agent features with Durable Execution lets you orchestrate multiple AI agents on long-running jobs without losing intermediate work.

// agent-durable-workflow.ts
// Durable management of Antigravity agent work

import * as fs from "fs";
import * as path from "path";

interface AgentTask {
  id: string;
  type: "code-review" | "refactor" | "test-gen";
  targetFiles: string[];
  status: "pending" | "running" | "completed" | "failed";
  result?: unknown;
}

class DurableAgentOrchestrator {
  private tasks: Map<string, AgentTask> = new Map();
  private checkpointFile: string;

  constructor(sessionId: string) {
    this.checkpointFile = `.agent-state/${sessionId}.json`;
    this.restore();
  }

  private restore(): void {
    try {
      if (fs.existsSync(this.checkpointFile)) {
        const saved = JSON.parse(
          fs.readFileSync(this.checkpointFile, "utf-8")
        );
        this.tasks = new Map(Object.entries(saved.tasks));
        console.log(
          `Restored ${this.tasks.size} tasks from checkpoint`
        );
      }
    } catch (e) {
      // Unreadable checkpoint: log it and start a fresh session
      console.warn("Could not restore checkpoint, starting fresh:", e);
    }
  }

  private checkpoint(): void {
    fs.mkdirSync(path.dirname(this.checkpointFile), { recursive: true });
    const serialized = {
      tasks: Object.fromEntries(this.tasks),
      savedAt: new Date().toISOString(),
    };
    // Write to a temp file and rename so a crash cannot leave a
    // half-written checkpoint
    const tmpFile = `${this.checkpointFile}.tmp`;
    fs.writeFileSync(tmpFile, JSON.stringify(serialized, null, 2));
    fs.renameSync(tmpFile, this.checkpointFile);
  }
 
  async submitTask(task: Omit<AgentTask, "status">): Promise<string> {
    const existing = this.tasks.get(task.id);
    if (existing?.status === "completed") {
      console.log(`Task ${task.id} already completed, returning cached result`);
      return task.id;
    }
 
    this.tasks.set(task.id, { ...task, status: "pending" });
    this.checkpoint();
    return task.id;
  }
 
  async runAll(): Promise<Map<string, AgentTask>> {
    const pending = [...this.tasks.values()].filter(
      (t) => t.status !== "completed"
    );
 
    for (const task of pending) {
      task.status = "running";
      this.checkpoint();
 
      try {
        task.result = await this.delegateToAgent(task);
        task.status = "completed";
      } catch (error) {
        task.status = "failed";
        console.error(`Task ${task.id} failed:`, error);
      }
      this.checkpoint();
    }
 
    return this.tasks;
  }
 
  private async delegateToAgent(task: AgentTask): Promise<unknown> {
    // In reality, this calls the Antigravity agent API
    console.log(`Delegating ${task.type} to agent for ${task.targetFiles}`);
    return { reviewed: true, suggestions: [] };
  }
}
 
async function main() {
  const orchestrator = new DurableAgentOrchestrator("session-20260330");
 
  await orchestrator.submitTask({
    id: "review-auth",
    type: "code-review",
    targetFiles: ["src/auth/login.ts", "src/auth/session.ts"],
  });
 
  await orchestrator.submitTask({
    id: "refactor-api",
    type: "refactor",
    targetFiles: ["src/api/routes.ts"],
  });
 
  await orchestrator.submitTask({
    id: "gen-tests",
    type: "test-gen",
    targetFiles: ["src/utils/parser.ts"],
  });
 
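  const results = await orchestrator.runAll();
  // runAll records failures instead of throwing, so count real completions
  const completed = [...results.values()].filter(
    (t) => t.status === "completed"
  );
  console.log(`Tasks completed: ${completed.length}/${results.size}`);
  // Expected output: Tasks completed: 3/3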
}
 
main().catch(console.error);

For deeper coverage of advanced multi-agent design patterns, check out the Multi-Agent Orchestration Production Guide.

Seven Production Pitfalls the Official Docs Won't Save You From

Here's the list of things I've actually tripped over — none of which the Trigger.dev / Inngest / Temporal docs surface clearly. Ordered roughly by how badly each one stings.

1. Checkpoint JSON Bloats and Slows Down Every Write

The first one bites quickly. If you stash all fetched records in state.data, a few thousand rows of AdMob data balloon the JSON to 30–80 MB and every checkpoint write costs you seconds. Store raw data in S3 / R2 and put only the storage URI in state.data. That single fix shrank Step 1 of my pipeline from 11 minutes to 7.
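
The pointer-in-state idea can be sketched like this. BlobStore and MemoryBlobStore are hypothetical stand-ins for an S3 or R2 client, and the mem:// URI scheme exists only for this example.

```typescript
// Hypothetical blob storage interface. A real one would wrap S3 or R2.
interface BlobStore {
  put(key: string, body: string): Promise<string>; // resolves to a URI
  get(uri: string): Promise<string>;
}

class MemoryBlobStore implements BlobStore {
  private blobs = new Map<string, string>();

  async put(key: string, body: string): Promise<string> {
    const uri = `mem://${key}`;
    this.blobs.set(uri, body);
    return uri;
  }

  async get(uri: string): Promise<string> {
    const body = this.blobs.get(uri);
    if (body === undefined) throw new Error(`Missing blob: ${uri}`);
    return body;
  }
}

// Write the heavy payload to blob storage; keep only a pointer in state.
async function offload(
  blobs: BlobStore,
  state: { data: Record<string, unknown> },
  stepName: string,
  records: unknown[]
): Promise<void> {
  const uri = await blobs.put(`${stepName}.json`, JSON.stringify(records));
  state.data[stepName] = { uri, count: records.length };
}
```

The checkpoint then stays a few hundred bytes no matter how many rows a step fetched, and later steps read the payload back through the URI.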

2. "Retry Succeeded" Quietly Sends the Notification Twice

If the process restarts after the Slack message is sent but before the checkpoint for that step is saved, the retry sends the message again. Fix: gate the notification on a successful INSERT into a report_sent_log table keyed by (date). Without this, every rerun fires a duplicate daily report. I've seen this happen in my own logs and it's deeply embarrassing.
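
The gate can be sketched with a small interface in front of the log table. SentLog, MemorySentLog, and sendReportOnce are illustrative names, and the in-memory class only mimics what a unique-key INSERT would do in the database.

```typescript
// Hypothetical log interface. tryInsert resolves to true only for the first
// caller of a given date, mirroring an INSERT that hits a unique key.
interface SentLog {
  tryInsert(date: string): Promise<boolean>;
}

class MemorySentLog implements SentLog {
  private dates = new Set<string>();

  async tryInsert(date: string): Promise<boolean> {
    if (this.dates.has(date)) return false;
    this.dates.add(date);
    return true;
  }
}

// Send the report only if this run won the insert for the date.
async function sendReportOnce(
  log: SentLog,
  date: string,
  send: () => Promise<void>
): Promise<boolean> {
  const firstTime = await log.tryInsert(date);
  if (!firstTime) return false; // an earlier attempt already claimed this date
  await send();
  return true;
}
```

Inserting before sending makes delivery at-most-once: if send fails after the insert, no report goes out until the row is cleared. For a daily Slack summary I find a missed message easier to live with than a duplicate.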

3. Rate-Limit Retries Blast a Shared Quota

Exponential backoff on RESOURCE_EXHAUSTED is correct — but if multiple jobs share the same API token, retries from one can starve the others. For AdMob I now isolate the aggregation job into a separate Google Cloud project from the telemetry client. Per-job quotas, not per-account quotas.

4. Partial Recovery Corrupts the Most Recent Data

Stop at T1, resume at T2, and any relative time window is now resolved against T2 instead of T1, so the resumed steps can quietly query a different range than the completed ones did. AdMob has a one-hour reporting lag, so if you resume at T2 and ask for "yesterday" without specifying an explicit time window, you'll get partial data and your anomaly detector will misfire. Always persist the explicit [from, to] window in the checkpoint and reuse it on resume.
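
Pinning the window can be as small as the sketch below. It assumes the report covers the previous UTC day; ReportWindow and resolveWindow are illustrative names, and state stands in for the checkpointed object.

```typescript
interface ReportWindow {
  from: string; // inclusive, ISO 8601 UTC
  to: string;   // exclusive, ISO 8601 UTC
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Compute "yesterday in UTC" once, store it, and reuse it on every resume.
function resolveWindow(
  state: { window?: ReportWindow },
  now: Date
): ReportWindow {
  if (state.window) return state.window; // resumed run: reuse the pinned window

  const todayUtcMs = Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate()
  );
  const window: ReportWindow = {
    from: new Date(todayUtcMs - DAY_MS).toISOString(),
    to: new Date(todayUtcMs).toISOString(),
  };
  state.window = window;
  return window;
}
```

A run that starts on one day and resumes on the next still queries the range it started with, because the second call never recomputes it.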

5. Time Drift — Mixing UTC and Local Time

This one I owned spectacularly. AdMob is UTC, App Store Connect is America/Los_Angeles, Google Play is America/Los_Angeles, the Slack post is JST. If you store the date in the checkpoint as a bare string with no timezone, recovery can overwrite the wrong day's row. Always store ISO 8601 timestamps with an explicit offset in the checkpoint and convert to the display timezone at the very end.
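
To see why a bare date is ambiguous, here is a small sketch that derives the calendar date of one instant in several zones using the standard Intl.DateTimeFormat API. The helper name localDate is my own, and it assumes a runtime that ships IANA timezone data, as current Node.js builds do.

```typescript
// Calendar date (YYYY-MM-DD) of an instant as seen in a given IANA timezone.
function localDate(instantIso: string, timeZone: string): string {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
  }).formatToParts(new Date(instantIso));

  const get = (type: string): string =>
    parts.find((p) => p.type === type)?.value ?? "";

  return `${get("year")}-${get("month")}-${get("day")}`;
}

// One instant, three different "dates" depending on who is asking.
const instant = "2026-03-30T02:00:00Z";
console.log(localDate(instant, "UTC"));                 // 2026-03-30
console.log(localDate(instant, "America/Los_Angeles")); // 2026-03-29
console.log(localDate(instant, "Asia/Tokyo"));          // 2026-03-30
```

Storing the instant and deriving the per-source date at query time keeps each row keyed to the day its source actually means.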

6. Cold Start Beats Your First Checkpoint Write

On Cloudflare Workers and AWS Lambda, the cold-start window sometimes consumes enough of the budget that the first KV / S3 PUT (your first checkpoint) times out. Make the very first step a lightweight "boot heartbeat" write: I explicitly write state.lastCheckpoint = "boot-ok" before any real work begins.

7. Monitoring Blind Spot — Stuck Checkpoints Look Healthy

This is the scariest one. Retries absorb failures, which makes it easy to silently get stuck on the same step forever. No errors, no progress. The cure: export the timestamp of the last saveCheckpoint call as a Prometheus gauge or GA4 custom event, and alert if it doesn't move for 15 minutes. Adding that single alert is the difference between sleeping soundly through a nightly job and waking up to a four-day-old failure.
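
The alert condition itself is a one-liner. The sketch below shows only the staleness check, with hypothetical names; wiring the result into Prometheus, GA4, or a pager is left to whatever monitoring you already run.

```typescript
const FIFTEEN_MINUTES_MS = 15 * 60 * 1000;

type WorkflowStatus = "running" | "done";

// True when a running workflow has not saved a checkpoint within the threshold.
function isStalled(
  status: WorkflowStatus,
  lastCheckpointIso: string,
  now: Date,
  thresholdMs: number = FIFTEEN_MINUTES_MS
): boolean {
  if (status === "done") return false; // finished workflows are allowed to go quiet
  const ageMs = now.getTime() - new Date(lastCheckpointIso).getTime();
  return ageMs > thresholdMs;
}
```

Checking the status first matters: without it, every completed nightly run would start alerting 15 minutes after it finished.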

How to Get Antigravity to Generate This Pattern Reliably

A final section on prompt design, because the value of Durable Execution multiplies when the AI agent generating the code understands the pattern. The four-step approach below is what got my AdMob pipeline shipped in roughly a month.

Step 1: Make Plan Mode Enumerate Failure Modes First

Don't ask for code yet. Ask: "List ten things that could go wrong with this job." Rate limits, token expiry, network drops, data shape changes, mid-deploy dependencies: let Antigravity map out the failure surface first. Then pick the five to seven the implementation will actually handle, and put those in the spec.

Step 2: Lock Down Step Boundaries and Checkpoint Granularity Up Front

Ask: "Break this into at most five steps, with typed inputs and outputs for each." Without this, Antigravity tends to dump everything into one function. With typed step boundaries, checkpoint granularity falls out naturally.

Step 3: Specify Resume Behavior Per Step

"If Step 3 fails, retry only Step 3 — preserve Steps 1 and 2's results" beats the generic "make it durable" prompt by roughly 2x in output quality. One prompt per step, not one mega-prompt.

Step 4: Force Sandbox Failure-Injection Testing

Don't ship without it. Add a throwAt(step: number) debug flag and verify that resume from each step actually works. Skip this and the bill comes due in production a month later. Always.
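
A failure-injection harness along these lines can be very small. This is a sketch under my own naming (Step, runSteps, verifyResume), with throwAt taken as a zero-based step index and the completed set standing in for the persisted checkpoint.

```typescript
interface Step {
  name: string;
  run: () => Promise<void>;
}

// Run steps in order, skipping any already in `completed`.
// throwAt is a zero-based index used only for testing: the step at that
// position fails before it runs.
async function runSteps(
  steps: Step[],
  completed: Set<string>,
  throwAt?: number
): Promise<void> {
  for (let i = 0; i < steps.length; i++) {
    const step = steps[i];
    if (completed.has(step.name)) continue;
    if (i === throwAt) throw new Error(`Injected failure at step ${i}`);
    await step.run();
    completed.add(step.name);
  }
}

// Inject a failure at every position and confirm the resumed run finishes
// with each step executed exactly once.
async function verifyResume(names: string[]): Promise<boolean> {
  for (let failAt = 0; failAt < names.length; failAt++) {
    const counts = new Map<string, number>();
    const steps: Step[] = names.map((name) => ({
      name,
      run: async () => {
        counts.set(name, (counts.get(name) ?? 0) + 1);
      },
    }));
    const completed = new Set<string>();
    try {
      await runSteps(steps, completed, failAt);
    } catch {
      // expected: the injected failure
    }
    await runSteps(steps, completed); // resume with no injection
    if (names.some((n) => counts.get(n) !== 1)) return false;
  }
  return true;
}
```

Running verifyResume over the real step list in CI catches the case where a step quietly re-executes after a resume.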

In Closing

Durable Execution is the boring, load-bearing piece of running AI agents on long jobs. Since I moved my morning AdMob pipeline onto it, the number of nightly Slack alerts I get has dropped from roughly 12 a month to one. If you're looking for a first move, pick the longest job you currently run and carve in just three checkpoints. That alone changes how reruns feel.

I'm still tuning this stack as I run it. If you're an indie developer wrestling with the same long-running jobs, I hope this was useful. Thanks for reading.
