Trusting Temporal Workflows in Production — Field Notes on Idempotency, Retry Triage, and Saga Compensation

Practical notes from running Temporal as a production backend: how to make activities idempotent for real, where to draw the line between retryable and fatal errors, how to keep Saga compensation from firing twice, and how to make it all observable—built with Antigravity in the loop.

It started with a duplicate-charge ticket at midnight

A few weeks after moving a payment-and-provisioning flow onto Temporal, I got a ticket: "I was billed twice for the same order." The logs told a familiar story. The charge activity had been judged timed out and retried, but the first attempt had actually succeeded. The response just never made it back.

Temporal is powerful because it treats the workflow code itself as durable execution state: if a worker dies, it resumes from exactly the right point. That same strength cuts both ways. Activities can run more than once, and whether that is harmless depends entirely on how you wrote them; the moment you forget that, side effects double up.

These are field notes from putting Temporal under a real backend, organized around four things I actually tripped on: idempotency, retry triage, Saga compensation, and observability. This isn't an introduction—it's meant for people already running Temporal who want fewer "we got burned here" moments. I build most of this with an Antigravity agent, so I'll also point out which decisions a human still needs to keep hold of.

Designing for at-least-once changes everything

The guarantee Temporal gives an activity is at-least-once, not exactly-once. A timeout, a worker crash, a transient network failure—any of them and Temporal calls the activity again.

The subtle part is that a retry can happen not because the activity failed, but because it succeeded and the result didn't come back. If a payment API completes the work but the connection drops before the response returns, Temporal sees a failure and tries again. So every activity with a side effect has to be idempotent, full stop.
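To make the "succeeded, but the response was lost" case concrete, here is a minimal simulation. FakeProvider and callWithLostFirstResponse are hypothetical stand-ins written for this illustration, not Temporal or Stripe APIs; the sketch only assumes that the first response is dropped and the caller retries.

```typescript
// Hypothetical in-memory "payment provider", used only for illustration.
class FakeProvider {
  private charges: string[] = [];              // one entry per charge actually made
  private byKey = new Map<string, string>();   // idempotency key -> charge ID

  // Non-idempotent endpoint: every call creates a new charge.
  charge(orderId: string): string {
    this.charges.push(orderId);
    return `ch_${this.charges.length}`;
  }

  // Idempotent endpoint: a repeated key returns the original charge.
  chargeOnce(key: string, orderId: string): string {
    const seen = this.byKey.get(key);
    if (seen !== undefined) return seen;
    const id = this.charge(orderId);
    this.byKey.set(key, id);
    return id;
  }

  get chargeCount(): number {
    return this.charges.length;
  }
}

// Calls `op` repeatedly. The first attempt's response is "lost": the side
// effect happens, but the caller sees a failure and retries.
function callWithLostFirstResponse<T>(op: () => T, maxAttempts = 3): T {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = op();            // the side effect happens here
    if (attempt === 1) continue;    // response dropped, so the caller retries
    return result;
  }
  throw new Error('exhausted attempts');
}

const naive = new FakeProvider();
callWithLostFirstResponse(() => naive.charge('order-1'));

const safe = new FakeProvider();
callWithLostFirstResponse(() => safe.chargeOnce('charge-order-1', 'order-1'));

console.log(naive.chargeCount, safe.chargeCount); // 2 1
```

The naive provider ends up with two charges for one order, which is the duplicate-charge ticket in miniature; the keyed version absorbs the retry.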

How you achieve that depends on the side effect. For writes to your own database, a unique constraint plus ON CONFLICT absorbs duplicates. For an external API, riding on the provider's idempotency-key feature is the reliable path.

// src/temporal/activities/billing.ts
import { ApplicationFailure } from '@temporalio/activity';
import { db, charges } from '../../db';
import { stripe } from '../../lib/stripe';
 
interface ChargeInput {
  orderId: string;   // a stable ID already fixed by the workflow
  customerId: string;
  amount: number;
  currency: string;
}
 
/**
 * Charge activity. Idempotency is enforced in two layers:
 *   1) Stripe's idempotencyKey makes the provider process one request once
 *   2) Our own charges table records the result under a unique key,
 *      so a re-run trusts the record over re-calling the API
 */
export async function chargeCustomer(input: ChargeInput): Promise<string> {
  const { orderId, customerId, amount, currency } = input;
 
  // Check our own record first. If a success already exists, return without calling out.
  const existing = await db.query.charges.findFirst({
    where: (c, { eq }) => eq(c.orderId, orderId),
  });
  if (existing?.status === 'succeeded') {
    return existing.stripeChargeId;
  }
 
  // Use orderId itself as the idempotency key—retries won't double-charge.
  const intent = await stripe.paymentIntents.create(
    { amount, currency, customer: customerId, confirm: true },
    { idempotencyKey: `charge-${orderId}` },
  );
 
  if (intent.status !== 'succeeded') {
    // A business failure; retrying won't change the outcome, so make it non-retryable.
    throw ApplicationFailure.create({
      message: `Payment did not complete: ${intent.status}`,
      type: 'PaymentNotCompleted',
      nonRetryable: true,
    });
  }
 
  await db
    .insert(charges)
    .values({ orderId, stripeChargeId: intent.id, status: 'succeeded', amount })
    .onConflictDoNothing();
 
  return intent.id;
}

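What makes this work is that the idempotency key is a stable ID fixed by the workflow. If you call crypto.randomUUID() inside the activity and use that as the key, it changes on every retry and idempotency collapses. Derive the key from an ID the workflow already holds, or generate it in the workflow body, and pass it as an argument. The activity's input is recorded in workflow history, so every retry of that activity receives the same value. And because Temporal replays workflows deterministically, an ID generated in the workflow with a deterministic source (uuid4() from @temporalio/workflow, for example) also stays the same across a worker restart.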

My midnight duplicate charge was exactly this: the "check our own record first" step was missing, and the key was being generated inside the activity. Leaning on Stripe's key alone wasn't enough once the key itself wasn't stable.
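The difference between a key derived from workflow-owned IDs and a key minted inside the activity can be sketched in a few lines. Everything here is illustrative: idempotencyKey is a hypothetical helper, and freshRandomId is a stand-in for crypto.randomUUID() so the example stays deterministic.

```typescript
// Hypothetical helper: derive the key from IDs the workflow already owns.
function idempotencyKey(step: string, orderId: string): string {
  return `${step}-${orderId}`;
}

// Stand-in for crypto.randomUUID(): returns a different value on every call.
let counter = 0;
function freshRandomId(): string {
  counter += 1;
  return `random-${counter}`;
}

// Simulates an activity body that builds its key once per attempt.
function keysSeenAcrossAttempts(makeKey: () => string, attempts: number): Set<string> {
  const keys = new Set<string>();
  for (let i = 0; i < attempts; i++) keys.add(makeKey());
  return keys;
}

const unstable = keysSeenAcrossAttempts(() => freshRandomId(), 3);
const stable = keysSeenAcrossAttempts(() => idempotencyKey('charge', 'order-42'), 3);

console.log(unstable.size, stable.size); // 3 1
```

Three attempts with a per-attempt random key look like three unrelated requests to the provider; three attempts with the derived key look like one.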

Draw the retry line with types

By default Temporal retries failed activities with exponential backoff. Convenient, but not everything should be retried. Hammering a "balance insufficient" or "invalid input" failure that will never change just burns time and external-API budget.

I sort failures into three buckets: transient ones that retrying fixes (network, 5xx, rate limits), business failures that no amount of retrying helps (validation, declined payment), and misconfigurations a human needs to see (auth errors, missing resources). Only the first bucket is retryable; the rest stop immediately.
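As a rough sketch of that three-way split, the function below maps an HTTP-style status code onto the buckets. The specific mapping (which codes count as operator problems, for instance) is my assumption for illustration; a real integration would follow the provider's own error catalogue.

```typescript
type FailureClass = 'transient' | 'business' | 'operator';

// Hypothetical mapping from an HTTP-style status to one of the three buckets.
// `undefined` means no response arrived at all (a network failure).
function classifyStatus(httpStatus: number | undefined): FailureClass {
  if (httpStatus === undefined) return 'transient';                 // network failure
  if (httpStatus === 429 || httpStatus >= 500) return 'transient';  // rate limit, 5xx
  if (httpStatus === 401 || httpStatus === 403 || httpStatus === 404) {
    return 'operator';                                              // auth error, missing resource
  }
  return 'business';                                                // validation, declined payment
}

// Only the transient bucket is worth retrying.
function shouldRetry(httpStatus: number | undefined): boolean {
  return classifyStatus(httpStatus) === 'transient';
}

const buckets = [503, 429, 402, 401, undefined].map(classifyStatus);
console.log(buckets.join(',')); // transient,transient,business,operator,transient
```

In an activity, the non-transient branches are where an ApplicationFailure with nonRetryable: true would be raised.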

Temporal gives you two levers. On the workflow side you set retry.nonRetryableErrorTypes per activity; on the activity side you raise an ApplicationFailure with nonRetryable: true. The former declares "this activity stops on this type," the latter declares "this failure is inherently not worth retrying." Different jobs.

// src/temporal/workflows/checkout.ts
import { proxyActivities } from '@temporalio/workflow';
import type * as acts from '../activities/billing';
 
const { chargeCustomer } = proxyActivities<typeof acts>({
  startToCloseTimeout: '30 seconds',
  retry: {
    initialInterval: '1s',
    backoffCoefficient: 2,
    maximumInterval: '30s',
    maximumAttempts: 5,
    // validation and settled business failures are not retried
    nonRetryableErrorTypes: ['PaymentNotCompleted', 'InvalidInput'],
  },
});

One caution: leaving maximumAttempts unbounded (0, which is also the default) means an unrecoverable failure retries forever and the workflow lingers as "running." That's bad for observability too, so for anything touching an external dependency I set a cap and treat hitting it as an alert. A failure that retries can't absorb is precisely the thing operations wants to know about.
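It helps to know how long a capped policy can actually take before it gives up. The sketch below computes the wait before each retry as the initial interval times the coefficient raised to the retry number minus one, capped at the maximum interval. The RetryPolicy interface and retryDelaysMs are hypothetical names for this illustration, and the figures ignore scheduling latency.

```typescript
// Millisecond-based mirror of the retry settings, for illustration only.
interface RetryPolicy {
  initialIntervalMs: number;
  backoffCoefficient: number;
  maximumIntervalMs: number;
  maximumAttempts: number; // must be >= 1 here; 0 (unlimited) has no finite schedule
}

// Delay before each retry: initial * coefficient^(retry - 1), capped at the maximum.
function retryDelaysMs(p: RetryPolicy): number[] {
  if (p.maximumAttempts < 1) throw new Error('needs a finite maximumAttempts');
  const delays: number[] = [];
  for (let retry = 1; retry < p.maximumAttempts; retry++) {
    const raw = p.initialIntervalMs * Math.pow(p.backoffCoefficient, retry - 1);
    delays.push(Math.min(raw, p.maximumIntervalMs));
  }
  return delays;
}

// The policy from the checkout workflow: 1s initial, x2, 30s cap, 5 attempts.
const delays = retryDelaysMs({
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
  maximumIntervalMs: 30000,
  maximumAttempts: 5,
});

// Worst case: every attempt also runs into the 30-second startToCloseTimeout.
const worstCaseMs = delays.reduce((sum, d) => sum + d, 0) + 5 * 30000;

console.log(delays.join(','), worstCaseMs); // 1000,2000,4000,8000 165000
```

So with these settings a stuck charge surfaces as a failure after at most roughly 165 seconds, which is a useful number to have in mind when setting alert thresholds.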

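Don't skip startToCloseTimeout either. Temporal requires at least one of startToCloseTimeout or scheduleToCloseTimeout, but only startToCloseTimeout bounds a single attempt. Without it, an activity that has silently hung isn't detected until the overall schedule-to-close budget runs out, so retries effectively never begin. Set the timeout to "a generous upper bound on normal execution time." Too long and you detect failures late; too short and you retry healthy work by mistake.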

Half-finished Saga compensation is the scariest case

When a multi-service flow succeeds partway and then fails, the Saga pattern undoes the already-successful steps through compensation. Temporal lets you write compensation as plain workflow code. The accident-prone moment is when the compensation itself fails and gets retried.

Say you reserve inventory, charge, and schedule shipment, and shipment scheduling fails. You run two compensations—refund the charge and release the inventory—but if the refund API times out, it retries too. If compensation isn't idempotent, you release inventory twice or attempt two refunds. So compensating activities must be idempotent as well, not just the forward path.

// src/temporal/workflows/fulfillment.ts
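import { proxyActivities, log } from '@temporalio/workflow';
import type * as acts from '../activities';

// Assumed shape of the order (not shown in the original listing); adjust to your own domain type.
interface OrderInput {
  id: string;
  customerId: string;
  amount: number;
  currency: string;
}

const a = proxyActivities<typeof acts>({
  startToCloseTimeout: '30 seconds',
  retry: { maximumAttempts: 5 },
});

export async function fulfillOrder(order: OrderInput): Promise<void> {
  // Push each undo step as we complete a forward step.
  const compensations: Array<() => Promise<void>> = [];

  try {
    const reservation = await a.reserveInventory(order);
    compensations.push(() => a.releaseInventory(reservation.id));

    // chargeCustomer expects ChargeInput (orderId, not id), so map the order onto it explicitly.
    const chargeId = await a.chargeCustomer({
      orderId: order.id,
      customerId: order.customerId,
      amount: order.amount,
      currency: order.currency,
    });
    // Compensation takes the *result ID* of completed work and undoes it idempotently.
    compensations.push(() => a.refundCharge({ chargeId, orderId: order.id }));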
 
    await a.scheduleShipment(order);
  } catch (err) {
    log.warn('Forward path failed; running compensations in reverse', { orderId: order.id });
    // Undo last-pushed first (reverse dependency order).
    for (const compensate of compensations.reverse()) {
      try {
        await compensate();
      } catch (compErr) {
        // Never swallow a failed compensation—surface it to a human.
        log.error('Compensation failed; manual handling required', { orderId: order.id, compErr });
      }
    }
    throw err; // settle the workflow as failed
  }
}

Three design points. First, push compensations onto a stack and undo only what you actually ran, in reverse. Calling every rollback up front tries to undo steps you never executed and creates new accidents. Second, a compensating activity receives the result ID to undo, not a recomputation—refundCharge is an idempotent reversal that refunds once even if called twice. Third, don't swallow a failed compensation. An order whose compensation failed is your highest-priority investigation, because consistency may have broken; log it and page the operations channel.
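The second point, a reversal that refunds once even if called twice, might look like the sketch below. The in-memory map is a hypothetical stand-in for a refunds table with a unique constraint on chargeId, and providerRefundCalls stands in for the real provider call; neither is the article's actual implementation.

```typescript
interface RefundInput {
  chargeId: string;
  orderId: string;
}

// Hypothetical in-memory ledger standing in for a refunds table with a
// unique constraint on chargeId.
const refundsByCharge = new Map<string, string>();
let providerRefundCalls = 0;

// Refunds a charge at most once, however many times it is called.
function refundCharge(input: RefundInput): string {
  const existing = refundsByCharge.get(input.chargeId);
  if (existing !== undefined) return existing;  // already reversed: no second refund

  providerRefundCalls += 1;                     // stand-in for the provider call
  const refundId = `re_${input.chargeId}`;
  refundsByCharge.set(input.chargeId, refundId);
  return refundId;
}

const first = refundCharge({ chargeId: 'ch_1', orderId: 'order-1' });
const second = refundCharge({ chargeId: 'ch_1', orderId: 'order-1' }); // simulated retry

console.log(first === second, providerRefundCalls); // true 1
```

A local record alone does not cover a crash between the provider call and the write, so in a real activity the provider call would also carry a stable idempotency key, just as the forward charge does.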

Running this solo as an indie developer, it's tempting to think "compensation rarely fails, skip it." But a silent inconsistency left behind on the rare time it does fail is the worst outcome, so this is the one place I refuse to cut corners.
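One way to keep a failed compensation from going silent is to have the rollback loop return what failed, so the caller has something concrete to alert on. This is a plain TypeScript sketch under that assumption; Compensation and runCompensations are hypothetical names, not part of the Temporal SDK.

```typescript
type Compensation = { label: string; run: () => Promise<void> };

// Runs compensations last-registered first and reports the ones that failed,
// instead of stopping at the first error or swallowing it.
async function runCompensations(stack: Compensation[]): Promise<string[]> {
  const failed: string[] = [];
  for (const c of [...stack].reverse()) {  // copy so the caller's array is untouched
    try {
      await c.run();
    } catch {
      failed.push(c.label);                // keep going, but remember the failure
    }
  }
  return failed;
}
```

A workflow could log each entry of the returned list and fail with an error that names them, which is what turns "compensation failed" into something a human actually sees.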

Make observability match the meaning of the workflow

Temporal's Web UI shows execution history on a timeline, which already makes debugging far easier. In production, though, you want it continuous with your existing distributed tracing, so wire in OpenTelemetry. Once its interceptors are registered, @temporalio/interceptors-opentelemetry creates spans for workflows and activities automatically.

// src/temporal/worker.ts
import { Worker, defaultSinks } from '@temporalio/worker';
import {
  makeWorkflowExporter,
  OpenTelemetryActivityInboundInterceptor,
} from '@temporalio/interceptors-opentelemetry/lib/worker';
import { resource, traceExporter } from './otel';
 
async function run() {
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'),
    activities: require('./activities'),
    taskQueue: 'order-processing',
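    // makeWorkflowExporter returns a single sink, so register it under the `exporter` key rather than spreading it.
    sinks: { ...defaultSinks(), exporter: makeWorkflowExporter(traceExporter, resource) },
    interceptors: {
      // Workflow-side spans additionally need the package's workflow interceptors, loaded via workflowModules.
      activityInbound: [(ctx) => new OpenTelemetryActivityInboundInterceptor(ctx)],
    },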
  });
  await worker.run();
}
run().catch((e) => { console.error(e); process.exit(1); });

Connected traces are valuable, but what actually paid off in operations was Search Attributes. Register orderId or customerId as workflow search attributes and, when a ticket lands, you can pull up "this customer's workflow for this order" by ID instantly. A support desk that doesn't know a trace ID can still locate the workflow in the language of the business—huge in practice.

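// attach search attributes when starting the workflow
// (`client` is a connected Client from @temporalio/client; CustomerId must already be
// registered as a custom search attribute on the namespace)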
await client.workflow.start(fulfillOrder, {
  taskQueue: 'order-processing',
  workflowId: `order-${order.id}`,        // business ID as the workflow ID
  args: [order],
  searchAttributes: { CustomerId: [order.customerId] },
});
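On the lookup side, support tooling only needs to build a list filter from business IDs. The helper below is a hypothetical sketch that produces the SQL-like filter string Temporal's visibility search accepts; it assumes CustomerId is a registered Keyword attribute and deliberately rejects values containing quotes instead of guessing at escaping rules.

```typescript
// Builds a filter such as: CustomerId = 'cus_123' AND ExecutionStatus = 'Running'
function listFilterFor(attrs: Record<string, string>): string {
  return Object.entries(attrs)
    .map(([key, value]) => {
      if (!/^[A-Za-z][A-Za-z0-9_]*$/.test(key)) {
        throw new Error(`bad attribute name: ${key}`);
      }
      if (value.includes("'")) {
        throw new Error(`unsupported character in value for ${key}`);
      }
      return `${key} = '${value}'`;
    })
    .join(' AND ');
}

const listFilter = listFilterFor({ CustomerId: 'cus_123', ExecutionStatus: 'Running' });
console.log(listFilter); // CustomerId = 'cus_123' AND ExecutionStatus = 'Running'
```

The same string can be pasted into the Web UI's workflow filter, which is what lets a support desk search without ever seeing a trace ID.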

For metrics I dashboarded three things first: number of lingering workflows, count of activities that hit their retry cap, and count of compensations that ran. All three represent "anomalies that auto-recovery couldn't absorb," and they keep Temporal from quietly working so hard that the real problem disappears from view.

What to delegate to Antigravity, and what to keep

When I hand this work to an Antigravity agent, the one thing I keep explicit in AGENTS.md is the policy for idempotency, retry triage, and Saga compensation. An agent can write most of the code, but "which errors are non-retryable" and "the order and idempotency of compensation" are decisions that reach into business meaning. If that policy is left vague when the code is generated, you get code that runs but breaks in production.

Conversely, activity scaffolding, the test harness, the OpenTelemetry wiring—give the agent the policy and it assembles the boilerplate quickly. In practice I let it draft the worker bootstrap and the retry policies, and spent my time reviewing how idempotency keys were derived and whether compensation was correct. Deciding where judgment lives is, for me, the realistic line for keeping production quality while building AI-first.

If you want a next step, pick one activity in a workflow you're already running and write down, on paper, "if this runs twice on a retry, what doubles?" That's the first move toward idempotency, and it's usually exactly where the most accident-prone spot is hiding.
