I run a small SaaS, and one month our MRR dropped sharply for no obvious reason. New signups were growing. Engagement was stable. When I finally chased it down inside the Stripe dashboard, the answer was unsexy: expired cards and one-off network declines. Roughly six out of ten cancellations that month traced back to those.
What hit me wasn't the loss of revenue. It was the realization that "how I handle failed payments" was a lifeline for the product, equal in weight to anything I'd ship as a feature. If I could automate that recovery flow well, it would compound just like new growth.
This article walks through the dunning recovery pipeline I rebuilt afterward, using Antigravity's AI agents at the center. I'll cover capturing failure events, controlling retries, branching email copy by reason, Slack escalation, and a final winback step — all in patterns I run in production today.
Why an AI Agent Belongs at the Center of Dunning
"Dunning" is the polite name for the dance you do with a customer after a charge fails. It looks small from the outside but the decision tree is surprisingly thick:
- An expired card, an insufficient-funds decline, and a 3DS authentication failure all need different copy and different timing.
- Pushing more retries lifts recovery rate, but past a point it makes customers feel hounded.
- If the same customer fails twice in a week, sending another email usually backfires.
- Corporate cards and personal cards often need different recipients on the notification side.
I tried to encode all this in a single webhook handler with if branches. It got past 150 lines before I lost track of what was firing when. I threw it out.
The shift that worked was moving the decision logic out of the handler entirely and into an Antigravity sub-agent. The webhook handler shrinks down to "validate, normalize, hand off." The agent owns the policy.
Pipeline Overview
The complete shape:
- Receive a Stripe webhook (invoice.payment_failed, customer.subscription.updated, etc.)
- Idempotency check + event normalization (Cloudflare D1 or Postgres)
- Hand off to the Dunning Agent with normalized failure reason, customer profile, and retry history
- Decide: keep retrying via Stripe, notify the customer, or trigger a winback offer
- Execute: send email through Resend, alert internal Slack, update KV/DB state
- Observe: log everything as structured JSON and chart it in Looker Studio
Steps 3 and 4 are where the agent earns its place. Stripe's built-in Smart Retries are good, but they don't speak your domain — "skip notifications during free trial," "give annual subscribers extra grace," "different copy for customers who came back after canceling once before." Encoding those rules in a prompt + tool contract beats an ever-growing nest of conditionals.
Step 1: Receive Webhooks Idempotently
The webhook handler comes first. Stripe will sometimes deliver the same event multiple times, so deduplicate before you touch any business logic.
// app/api/webhook/stripe-billing/route.ts
import Stripe from "stripe";
import { getCloudflareContext } from "@opennextjs/cloudflare";
// Set as Cloudflare Worker secrets:
// STRIPE_SECRET_KEY=sk_live_... (or sk_test_...)
// STRIPE_WEBHOOK_BILLING_SECRET=whsec_...
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!, {
apiVersion: "2025-08-27.basil",
});
export async function POST(req: Request) {
const sig = req.headers.get("stripe-signature");
if (!sig) return new Response("missing signature", { status: 400 });
const body = await req.text();
let event: Stripe.Event;
try {
event = await stripe.webhooks.constructEventAsync(
body,
sig,
process.env.STRIPE_WEBHOOK_BILLING_SECRET!,
undefined,
Stripe.createSubtleCryptoProvider() // required on Cloudflare Workers
);
} catch (err) {
console.error("webhook signature verification failed", err);
return new Response("invalid signature", { status: 400 });
}
// Idempotency: event.id is the dedup key
const { env } = getCloudflareContext();
const seen = await env.BILLING_KV.get(`evt:${event.id}`);
if (seen) {
return new Response("duplicate", { status: 200 });
}
await env.BILLING_KV.put(`evt:${event.id}`, "1", { expirationTtl: 60 * 60 * 24 * 7 });
if (event.type === "invoice.payment_failed") {
await enqueueDunning(event.data.object as Stripe.Invoice, env);
}
return new Response("ok", { status: 200 });
}
The handler does not make business decisions. Stripe retries failed webhook deliveries for up to three days, so a slow or timing-out handler sets off a chain of redeliveries. Aim for under 100 ms: validate, dedupe, enqueue, return.
Expected behavior: even if the same evt_xxx arrives five times, enqueueDunning runs exactly once.
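To make the dedup contract concrete, here is a minimal in-memory sketch rather than the production handler. `SeenStore` and `handleOnce` are hypothetical names, and a `Set` stands in for the asynchronous `BILLING_KV` binding. The sketch also writes the key only after the enqueue succeeds, a small variation on the handler above, so a failed enqueue stays eligible for Stripe's next delivery attempt.

```typescript
// Hypothetical stand-in for BILLING_KV; the real binding is asynchronous.
interface SeenStore {
  has(key: string): boolean;
  add(key: string): unknown;
}

type Outcome = "enqueued" | "duplicate";

// Runs `enqueue` at most once per event id. The key is written only after
// `enqueue` returns, so a thrown error leaves the event eligible for retry.
function handleOnce(eventId: string, seen: SeenStore, enqueue: () => void): Outcome {
  const key = `evt:${eventId}`;
  if (seen.has(key)) return "duplicate";
  enqueue();
  seen.add(key);
  return "enqueued";
}

// Simulate Stripe delivering the same event five times.
const seenEvents = new Set<string>();
let enqueueCount = 0;
const outcomes: Outcome[] = [];
for (let i = 0; i < 5; i++) {
  outcomes.push(handleOnce("evt_123", seenEvents, () => { enqueueCount += 1; }));
}
```

One caveat worth keeping in mind: Workers KV is eventually consistent, so two near-simultaneous deliveries can both pass the check. If that matters for your volume, a store with atomic writes (for example a unique key in D1) would be the safer home for this check.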
Step 2: Normalize the Failure into Domain Vocabulary
The Stripe Invoice object is information-dense and not friendly to feed to an AI agent raw. I run it through a normalization layer.
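// lib/dunning/normalize.ts
import type Stripe from "stripe";
// Assumes `stripe` (the client from Step 1), the `Env` type, and the helpers detectPlanTier,
// detectSegment, getLastSuccessfulPaymentAt, and getLtvJpy are defined elsewhere in the project.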
export type DunningContext = {
customerId: string;
customerEmail: string;
amountDue: number; // smallest unit (JPY = yen, USD = cents)
currency: string;
failureReason: "card_expired" | "insufficient_funds" | "authentication_required" | "do_not_honor" | "unknown";
attemptCount: number;
planTier: "basic" | "pro" | "team";
isAnnual: boolean;
customerSegment: "trial" | "new" | "loyal" | "winback";
lastSuccessfulPaymentAt: number | null; // unix seconds
totalLifetimeValueJpy: number;
};
export async function buildDunningContext(invoice: Stripe.Invoice, env: Env): Promise<DunningContext> {
const customer = await stripe.customers.retrieve(invoice.customer as string);
const subs = await stripe.subscriptions.list({ customer: invoice.customer as string, limit: 1 });
const sub = subs.data[0];
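// Note: last_payment_error is a PaymentIntent field, not an Invoice field. Depending on your
// API version you may need to load the invoice's PaymentIntent and read
// last_payment_error.decline_code from it; the cast below only keeps this sketch compiling.
const code = invoice.last_finalization_error?.decline_code ?? (invoice as any).last_payment_error?.decline_code;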
const failureReason = mapDeclineCode(code);
if (!failureReason) {
console.warn("unmapped decline_code", code, "invoice", invoice.id);
}
return {
customerId: invoice.customer as string,
customerEmail: (customer as Stripe.Customer).email ?? "",
amountDue: invoice.amount_due,
currency: invoice.currency,
failureReason: failureReason ?? "unknown",
attemptCount: invoice.attempt_count,
planTier: detectPlanTier(sub),
isAnnual: sub?.items.data[0]?.price.recurring?.interval === "year", // tolerate customers with no subscription
customerSegment: await detectSegment(invoice.customer as string, env),
lastSuccessfulPaymentAt: await getLastSuccessfulPaymentAt(invoice.customer as string, env),
totalLifetimeValueJpy: await getLtvJpy(invoice.customer as string, env),
};
}
function mapDeclineCode(code?: string | null): DunningContext["failureReason"] | undefined {
switch (code) {
case "expired_card": return "card_expired";
case "insufficient_funds": return "insufficient_funds";
case "authentication_required": return "authentication_required";
case "do_not_honor": return "do_not_honor";
default: return undefined;
}
}
Why translate Stripe codes into your own vocabulary? Stripe's set of decline_code values evolves. Pinning the agent prompt directly to those strings makes it brittle. Cap your domain to four or five reasons, send anything new to unknown, and alert on it. That single discipline keeps the agent stable for years.
If you're already deep in Stripe webhook plumbing, my full SaaS pattern — including metering and invoice issuance — is in Antigravity x Stripe Full-Stack SaaS Deployment Guide. Reading both side by side gives a clearer picture of how dunning fits into the broader billing surface.
Step 3: Define the Dunning Agent in Antigravity
This is where the AI agent comes in. Drop a file at agents/dunning-orchestrator.md and make the tools and policy explicit.
# Dunning Orchestrator Agent
## Role
Receive a Stripe payment failure event and choose exactly one recovery action that protects both customer experience and revenue.
## Available tools
- send_email(template_id, customer_email, variables): send via Resend
- post_slack(channel, blocks): post to Slack
- update_customer_state(customer_id, state, note): write to internal KV
- request_stripe_smart_retry(invoice_id, schedule): reschedule Stripe Smart Retries
- offer_winback_discount(customer_id, percent_off, valid_days): issue coupon + send email
- escalate_to_human(reason): hand off to support
## Decision policy
1. failureReason = "card_expired" -> send card update link (template "card_update_request")
2. failureReason = "insufficient_funds" AND attemptCount <= 2 -> let Stripe Smart Retries continue + send a soft reminder
3. attemptCount >= 3 AND planTier in ["pro","team"] AND totalLifetimeValueJpy > 30000 -> escalate_to_human
4. customerSegment = "loyal" AND failureReason != "card_expired" -> offer_winback_discount(percent_off=20, valid_days=14)
5. unknown cases -> escalate_to_human + Slack notification
## Hard constraints
- Never send more than one email to the same customer within 24 hours.
- Always check that customer_email is non-empty before calling send_email.
- offer_winback_discount can fire at most once per 90 days for the same customer.
The "hard constraints" section is the part I refuse to compromise on. AI agents are flexible, but flexibility without guardrails turns into surprise behavior in production. The single line "never send more than one email per 24 hours" was the difference between launching nervously and launching with confidence on day one.
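Prompt-level constraints are easier to trust when the same rules are also checked in code before any tool runs. The following is a minimal sketch of that idea, not the article's implementation: `ContactState`, `ProposedAction`, and `violatesHardConstraint` are hypothetical names, and the timestamps are assumed to be stored per customer in unix milliseconds.

```typescript
// Hypothetical per-customer state; field names are assumptions for this sketch.
type ContactState = {
  lastEmailAt: number | null;   // unix ms of the last email sent
  lastWinbackAt: number | null; // unix ms of the last winback offer
};

type ProposedAction = "send_email" | "offer_winback_discount" | "post_slack" | "escalate_to_human";

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns null when the action is allowed, or a short reason when it is blocked.
function violatesHardConstraint(
  action: ProposedAction,
  customerEmail: string,
  state: ContactState,
  now: number,
): string | null {
  // Winback offers also send an email, so both actions count against the throttle.
  const sendsEmail = action === "send_email" || action === "offer_winback_discount";
  if (!sendsEmail) return null;
  if (!customerEmail) return "no_email";
  if (state.lastEmailAt !== null && now - state.lastEmailAt < DAY_MS) return "email_throttled_24h";
  if (
    action === "offer_winback_discount" &&
    state.lastWinbackAt !== null &&
    now - state.lastWinbackAt < 90 * DAY_MS
  ) {
    return "winback_cooldown_90d";
  }
  return null;
}
```

Returning a reason string instead of a boolean keeps the skip visible in logs, which helps when you later ask why a customer received nothing.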
Step 4: Invoke the Agent and Handle Results
Don't call the agent directly from the webhook handler. Push the work onto a queue (Cloudflare Queues or D1 + Cron) because agent calls can take seconds.
// lib/dunning/run-agent.ts
import { GoogleGenAI, Type, type FunctionDeclaration } from "@google/genai";
const tools: FunctionDeclaration[] = [
{
name: "send_email",
description: "Send an email using a fixed template id and variables",
parameters: {
type: Type.OBJECT, // the SDK's schema type expects the Type enum, not a lowercase string
properties: {
template_id: { type: Type.STRING, enum: ["card_update_request", "soft_payment_reminder", "winback_offer", "final_warning"] },
customer_email: { type: Type.STRING },
variables: { type: Type.OBJECT },
},
required: ["template_id", "customer_email"],
},
},
// ... declare the other tools the same way
];
export async function runDunningAgent(ctx: DunningContext, env: Env) {
// Hard constraints (always run before the agent decision)
if (!ctx.customerEmail) {
await postSlack(env, `[dunning] customer ${ctx.customerId} has no email - skipping`);
return { action: "skipped", reason: "no_email" };
}
const recent = await env.BILLING_KV.get(`mail-throttle:${ctx.customerId}`);
if (recent) {
return { action: "skipped", reason: "throttled" };
}
const ai = new GoogleGenAI({ apiKey: env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
model: "gemini-3-pro",
contents: [
{ role: "user", parts: [{ text: buildPrompt(ctx) }] },
],
config: {
tools: [{ functionDeclarations: tools }],
systemInstruction: await loadAgentMd(env, "dunning-orchestrator.md"),
temperature: 0.2, // tighten determinism
},
});
// The agent is constrained to one function_call per invocation.
const call = response.functionCalls?.[0];
if (!call?.name) {
await postSlack(env, `[dunning] agent returned no action for ${ctx.customerId}`);
return { action: "no_action" };
}
const result = await dispatchTool(call.name, call.args ?? {}, ctx, env);
// Only actions that email the customer should start the 24h mail throttle.
if (call.name === "send_email" || call.name === "offer_winback_discount") {
await env.BILLING_KV.put(
`mail-throttle:${ctx.customerId}`,
"1",
{ expirationTtl: 60 * 60 * 24 } // 24h throttle
);
}
return { action: call.name, ...result };
}
The deliberate constraint here is one action per agent call. If the agent could fire "send email + Slack alert + issue coupon" in the same response, you immediately face partial-failure questions: what state is the customer in if the email succeeded but the coupon issue failed? Limit it to one step, then call again if you need a follow-up. Rollback design becomes trivial.
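The one-action rule can also be enforced on the response itself rather than trusted to the prompt. Below is a small sketch under that assumption: `ToolCall`, `SingleCall`, and `requireSingleCall` are hypothetical names, and the input is whatever array of function calls the model returned.

```typescript
// Minimal shape of a function call: the name/args pair the dispatcher expects.
type ToolCall = { name?: string; args?: Record<string, unknown> };

type SingleCall =
  | { ok: true; name: string; args: Record<string, unknown> }
  | { ok: false; reason: "no_call" | "multiple_calls" | "unnamed_call" };

// Accepts a response only when it carries exactly one named call.
function requireSingleCall(calls: ToolCall[] | undefined): SingleCall {
  if (!calls || calls.length === 0) return { ok: false, reason: "no_call" };
  if (calls.length > 1) return { ok: false, reason: "multiple_calls" };
  const call = calls[0];
  if (!call || !call.name) return { ok: false, reason: "unnamed_call" };
  return { ok: true, name: call.name, args: call.args ?? {} };
}
```

Rejecting a multi-call response outright, instead of quietly taking the first call, means a prompt regression shows up as a Slack alert rather than as a half-executed plan.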
Expected behavior: within 30 seconds of a failed charge, the agent picks one action, Resend delivers the email, and a 24h throttle entry lands in KV.
Step 5: Let the AI Tailor Email Copy
I don't ship fully static templates. Inside Resend + React Email, the agent gets a small slot — usually a "hero paragraph" — that it adapts to the situation while the rest of the structure stays fixed.
// emails/CardUpdateRequest.tsx
import { Body, Container, Heading, Text, Button, Html } from "@react-email/components";
export default function CardUpdateRequest({
firstName,
amountFormatted,
updateLink,
empathyParagraph, // generated by Gemini for the specific context
}: { firstName: string; amountFormatted: string; updateLink: string; empathyParagraph: string }) {
return (
<Html>
<Body>
<Container>
<Heading as="h2">Hi {firstName}, could you take a moment to update your card?</Heading>
<Text>{empathyParagraph}</Text>
<Text>The amount we tried to charge was {amountFormatted}. The button below opens a secure form where you can add a new card.</Text>
<Button href={updateLink}>Update card</Button>
<Text>If you've already taken care of this, you can ignore this email.</Text>
</Container>
</Body>
</Html>
);
}
The trick is keeping the AI's surface area small. Asking it to write the entire email creates inconsistency; asking it for nothing makes the message feel automated, and it goes unread. A single paragraph it can warm up, based on plan tier, segment, and time since the last successful payment, is the sweet spot.
For the broader emailing pattern (delivery retries, ordering, sandbox routing), see Building a React Email Pipeline with Antigravity and Resend. Sharing template infrastructure between dunning and routine product emails dramatically lowers operational load.
Step 6: Slack Escalation, Designed for Speed
The agent must hand off cleanly when it isn't sure. The Slack message is the contract for that handoff, and it has to give the on-call person enough to decide in three seconds.
// lib/dunning/slack.ts
export function buildEscalationBlocks(ctx: DunningContext, reason: string) {
return [
{
type: "header",
text: { type: "plain_text", text: ":rotating_light: Dunning escalation" },
},
{
type: "section",
fields: [
{ type: "mrkdwn", text: `*Customer*\n<https://dashboard.stripe.com/customers/${ctx.customerId}|${ctx.customerEmail}>` },
{ type: "mrkdwn", text: `*Plan*\n${ctx.planTier} (${ctx.isAnnual ? "annual" : "monthly"})` },
{ type: "mrkdwn", text: `*Amount*\n${formatAmount(ctx.amountDue, ctx.currency)}` }, // formatAmount (also used by the dispatcher) avoids dividing zero-decimal currencies such as JPY by 100
{ type: "mrkdwn", text: `*Reason*\n${ctx.failureReason}` },
{ type: "mrkdwn", text: `*Attempt*\n${ctx.attemptCount}` },
{ type: "mrkdwn", text: `*LTV (JPY)*\n${ctx.totalLifetimeValueJpy.toLocaleString()}` },
],
},
{
type: "section",
text: { type: "mrkdwn", text: `*Why escalated*\n${reason}` },
},
{
type: "actions",
elements: [
{ type: "button", text: { type: "plain_text", text: "Send winback offer" }, action_id: "dunning_winback" },
{ type: "button", text: { type: "plain_text", text: "Mark as lost" }, action_id: "dunning_mark_lost", style: "danger" },
],
},
];
}
I've redesigned this Slack block twice because my early version "just notified that something failed." That's useless: the on-call ends up opening the Stripe dashboard anyway. Including the LTV, plan, reason, and attempt count in the same block is what made it actually decision-ready.
Common Pitfalls I Have Hit Personally
A few I rewrote my way out of:
1. Doubling up Stripe Smart Retries with your own retry loop
Stripe runs Smart Retries (up to four attempts) by default. If you don't realize that and add an app-side cron retry on top, you can hit the customer's card eight times for one invoice. Always check Settings -> Billing -> Subscriptions -> Retries first, and reserve any app-side retry for the moment Stripe gives up.
2. Listening to invoice.payment_failed only
There is more than one failure shape. Don't merge them.
- invoice.payment_failed: automatic collection failed
- invoice.payment_action_required: customer needs to complete 3DS
- customer.subscription.updated (status: past_due to unpaid): all retries exhausted
- customer.subscription.deleted: fully canceled
3DS-pending customers should not get "please update your card." Different reason, different copy. Branch them at the webhook layer.
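Branching at the webhook layer can be as simple as mapping each event shape to its own track before anything reaches the agent. This is a sketch under assumptions: `DunningTrack`, `BillingEvent`, and `classifyBillingEvent` are hypothetical names, and the event is reduced to three fields. On real subscription update events, the prior status would come from the event's previous attributes.

```typescript
type DunningTrack = "payment_failed" | "needs_3ds" | "retries_exhausted" | "canceled" | "ignore";

// Minimal event shape for this sketch; real Stripe events carry far more.
type BillingEvent = {
  type: string;
  status?: string;         // current subscription status, for subscription events
  previousStatus?: string; // prior status, when the event reports a status change
};

// Maps each failure shape to its own track so copy and timing never get merged.
function classifyBillingEvent(e: BillingEvent): DunningTrack {
  switch (e.type) {
    case "invoice.payment_failed":
      return "payment_failed";
    case "invoice.payment_action_required":
      return "needs_3ds";
    case "customer.subscription.updated":
      // Only the past_due -> unpaid transition signals that retries ran out.
      return e.previousStatus === "past_due" && e.status === "unpaid" ? "retries_exhausted" : "ignore";
    case "customer.subscription.deleted":
      return "canceled";
    default:
      return "ignore";
  }
}
```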
3. Sending live emails from the test environment
Inspect STRIPE_SECRET_KEY for the sk_test_ vs sk_live_ prefix and route Resend differently. A safe default is to check process.env.STRIPE_SECRET_KEY?.startsWith("sk_test_") and, when it matches, reroute every message to an internal-only inbox. I learned this the hard way after a single test event leaked to a real customer.
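A slightly stricter variant of that check is to send real mail only when the key is positively identified as live, so a missing or unexpected key also falls back to the sandbox. The sketch below assumes a hypothetical `resolveRecipient` helper and a sandbox inbox address you supply yourself.

```typescript
// Returns the address an email should actually go to.
// Anything that is not clearly a live secret key is routed to the sandbox inbox.
function resolveRecipient(
  stripeSecretKey: string | undefined,
  customerEmail: string,
  sandboxInbox: string,
): string {
  const isLive = stripeSecretKey?.startsWith("sk_live_") === true;
  return isLive ? customerEmail : sandboxInbox;
}
```

If you use restricted keys, which carry a different prefix, they would need their own branch here; as written they route to the sandbox.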
4. The "reason: unknown" swamp
Run SELECT failure_reason, count(*) FROM dunning_events GROUP BY 1 weekly. The unknown bucket grows quietly as Stripe adds new decline codes. Once it crosses 10% of total failures, it's time to extend mapDeclineCode.
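Once the weekly query returns its counts, the 10% rule is a one-line check. A minimal sketch, assuming the query result has been loaded into a plain object keyed by reason; `ReasonCounts`, `unknownShare`, and `shouldExtendMapping` are hypothetical names.

```typescript
// Counts per failure reason, e.g. the result of the weekly GROUP BY query.
type ReasonCounts = Record<string, number>;

// Fraction of all failures that landed in the "unknown" bucket.
function unknownShare(counts: ReasonCounts): number {
  const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
  return total === 0 ? 0 : (counts["unknown"] ?? 0) / total;
}

// True once the unknown bucket crosses the threshold (10% by default).
function shouldExtendMapping(counts: ReasonCounts, threshold = 0.1): boolean {
  return unknownShare(counts) > threshold;
}
```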
5. Coupon overuse
Firing a winback coupon on every failure trains your most willing-to-pay customers to wait for discounts. Gate offer_winback_discount on the combination of segment, LTV, and last payment date. After I added customerSegment === "loyal" && totalLifetimeValueJpy > 30000, my monthly coupon issuance dropped to one-fifth without hurting recovery rate.
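The gate itself fits in a small pure function. This is a sketch, not my exact production check: the segment and LTV conditions come from the rule above, while the recency cutoff (`maxDaysSincePayment`, defaulting to 60 days) is an assumed value you would tune. `WinbackInput` and `isWinbackEligible` are hypothetical names.

```typescript
type WinbackInput = {
  customerSegment: "trial" | "new" | "loyal" | "winback";
  totalLifetimeValueJpy: number;
  lastSuccessfulPaymentAt: number | null; // unix seconds
};

const SECONDS_PER_DAY = 86_400;

// All three conditions must hold: loyal segment, LTV above 30,000 JPY,
// and a successful payment within the recency window (assumed: 60 days).
function isWinbackEligible(c: WinbackInput, nowSeconds: number, maxDaysSincePayment = 60): boolean {
  if (c.customerSegment !== "loyal") return false;
  if (c.totalLifetimeValueJpy <= 30000) return false;
  if (c.lastSuccessfulPaymentAt === null) return false;
  const daysSince = (nowSeconds - c.lastSuccessfulPaymentAt) / SECONDS_PER_DAY;
  return daysSince <= maxDaysSincePayment;
}
```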
Observability I Always Set Up
Five charts I keep on permanent display in Looker Studio:
- Failure events per day, broken down by reason
- Recovery rate (payment_failed -> payment_succeeded within 14 days)
- Click-through rate from email to card update form
- Slack escalations and what fraction got resolved within 24h
- Share of unknown failure reasons
Recovery rate is the one metric I check first. The agent rollout in my own product moved this from 38% to 61% in the first month. When I did the math, that recovery delta was as valuable as a sizable chunk of new MRR — without any acquisition spend.
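For reference, this is roughly how the recovery-rate definition translates into code. It is a sketch that assumes events have been flattened to one row per invoice outcome; `PaymentEvent` and `recoveryRate` are hypothetical names, and the window is measured from each invoice's first failure.

```typescript
type PaymentEvent = { invoiceId: string; kind: "failed" | "succeeded"; at: number }; // at = unix seconds

const WINDOW_SECONDS = 14 * 24 * 60 * 60;

// Share of failed invoices that saw a successful payment within 14 days of their first failure.
function recoveryRate(events: PaymentEvent[]): number {
  // Earliest failure timestamp per invoice.
  const firstFailure = new Map<string, number>();
  for (const e of events) {
    if (e.kind !== "failed") continue;
    const prev = firstFailure.get(e.invoiceId);
    if (prev === undefined || e.at < prev) firstFailure.set(e.invoiceId, e.at);
  }
  if (firstFailure.size === 0) return 0;

  let recovered = 0;
  firstFailure.forEach((failedAt, invoiceId) => {
    const paidInWindow = events.some(
      (e) =>
        e.invoiceId === invoiceId &&
        e.kind === "succeeded" &&
        e.at >= failedAt &&
        e.at - failedAt <= WINDOW_SECONDS,
    );
    if (paidInWindow) recovered += 1;
  });
  return recovered / firstFailure.size;
}
```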
Wiring the Tool Dispatcher
The tool dispatcher is the place where agent intent becomes side effects. Keep it small and well-tested, because if any of these branches are wrong you'll discover it through customers, not unit tests.
// lib/dunning/dispatch.ts
type Args = Record<string, unknown>;
export async function dispatchTool(name: string, args: Args, ctx: DunningContext, env: Env) {
switch (name) {
case "send_email": {
const { template_id, customer_email, variables } = args as {
template_id: string;
customer_email: string;
variables?: Record<string, unknown>;
};
// Hard-validate one more time at the dispatch boundary
if (!customer_email || customer_email !== ctx.customerEmail) {
throw new Error(`email mismatch: agent=${customer_email} ctx=${ctx.customerEmail}`);
}
const empathy = await composeEmpathyParagraph(ctx, env);
return await sendResendEmail(env, template_id, customer_email, {
...variables, // agent-supplied values go first so the trusted fields below always win
firstName: ctx.customerSegment === "trial" ? "there" : (variables?.firstName ?? ""),
amountFormatted: formatAmount(ctx.amountDue, ctx.currency),
updateLink: await mintCardUpdateLink(ctx.customerId, env),
empathyParagraph: empathy,
});
}
case "post_slack":
return await postSlackBlocks(env, (args as any).channel, (args as any).blocks);
case "update_customer_state":
return await env.BILLING_KV.put(
`cust:${(args as any).customer_id}`,
JSON.stringify({ state: (args as any).state, note: (args as any).note, updatedAt: Date.now() })
);
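case "request_stripe_smart_retry":
// Note: this only records the requested schedule in invoice metadata. Metadata does not
// change Stripe's retry timing by itself, so something else has to read and act on it.
return await stripe.invoices.update((args as any).invoice_id, {
metadata: { dunning_reschedule: String((args as any).schedule) },
});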
case "offer_winback_discount":
return await issueWinbackCoupon(ctx, args as any, env);
case "escalate_to_human":
return await escalate(ctx, (args as any).reason, env);
default:
throw new Error(`unknown tool: ${name}`);
}
}
The double-validation on customer_email exists because I want to fail loudly if the agent ever fabricates a different address. It hasn't happened in production, but the check costs nothing and the alternative, a misdirected email, costs trust.
Expected behavior: any tool the agent picks resolves to a single, named, traceable side effect, and unknown tool names throw an error instead of silently passing.
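The dispatcher relies on a `formatAmount` helper that isn't shown above. One possible shape for it, offered as a sketch rather than my exact implementation, uses `Intl.NumberFormat` to find out how many fraction digits a currency has, so yen amounts are not divided by 100. The `locale` parameter and its default are assumptions.

```typescript
// Formats an amount given in the currency's smallest unit (cents for USD, yen for JPY).
function formatAmount(amountMinor: number, currency: string, locale = "en-US"): string {
  const formatter = new Intl.NumberFormat(locale, {
    style: "currency",
    currency: currency.toUpperCase(),
  });
  // Intl reports the currency's fraction digits (0 for JPY, 2 for USD).
  const digits = formatter.resolvedOptions().maximumFractionDigits ?? 2;
  return formatter.format(amountMinor / Math.pow(10, digits));
}
```

Treat the digit lookup as a starting point: a payment provider's own list of zero-decimal currencies may differ from what `Intl` reports for a few currencies, so it is worth checking the ones you actually bill in.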
Composing the Empathy Paragraph
Empathy paragraphs are the one slot of natural language I let the agent generate per email. Keeping the prompt narrow keeps the output reliable.
// lib/dunning/empathy.ts
import { GoogleGenAI } from "@google/genai";
export async function composeEmpathyParagraph(ctx: DunningContext, env: Env) {
const ai = new GoogleGenAI({ apiKey: env.GEMINI_API_KEY });
const segmentTone =
ctx.customerSegment === "loyal" ? "thank them for the years they've been with us" :
ctx.customerSegment === "trial" ? "lower the pressure; this is their first billing experience" :
"stay friendly and concise";
const reasonHint =
ctx.failureReason === "card_expired" ? "Mention that cards expire and it happens to everyone." :
ctx.failureReason === "insufficient_funds" ? "Avoid any phrasing that sounds judgmental about funds." :
ctx.failureReason === "authentication_required" ? "Explain that the bank needs an extra confirmation step." :
"Keep the cause vague; we don't have certainty about what happened.";
const res = await ai.models.generateContent({
model: "gemini-3-flash",
contents: [{ role: "user", parts: [{ text:
`Write one short paragraph (40-70 words, English, second person) for an email about a failed charge. ` +
`Tone: ${segmentTone}. Hint: ${reasonHint}. Avoid technical Stripe terminology. ` +
`End with a soft action prompt to update payment.`
}]}],
config: { temperature: 0.6 },
});
return res.text?.trim() ?? "We had trouble processing your latest payment. Could you take a quick moment to update your card details?";
}
I use gemini-3-flash here intentionally: cheaper, faster, and the output quality is more than enough for a 60-word paragraph. The slow model belongs in the orchestration step, not in copywriting.
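Because this paragraph is the only free-form text in the email, a cheap guard before it reaches the template is worth having. The sketch below is an assumption about how such a guard could look: `guardEmpathyParagraph` is a hypothetical helper, the word bounds (30 to 90) are deliberately looser than the 40-70 words requested in the prompt, and the fallback is the static sentence already used above.

```typescript
const FALLBACK_PARAGRAPH =
  "We had trouble processing your latest payment. Could you take a quick moment to update your card details?";

// Accepts the generated paragraph only if it is a plausible length and contains no links;
// otherwise returns the static fallback.
function guardEmpathyParagraph(raw: string | undefined, minWords = 30, maxWords = 90): string {
  const text = (raw ?? "").replace(/\s+/g, " ").trim();
  const wordCount = text === "" ? 0 : text.split(" ").length;
  const hasLink = /https?:\/\/|www\./i.test(text);
  if (wordCount < minWords || wordCount > maxWords || hasLink) return FALLBACK_PARAGRAPH;
  return text;
}
```

Blocking links matters more than the length check: the only URL in the email should be the update link minted by the dispatcher.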
Migrating from a Manual Process
If you're starting from a fully manual flow, here's the migration order I'd recommend, distilled from my own missteps:
- Week 1: deploy only the webhook handler and event normalization. Log everything, do nothing else. Two purposes: confirm your idempotency works under real Stripe traffic, and start the data series for recovery-rate baselining.
- Week 2: introduce the agent in dry-run mode. Have it produce its decision and the would-be tool call, but route every action to Slack instead of executing. Watch for at least 50 events. Reject the rollout if more than 5% of decisions look wrong to you.
- Week 3: enable email and KV updates only. Hold winback coupons and human escalations on Slack approval. This is where you'll find your unknown rate, your throttle hits, and the edge cases your prompt missed.
- Week 4: turn on the rest, including the winback coupon arm. Add a kill switch (a single KV key the handler reads) so you can pause the agent in five seconds if something looks wrong.
Don't compress this. Fast failure recovery feels like the kind of code you'd ship in a weekend, but the cost of a misfire is real human distrust. I lost a few weeks' worth of MRR to mistakes during my own initial rollout because I skipped the Week 3 stage.
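The dry-run stage and the kill switch can share one small gate in front of the dispatcher. This is a sketch under assumptions: `RolloutMode`, `PlannedCall`, `Effects`, and `applyDecision` are hypothetical names, and the mode would be read from the single KV key mentioned in the Week 4 step.

```typescript
type RolloutMode = "paused" | "dry_run" | "live";
type PlannedCall = { name: string; args: Record<string, unknown> };

// Side effects are injected so the gate itself stays trivially testable.
type Effects = {
  execute: (call: PlannedCall) => void;  // e.g. the real dispatcher
  preview: (message: string) => void;    // e.g. a Slack post
};

// paused: do nothing. dry_run: describe the would-be call. live: execute it.
function applyDecision(mode: RolloutMode, call: PlannedCall, fx: Effects): "skipped" | "previewed" | "executed" {
  if (mode === "paused") return "skipped";
  if (mode === "dry_run") {
    fx.preview(`[dunning dry-run] would call ${call.name} with ${JSON.stringify(call.args)}`);
    return "previewed";
  }
  fx.execute(call);
  return "executed";
}
```

Flipping the stored mode from "live" to "paused" is then the whole kill switch, with no deploy involved.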
What Actually Changed in My Numbers
Numbers help, so let me share what shifted in the product I run, before and after this pipeline ran for two months.
- Recovery rate (failed → succeeded within 14 days) moved from 38% to 61%, then settled at around 58% after the novelty of the email faded.
- Average time-to-recovery dropped from 6.4 days to 1.9 days, mostly because card-update emails now go out within minutes instead of whenever I opened the dashboard.
- Customer support tickets containing the word "billing" dropped by roughly half. Most of the long thread cleanups I used to handle by hand simply stopped happening because the agent caught the issue first.
- Coupon issuance dropped to one-fifth after I gated winback offers on segment + LTV. Recovery rate on those eligible cohorts stayed flat, which told me the constraint was healthy.
These aren't huge SaaS numbers and they aren't meant to impress. They're mid-product, mid-budget numbers from one indie developer's account. The point is that the levers exist even at small scale, and the agent pattern compresses the work to the point where one person can run the recovery surface that small teams used to staff.
A side benefit I didn't expect: writing the agent prompt forced me to articulate my own customer policy in plain language. I had never written down "we don't pressure trial users" or "loyal customers get a winback offer up to once a quarter." The agent prompt became a living policy document that my future self can edit in five minutes.
Testing Strategy
A few patterns I now consider non-negotiable:
- Replay tests: keep a folder of real Stripe webhook payloads (with PII redacted) and replay them through your handler in CI. Easier to maintain than synthetic fixtures, and far more honest.
- Dispatcher snapshots: for each tool branch, snapshot the side-effect arguments. Diffing the snapshot is faster than reading through assertions.
- Stub the agent: in unit tests, replace the agent call with a fixed function-call response. The agent itself has its own evaluation loop separate from the pipeline tests.
- End-to-end smoke: every deploy, fire one synthetic invoice.payment_failed against a sentinel customer and confirm the test inbox receives the right template within 60 seconds. This single guardrail caught a regression for me when I changed Resend SDK versions.
For broader testing strategy on agent-driven pipelines, I keep coming back to the patterns from Antigravity AgentKit 2.0 Unit Testing with Vitest Guide — most of those techniques apply almost unchanged here.
Beyond Dunning: One Agent Pattern, Many Lifecycles
Once the dunning agent works, the same shape generalizes:
- Trial-end reminders: trigger on customer.subscription.trial_will_end, classify usage as "engaged" or "lapsed," and let the agent write the right nudge.
- Upgrade prompts: when usage approaches a tier ceiling, draft the upgrade pitch contextually.
- Cancellation winback: catch customer.subscription.deleted and queue a tasteful winback the next day, informed by what the customer actually used.
I'm building the next layer right now: copying dunning-orchestrator.md into lifecycle-orchestrator.md and growing it into a single agent system that owns the entire SaaS revenue cycle. The leverage of one well-shaped agent is much higher than I expected when I started.
For metering on the other side of billing, I covered it in Antigravity AI Agents and Stripe Meter Events for Usage-Based Billing. And for revenue strategy across the funnel, Antigravity Subscription Revenue Optimization (Advanced) pairs nicely with this pipeline.
One Thing to Do Today
If you read this far, here's the single move I'd recommend:
Open Stripe's Reports -> Revenue -> Failed payments and look at the past 30 days. Note the count and the top three reasons.
That number tells you whether dunning is worth automating yet for your product. A handful of failures per month is fine to handle by hand. Once it crosses a few dozen, the investment described in this article starts paying back fast. That's exactly the scale at which I started writing my own — the day after I first opened that report.