Zero-Downtime Database Migrations with Antigravity: The Expand-Contract Pattern in Production

"I shipped an ALTER TABLE to production. Five minutes of downtime." That was me, on a SaaS I run myself. I'd picked a quiet hour, but a long-running query happened to be holding a row, and the migration sat there waiting until the entire table was effectively frozen for everyone else.

Lucky for me, the user count was small enough that a Slack apology covered it. Less lucky was the lesson I had to internalize: production databases punish the assumption that "small change" and "small risk" are the same thing. This guide walks through the discipline I now use—the Expand-Contract pattern, paired with Antigravity's AI assistance—to ship the kind of schema changes that used to keep me up at night.

By the end, you'll be able to run an ID-type swap from BIGINT to UUID, peel an email column into a separate table, or split a bloated users table into users plus user_profiles, all without a single dropped request.

Why Plain ALTER TABLE Quietly Breaks Production

PostgreSQL's ALTER TABLE doesn't always look dangerous, but several operations grab an ACCESS EXCLUSIVE lock on the entire table. While that lock is held, even SELECT queries pile up in the wait queue. On a million-row table, a column-type rewrite can hold that lock for tens of seconds to minutes—long enough for connection pools to saturate and your app to topple over from cascading timeouts.

That rhythm is the Expand-Contract pattern (sometimes called Parallel Change). It splits a breaking change into three phases that each leave the system in a working state.

The Three Phases at a Glance

Phase 1 — Expand: add the new schema alongside the old. The application dual-writes to both. Nothing breaks because nothing is removed.
Phase 2 — Migrate: backfill existing data into the new schema. Then gradually shift reads from old to new behind a feature flag.
Phase 3 — Contract: stop writing to the old schema, rename to make accidental references obvious, and only then drop the old columns or tables.

The boundary between phases is where you ship and observe. If anything goes wrong, the previous phase is still working—you can pause indefinitely instead of rolling back under fire.

Phase 1: Expand — Designing Schemas That Coexist

Let's walk through changing users.id from BIGINT to UUID. The first migration introduces the new column without touching the old one.

-- migration_001_expand.sql
-- Phase 1: add the new uuid column. No NOT NULL yet—we backfill later.
ALTER TABLE users
  ADD COLUMN id_new UUID DEFAULT gen_random_uuid();
 
-- Backfill is fine inline only on small tables.
-- Large tables go through Phase 2's chunked backfill instead.
UPDATE users SET id_new = gen_random_uuid() WHERE id_new IS NULL;
 
-- Add a unique index without blocking writers.
CREATE UNIQUE INDEX CONCURRENTLY users_id_new_unique
  ON users(id_new);

The unsung hero here is CREATE INDEX CONCURRENTLY. A regular CREATE INDEX locks the entire table; CONCURRENTLY builds the index without blocking writes. It takes longer in wall-clock time, which is exactly the trade you want in production: spend hours of build time to keep the service available.

Next comes the application-side dual write. Open Antigravity's Inline Edit (Cmd+I) on your createUser function and prompt: "Update this function to write both id and id_new. Don't change the public API shape." You'll get something like this back:

// app/server/users/create.ts
import { randomUUID } from "node:crypto";
import { db } from "@/lib/db";
 
interface CreateUserInput {
  email: string;
  name: string;
}
 
export async function createUser(input: CreateUserInput) {
  // Phase 1: write both the legacy serial id and the new UUID.
  const newUuid = randomUUID();
 
  const [user] = await db
    .insertInto("users")
    .values({
      email: input.email,
      name: input.name,
      id_new: newUuid,
      // `id` keeps auto-incrementing; we don't override it.
    })
    .returning(["id", "id_new", "email", "name"])
    .execute();
 
  return user;
}

Once this ships, every new signup creates a row with both columns populated and consistent. Existing rows still need their id_new filled in, which is what Phase 2 is for.

Phase 2: Migrate — Backfilling Without a Maintenance Window

Running UPDATE users SET id_new = gen_random_uuid() on a million-row table is the same trap as ALTER TABLE in disguise: a single statement, a long lock, an angry pager. The fix is chunked backfill that processes a small slice at a time.

// scripts/backfill_user_uuid.ts
import { db } from "@/lib/db";
 
const CHUNK_SIZE = 1000;
const SLEEP_MS = 200;
 
async function backfillUuid() {
  let lastId = 0;
  let processed = 0;
 
  while (true) {
    const updated = await db
      .updateTable("users")
      .set((eb) => ({ id_new: eb.fn("gen_random_uuid") }))
      .where("id", ">", lastId)
      .where("id_new", "is", null)
      .where("id", "in",
        db.selectFrom("users")
          .select("id")
          .where("id", ">", lastId)
          .where("id_new", "is", null)
          .orderBy("id", "asc")
          .limit(CHUNK_SIZE)
      )
      .returning(["id"])
      .execute();
 
    if (updated.length === 0) break;
 
    lastId = Math.max(...updated.map((u) => Number(u.id)));
    processed += updated.length;
 
    console.log(`Processed: ${processed}, lastId: ${lastId}`);
 
    // Deliberate pause so production traffic still gets airtime.
    await new Promise((r) => setTimeout(r, SLEEP_MS));
  }
 
  console.log(`Backfill complete. Total processed: ${processed}`);
}
 
backfillUuid().catch((err) => {
  console.error("Backfill failed:", err);
  process.exit(1);
});

Three details earn their keep here.

First, the script uses lastId as a cursor, so it's resumable. If the process dies from OOM or a transient network fault, you start it again with the last printed lastId and continue. No duplicate work, no skipped rows.

Second, the deliberate SLEEP_MS between batches is a feature, not a bug. Running the backfill at full tilt would saturate the database and starve user requests. Inserting a 200 ms breath gives the system room to handle live traffic. Slower backfill, healthier service—usually the right trade.

Third, the chatty progress logs are operational sanity. Long-running batch jobs are stressful when you can't tell whether they're working or stuck. Print often. Print numbers you can grep for.

This is also where I lean hard on Antigravity's Plan Mode. I'll paste the script in and ask: "Review this backfill for production safety. List the top three risks and how to mitigate each." The model reliably surfaces lock-contention risks, edge cases around gen_random_uuid() collisions (vanishingly rare but worth a unique constraint), and deadlock retry strategy. I run this review every time before pressing go.

Phase 2 Continued: Shifting Reads with Feature Flags

Once the backfill finishes and id_new is populated for every row, you start moving reads to the new column. Feature flags are how you do this without committing.

// app/server/users/get.ts
import { db } from "@/lib/db";
import { isFeatureEnabled } from "@/lib/feature-flags";
 
export async function getUserById(idOrUuid: string | number) {
  const useUuid = await isFeatureEnabled("users.read.use_uuid_id");
 
  if (useUuid && typeof idOrUuid === "string") {
    return db
      .selectFrom("users")
      .selectAll()
      .where("id_new", "=", idOrUuid)
      .executeTakeFirst();
  }
 
  // Legacy path (numeric id).
  const numericId = typeof idOrUuid === "string" ? Number(idOrUuid) : idOrUuid;
  return db
    .selectFrom("users")
    .selectAll()
    .where("id", "=", numericId)
    .executeTakeFirst();
}

Roll the flag from 0% to 1%, 10%, 50%, 100% on whatever cadence your monitoring lets you trust. I personally hold each step for 24 hours and watch error rate plus p95 latency. If anything moves, I freeze the rollout and investigate before going further. The flag is a brake, and the goal is to never need it.

If you don't have a feature-flag system yet, the companion piece Building Feature Flag Driven Development with Antigravity covers a working baseline you can drop into a Cloudflare Worker app in an afternoon.

Phase 3: Contract — Removing the Old Schema, Carefully

After reads have been at 100% on the new column for a day or two with no anomalies, it's Contract time. You still don't go straight to DROP COLUMN. The pattern is rename first, drop later.

-- migration_002_contract_step1.sql
-- Step 1: prep for stopping writes to the legacy column.
-- Merge the application-side PR that removes explicit writes to `id` first.
 
-- Loosen the legacy column so a stray write doesn't 500.
ALTER TABLE users ALTER COLUMN id DROP NOT NULL;

Pause here for a week of production observation. Use that time to grep your logs for any reference to the old column. The goal is verifiable silence: nothing in your codebase or background jobs is touching id anymore.

-- migration_003_contract_step2.sql
-- Step 2: rename so any forgotten reference fails loudly.
ALTER TABLE users RENAME COLUMN id TO id_old_deprecated;
ALTER TABLE users RENAME COLUMN id_new TO id;

Renames are metadata-only operations and complete instantly. The new column is now the canonical id, and the old one wears a name that makes it obvious it's not a load-bearing piece anymore. If any forgotten code path still touches id_old_deprecated, it errors immediately—a cheap, fast signal you'd rather get now than three weeks from now.

-- migration_004_contract_step3.sql
-- Step 3: drop after one to two more weeks of zero references.
ALTER TABLE users DROP COLUMN id_old_deprecated;

DROP COLUMN itself is metadata—the row data gets reclaimed lazily by VACUUM. The lock is brief.

Three Traps I Personally Stepped In

Theory is comfortable, practice is humbling. Here are three traps from my own production migrations.

Trap 1: The Trigger Temptation

When dual-writing in Phase 1, it's tempting to skip the application changes and use a database trigger:

-- Looks elegant. Causes pain in production.
CREATE TRIGGER sync_id_new
  BEFORE INSERT ON users
  FOR EACH ROW
  EXECUTE FUNCTION generate_id_new();

Triggers do guarantee synchronization, but they make debugging miserable. Behavior happens inside the database that doesn't show up in your application logs. When something looks wrong, you're now debugging a system you can't trace through your normal observability stack. Explicit application-layer dual-writes show up in traces, surface in logs, and stay debuggable. Pay the small cost of writing the application change.

Trap 2: Long-Running Transactions Sabotage CONCURRENTLY

CREATE INDEX CONCURRENTLY waits for any transaction that started before it. If a long-running batch job, a forgotten psql session, or a reporting query is holding a transaction open, your "non-blocking" index build will block on it.

-- Run this before any concurrent index build.
SELECT pid, now() - xact_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND xact_start IS NOT NULL
  ORDER BY duration DESC
  LIMIT 10;

If anything looks suspicious, track it down before starting. I've watched teammates wonder why CREATE INDEX CONCURRENTLY was "hanging" for two hours when the answer was a forgotten BI dashboard query somewhere.

Trap 3: ORM Schema Cache After Renames

Most ORMs—Prisma, Kysely, Drizzle—cache schema metadata at boot. After Phase 3's rename, if your application starts up against a stale schema, you'll see a stream of "column does not exist" errors that look terrifying but are really just a regeneration step you skipped.

The fix is to bake the regeneration into your deploy. For Prisma, that means prisma db pull && prisma generate in the build step, with the regenerated client committed or copied into the deploy artifact. For Kysely, run kysely-codegen against your CI database. For Drizzle, run drizzle-kit pull.

I once burned 30 minutes of degraded service to this exact issue. Now I keep an Antigravity Custom Command called migration-postcheck that runs the regeneration plus a smoke-test query on every deploy that touches a migration.

Putting It Together: My Antigravity Workflow

Here's the actual sequence I run when I'm about to ship a Phase 1 migration.

I open Antigravity in Plan Mode and prompt: "Build me an Expand-Contract migration plan for changing users.id from BIGINT to UUID. Include the SQL for each phase, the application code changes, and a rollback plan keyed to each phase." Plan Mode is much better than Fast Mode here because it carries context across phases—dependencies don't get dropped between turns.

I take each generated SQL block and ask: "Evaluate the lock impact of this statement on a 1M-row table. Suggest a non-locking alternative if there is one." Antigravity's grasp of PostgreSQL lock behavior is solid enough that it consistently catches missing CONCURRENTLY keywords and ACCESS EXCLUSIVE operations I'd have shipped without thinking.

For the deploy itself, my project's AGENTS.md includes a checklist that the Inline Edit pass walks through before any DB-touching PR merges:

Confirm no long-running transactions in pg_stat_activity.
Confirm a backup completed within the last 24 hours.
Note current and peak QPS on the affected table.
Stage rollback SQL and dry-run it on staging.
Verify the feature flag default is OFF and the rollout schedule is documented.

The point isn't the specific list. It's that the discipline lives in a file the AI can enforce, not in human memory.

Where to Start Tomorrow

The Expand-Contract pattern is conceptually simple but rewards the implementation details. The honest first step is to pick one breaking schema change you've been postponing because it scared you—and write it down. Then split it into Expand, Migrate, and Contract phases on paper. For each phase, sketch what changes in the application, what SQL goes to the database, and what the rollback looks like.

Before you write a single line of real code, drop the plan into Antigravity's Plan Mode and ask: "Review this migration plan. Name three blind spots." The third one usually surprises you.

The shape of production work isn't fearlessness. It's having a method that lets fear become useful information instead of an excuse to procrastinate. Expand-Contract is one of the most reliable methods I've found, and pairing it with Antigravity has made it noticeably easier to ship the changes I would have skipped a year ago.

Beyond ID Swaps: Two More Real-World Examples

The same three-phase rhythm applies to almost every breaking change you can imagine. Two examples that come up often in solo-developer SaaS work.

Renaming a Column Without Downtime

Renaming users.email_address to users.email looks trivial. The naive ALTER TABLE users RENAME COLUMN email_address TO email is metadata-only and fast, so what could go wrong? The application. The moment you run that rename, every running instance with the old schema cached starts erroring. Even if your ORM picks up the new name on the next deploy, you have a window where some pods see the old name and some see the new one.

Expand-Contract solves this:

Expand: add email as a new column. Application reads from email_address if present and falls back to email; writes go to both.
Migrate: backfill email from email_address. Flip the flag so reads come from email.
Contract: stop writing to email_address. Rename email_address to email_address_deprecated. Drop after the observation window.

It's slower, but the application never sees a column that disappeared mid-request.

Splitting a Table

A common growth pattern: your users table accumulates profile fields—bio, avatar URL, social links, preferred language—until it becomes the dumping ground for everything user-related. You want to split it into users (auth-relevant data) and user_profiles (everything else).

Phase 1 expand: create user_profiles with the new columns plus a foreign key to users.id. Application writes to both tables on signup and profile update. Reads still come from users because the data hasn't been migrated yet.

Phase 2 migrate: chunked backfill from users into user_profiles for every existing user. Once complete, switch profile reads to user_profiles behind a flag, gradually.

Phase 3 contract: stop writing the profile fields to users. Rename them with a _deprecated suffix. Drop after observation.

What makes this work is that during Phase 2, the join users JOIN user_profiles returns the same data the old single-table query did, so feature parity is maintained throughout. Antigravity's context-aware refactoring is genuinely useful here: ask it to "find all places that read profile fields from users and update them to join user_profiles behind the user_profiles_split flag," and it produces the kind of mechanical change that would take an afternoon by hand.

How This Compares to Online Schema Change Tools

If you've worked with MySQL at larger scales, you've probably encountered tools like pt-online-schema-change (Percona) or gh-ost (GitHub's online schema migration tool). These tools automate a similar Expand-Contract dance under the hood: they create a shadow table, copy data over with triggers keeping it in sync, and then atomically rename.

PostgreSQL doesn't have a direct equivalent, but tools like pg_repack and pgroll (Xata's open-source migration tool) take adjacent approaches. pgroll is particularly interesting—it implements multi-version schemas so old and new code can coexist during a deploy. It's worth evaluating if you do migrations frequently.

Why bother with the manual approach this article describes when tools exist? Two reasons.

First, manual control is easier to reason about when something goes wrong. With pgroll, when a migration misbehaves, you're debugging the tool's view-based abstraction in addition to your data. With manual Expand-Contract, every artifact—the new column, the dual-write, the flag—is something you wrote and can step through.

Second, the manual approach forces you to actually decide about each phase. Tooling can paper over decisions like "how long should I observe before contracting?" Doing it yourself makes those decisions explicit, which over time builds operator intuition you'll need anyway.

I keep pgroll on my radar for projects that hit a tipping point in migration frequency, but for most solo-dev SaaS work, manual Expand-Contract with Antigravity's assistance is the sweet spot.

A Production Checklist You Can Steal

Print this. Tape it next to your monitor. I'm only half-kidding.

Before Phase 1 (Expand):

Migration SQL reviewed by Antigravity Plan Mode for lock impact
All CREATE INDEX statements use CONCURRENTLY
All ALTER TABLE statements assessed for ACCESS EXCLUSIVE locks
Rollback SQL written and dry-run on staging
Application dual-write code reviewed for correctness on both write paths
Backups verified within last 24 hours

Before Phase 2 (Migrate):

Backfill script tested on staging with a similarly-sized data set
Cursor-based resumability verified (kill the script mid-run, restart, verify no gaps)
Sleep interval tuned against staging traffic to confirm acceptable impact
Feature flag system in place and verified working at 0% rollout
Monitoring dashboards updated to track read-path errors and latency on the new column

Before Phase 3 (Contract):

Read flag at 100% for at least 24 hours with no anomalies
Application code grepped for any reference to the old schema name
Background jobs, cron tasks, and reports audited for old-name references
ORM schema regeneration baked into the deploy pipeline
Rename PR merged before the rename SQL runs—not after

After Each Phase:

Deploy a single phase per release. Never combine phases in one deploy.
Wait the observation period (24-48 hours minimum) before starting the next phase
Document any deviation from the plan in a runbook so the next migration starts smarter

When Things Go Sideways: Rollback Patterns

Even with all this preparation, real production has a way of finding new failure modes. Knowing how to roll back at each phase keeps you sane.

Rolling back from Phase 1: This is the easy case. The new column exists but the application still works against the old column. Drop the dual-write code, ship a deploy, and either keep the column for a future attempt or drop it. No data is lost because the new column was additive.

Rolling back during Phase 2: The dangerous middle. Reads might be partially shifted. The right move is to flip the feature flag to 0% (back to the old column for reads), keep the dual-write running, and investigate. The new column's data isn't lost—it's just dormant. You can restart the migration once you understand what happened.

Rolling back from Phase 3: This is the hardest case because the rename is real. If you renamed id to id_old_deprecated and id_new to id, the rollback is to rename them back. The data is intact. The pain is that any deploy in between assumed the new naming, so you may need to revert application deploys too.

I write the rollback SQL at the same time as the forward migration SQL. Two files, side by side. If a migration is too complex to write a clean rollback for, that's a signal the migration is too complex—simplify it before shipping.

A Note on Writing Migrations With AI

I want to say something honest about Antigravity's role in this kind of work. The model is excellent at the mechanical parts: generating the SQL skeletons, suggesting safer variants, catching obvious lock issues. It is not a substitute for understanding what your data layer actually does under load.

I treat Antigravity's output the way I'd treat a junior engineer's pull request: helpful, often correct, occasionally subtly wrong in ways that only show up in production. The Plan Mode review is invaluable for catching the obvious, but I never ship a migration I haven't personally walked through line by line.

What changed for me with Antigravity isn't that I think less about migrations. It's that I write more of them. The activation energy of "design the Expand-Contract plan, write the SQL, write the dual-write code, write the backfill, write the rollback, document everything" used to be high enough that I postponed schema changes I knew I needed. Now the activation energy is low enough that I do them, and the process makes me better at them every time.