
Zero-Downtime DB Migration: The Terror of Schema Changes in Production
Adding a column to a production table locked it for 5 minutes. Learning zero-downtime migration strategies the hard way.

Friday, 5 PM. "Just need to add one column..." I thought. A simple task—adding an email_verified column to the users table. Modified the Prisma schema, ran npx prisma migrate deploy. Then the service froze for five minutes.
Slack exploded with error alerts. "Database connection timeout", "Query timeout", "503 Service Unavailable". I had no idea what happened. Just adding one column, yet the entire table with 1 million rows was locked. While ALTER TABLE executed, every read/write operation hung in waiting state.
That day I learned: schema changes in production are more dangerous than code deployments. Code can be rolled back, but when the DB locks, the entire service stops. Zero-downtime migration wasn't optional—it was essential.
The core issue was table locks. In PostgreSQL, certain DDL operations acquire an ACCESS EXCLUSIVE lock on the entire table. While that lock is held, every other query against the table, reads and writes alike, queues up and waits until the DDL finishes.
My migration looked like this:
ALTER TABLE users
ADD COLUMN email_verified BOOLEAN NOT NULL DEFAULT false;
This single line updated all 1 million rows. While writing the false default value to each row, the table was locked. Like locking the library entrance while reorganizing all the books.
Initially I didn't get it. "Why does adding one column take so long?" But from the DB's perspective, it made sense. It needs to physically write new data to each row, and if other operations interfere, data consistency breaks. So it locks.
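A safeguard I only learned about later, and which I'm adding here rather than something from the original incident: PostgreSQL's lock_timeout. Setting it before a risky DDL statement makes the statement fail fast instead of queueing behind long-running transactions and stalling everything behind it. A minimal sketch:
-- Sketch: abort the DDL if it cannot acquire its lock within 5 seconds,
-- instead of waiting indefinitely and blocking every query queued behind it.
SET lock_timeout = '5s';
ALTER TABLE users
ADD COLUMN email_verified BOOLEAN NOT NULL DEFAULT false;
This does not make the statement itself any cheaper, but it caps how long the table can sit behind a blocked ALTER before you get an error and can retry at a quieter moment.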
The key insight: DB migrations must be gradual. Try to change everything at once, you get locks and service downtime.
The heart of zero-downtime migration is the Expand-Contract pattern. Like crossing a bridge by placing one foot forward before lifting the other, you add the new schema before removing the old one, step by step.
Wrong approach (change all at once):
-- ❌ Dangerous: causes table lock
ALTER TABLE users
ADD COLUMN email_verified BOOLEAN NOT NULL DEFAULT false;
Right approach (four separate steps):
-- Step 1: Add column as nullable (minimal lock)
ALTER TABLE users
ADD COLUMN email_verified BOOLEAN;
-- Step 2: Backfill existing data in batches (application level)
-- PostgreSQL's UPDATE has no LIMIT clause, so batch through a subquery
-- and repeat until it reports 0 rows updated
UPDATE users
SET email_verified = false
WHERE id IN (
    SELECT id FROM users
    WHERE email_verified IS NULL
    LIMIT 1000
);
-- Step 3: Add NOT NULL constraint (after all data filled)
ALTER TABLE users
ALTER COLUMN email_verified SET NOT NULL;
-- Step 4: Set default value (metadata only)
ALTER TABLE users
ALTER COLUMN email_verified SET DEFAULT false;
In the first step, adding as nullable means PostgreSQL only modifies metadata. It doesn't touch actual rows, so locks are minimal. Then backfilling in batches keeps the service running.
The pattern's essence is separating physical changes from logical changes. Like remodeling a building by setting up temporary partitions and working on one side at a time.
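Step 2, the batched backfill, is something I would run from the application rather than as one giant SQL statement. A minimal sketch with Prisma, where the batch size and the loop structure are my own assumptions rather than a prescribed recipe:
import { PrismaClient } from '@prisma/client';
const prisma = new PrismaClient();

// Backfill email_verified in small batches so no single statement
// touches a large number of rows (and their locks) at once.
const batchSize = 1000;

while (true) {
  // $executeRaw returns the number of rows affected
  const updated = await prisma.$executeRaw`
    UPDATE users
    SET email_verified = false
    WHERE id IN (
      SELECT id FROM users
      WHERE email_verified IS NULL
      LIMIT ${batchSize}
    )
  `;
  if (updated === 0) break; // nothing left to backfill
  await new Promise((resolve) => setTimeout(resolve, 100)); // breathe between batches
}
Each batch is its own short transaction, so readers and writers slip in between batches instead of waiting behind one huge UPDATE.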
Creating indexes is also risky. Regular CREATE INDEX scans the entire table while holding a SHARE lock. Reads work but writes are blocked.
Wrong approach:
-- ❌ Dangerous: blocks write operations
CREATE INDEX idx_users_email ON users(email);
Right approach:
-- ✅ Safe: allows concurrent access
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
Using CONCURRENTLY makes index creation take roughly twice as long (it scans the table twice), but the service doesn't stop. The tradeoff is clear: trade time for stability.
Important note: CONCURRENTLY can't be used inside transactions. With ORMs like Prisma, you need raw SQL:
// Creating CONCURRENTLY index in Prisma
await prisma.$executeRawUnsafe(
'CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email ON users(email)'
);
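One caveat worth knowing, from PostgreSQL's documented behavior rather than from this incident: if a CONCURRENTLY build fails partway, it can leave behind an INVALID index that has to be dropped before retrying. A quick check looks like this:
-- Find indexes left invalid by a failed CONCURRENTLY build
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
-- Drop the broken index, then retry the CREATE INDEX CONCURRENTLY
DROP INDEX CONCURRENTLY IF EXISTS idx_users_email;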
Renaming columns seems simple but is dangerous. RENAME COLUMN only changes metadata so it's fast, but the problem is application code. If the column name changes before code deploys, you get immediate errors.
-- Step 1: Add new column
ALTER TABLE users ADD COLUMN full_name TEXT;
-- Step 2: Copy existing data (batched, same subquery pattern as before)
UPDATE users SET full_name = name
WHERE id IN (
    SELECT id FROM users WHERE full_name IS NULL LIMIT 1000
);
// Step 3: Write to both columns in application code
await prisma.user.update({
where: { id },
data: {
name: newName, // Old column
full_name: newName, // New column
},
});
-- Step 4: After deployment, stop reading old column
-- (modify code to read only full_name)
-- Step 5: Stop writing to old column
-- (remove name from code)
-- Step 6: Drop old column
ALTER TABLE users DROP COLUMN name;
This process takes days. No rush. Stability matters more than speed. Like migrating a system by moving cables one by one.
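One detail between steps 3 and 4: rows written after the backfill but before the dual-write deploy may only have the old column filled, so reads during the transition should tolerate that. A small sketch, with the fallback logic being my own assumption rather than a fixed recipe:
// During the transition, prefer the new column and fall back to the old one.
const user = await prisma.user.findUnique({ where: { id } });
const displayName = user?.full_name ?? user?.name;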
Dropping columns follows the same idea, with the destructive step saved for last:
-- Only execute in final step
ALTER TABLE users DROP COLUMN old_column;
DROP COLUMN in PostgreSQL only changes metadata, so it's fast, but it can't be rolled back. So we wait: as long as the column still exists, rolling back the code is enough to resume reading and writing it.
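If the old column's data is worth keeping a recovery path for, one option, a habit of mine rather than anything the tools require, is to snapshot it into a side table right before the drop:
-- Optional safety net: keep the old values before dropping the column
CREATE TABLE users_name_backup AS
SELECT id, name FROM users;
ALTER TABLE users DROP COLUMN name;
-- If the column ever has to come back, re-add it and restore from the backup table.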
My settled workflow with Supabase:
# 1. Write migration locally
supabase migration new add_email_verified
# 2. Edit SQL file (apply zero-downtime strategy)
# supabase/migrations/20260205_add_email_verified.sql
-- Add column as nullable
ALTER TABLE users
ADD COLUMN IF NOT EXISTS email_verified BOOLEAN;
-- Index with CONCURRENTLY
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email_verified
ON users(email_verified)
WHERE email_verified = true;
# 3. Test locally
supabase db reset
# 4. Deploy to staging
supabase db push --db-url $STAGING_DB_URL
# 5. Deploy to production (after a deep breath)
supabase db push --db-url $PRODUCTION_DB_URL
The key is testing in staging first. Clone production data, run the migration, measure how long locks last.
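To actually see that impact, it helps to watch for blocked sessions while the migration runs on staging. The query below is standard PostgreSQL; running it in a second session during the migration is just my own setup:
-- Run in a separate session while the migration executes:
-- any rows returned are queries currently waiting on a lock.
SELECT pid, wait_event_type, wait_event, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';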
Always prepare rollback scripts for when migrations fail:
-- 20260205_add_email_verified_rollback.sql
DROP INDEX CONCURRENTLY IF EXISTS idx_users_email_verified;
ALTER TABLE users DROP COLUMN IF EXISTS email_verified;
Supabase and Prisma don't support automatic rollback. So you prepare manually. I developed a habit of writing the rollback script before executing the migration.
Things to never do in production:
- ALTER TABLE ... ADD COLUMN ... NOT NULL DEFAULT ... (causes locks)
- CREATE INDEX without CONCURRENTLY (blocks writes)
- ALTER TABLE ... RENAME COLUMN (needs code sync)
- DROP COLUMN immediately (can't rollback)

Safe operations:
- ADD COLUMN nullable (metadata only)
- CREATE INDEX CONCURRENTLY (allows concurrent access)

The biggest change from learning zero-downtime migration was my concept of time. I used to think "just change it now", but now I think "let's change it over several days".
Three core principles stuck with me: slow, steady, one step at a time. DB migration is not a sprint, it's mountain climbing. Rush and slip, and everything collapses. Going step by step is ultimately the fastest path.
Now I run through a checklist before every migration. The last item on it seems like a joke, but it's real: no DB migrations on Friday afternoons. Not if you want to keep your weekend intact.