
Zero-Downtime DB Migration: The Terror of Schema Changes in Production
Adding a column to a production table locked it for 5 minutes. Learning zero-downtime migration strategies the hard way.

Friday, 5 PM. "Just need to add one column..." I thought. A simple task—adding an email_verified column to the users table. Modified the Prisma schema, ran npx prisma migrate deploy. Then the service froze for five minutes.
Slack exploded with error alerts. "Database connection timeout", "Query timeout", "503 Service Unavailable". I had no idea what happened. Just adding one column, yet the entire table with 1 million rows was locked. While ALTER TABLE executed, every read/write operation hung in waiting state.
That day I learned: schema changes in production are more dangerous than code deployments. Code can be rolled back, but when the DB locks, the entire service stops. Zero-downtime migration wasn't optional—it was essential.
The core issue was table locks. In PostgreSQL, certain DDL operations acquire an ACCESS EXCLUSIVE lock on the entire table. While that lock is held, every other query against the table, reads and writes alike, queues up and waits until the DDL finishes.
My migration looked like this:
ALTER TABLE users
ADD COLUMN email_verified BOOLEAN NOT NULL DEFAULT false;
This single line updated all 1 million rows. While writing the false default value to each row, the table was locked. Like locking the library entrance while reorganizing all the books.
Initially I didn't get it. "Why does adding one column take so long?" But from the DB's perspective, it made sense. It needs to physically write new data to each row, and if other operations interfere, data consistency breaks. So it locks.
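A safeguard I only learned about later, and which I'm adding here rather than something from the original incident: PostgreSQL's lock_timeout. Setting it before a risky DDL statement makes the statement fail fast instead of queueing behind long-running transactions and stalling everything behind it. A minimal sketch:
-- Sketch: abort the DDL if it cannot acquire its lock within 5 seconds,
-- instead of waiting indefinitely and blocking every query queued behind it.
SET lock_timeout = '5s';
ALTER TABLE users
ADD COLUMN email_verified BOOLEAN NOT NULL DEFAULT false;
This does not make the statement itself any cheaper, but it caps how long the table can sit behind a blocked ALTER before you get an error and can retry at a quieter moment.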
The key insight: DB migrations must be gradual. Try to change everything at once, you get locks and service downtime.
The heart of zero-downtime migration is the Expand-Contract pattern. Like crossing a bridge by placing one foot forward before lifting the other, you add the new schema before removing the old one, step by step.
Wrong approach (change all at once):
-- ❌ Dangerous: causes table lock
ALTER TABLE users
ADD COLUMN email_verified BOOLEAN NOT NULL DEFAULT false;
Right approach (four separate steps):
-- Step 1: Add column as nullable (minimal lock)
ALTER TABLE users
ADD COLUMN email_verified BOOLEAN;
-- Step 2: Backfill existing data in batches (application level)
-- PostgreSQL's UPDATE has no LIMIT clause, so batch through a subquery
-- and repeat until it reports 0 rows updated
UPDATE users
SET email_verified = false
WHERE id IN (
    SELECT id FROM users
    WHERE email_verified IS NULL
    LIMIT 1000
);
-- Step 3: Add NOT NULL constraint (after all data filled)
ALTER TABLE users
ALTER COLUMN email_verified SET NOT NULL;
-- Step 4: Set default value (metadata only)
ALTER TABLE users
ALTER COLUMN email_verified SET DEFAULT false;
In the first step, adding as nullable means PostgreSQL only modifies metadata. It doesn't touch actual rows, so locks are minimal. Then backfilling in batches keeps the service running.
The pattern's essence is separating physical changes from logical changes. Like remodeling a building by setting up temporary partitions and working on one side at a time.
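Step 2, the batched backfill, is something I would run from the application rather than as one giant SQL statement. A minimal sketch with Prisma, where the batch size and the loop structure are my own assumptions rather than a prescribed recipe:
import { PrismaClient } from '@prisma/client';
const prisma = new PrismaClient();

// Backfill email_verified in small batches so no single statement
// touches a large number of rows (and their locks) at once.
const batchSize = 1000;

while (true) {
  // $executeRaw returns the number of rows affected
  const updated = await prisma.$executeRaw`
    UPDATE users
    SET email_verified = false
    WHERE id IN (
      SELECT id FROM users
      WHERE email_verified IS NULL
      LIMIT ${batchSize}
    )
  `;
  if (updated === 0) break; // nothing left to backfill
  await new Promise((resolve) => setTimeout(resolve, 100)); // breathe between batches
}
Each batch is its own short transaction, so readers and writers slip in between batches instead of waiting behind one huge UPDATE.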
Creating indexes is also risky. Regular CREATE INDEX scans the entire table while holding a SHARE lock. Reads work but writes are blocked.
Wrong approach:
-- ❌ Dangerous: blocks write operations
CREATE INDEX idx_users_email ON users(email);
Right approach:
-- ✅ Safe: allows concurrent access
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
Using CONCURRENTLY makes index creation take roughly twice as long (it scans the table twice), but the service doesn't stop. The tradeoff is clear: trade time for stability.
Important note: CONCURRENTLY can't be used inside transactions. With ORMs like Prisma, you need raw SQL:
// Creating CONCURRENTLY index in Prisma
await prisma.$executeRawUnsafe(
'CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email ON users(email)'
);
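One caveat worth knowing, from PostgreSQL's documented behavior rather than from this incident: if a CONCURRENTLY build fails partway, it can leave behind an INVALID index that has to be dropped before retrying. A quick check looks like this:
-- Find indexes left invalid by a failed CONCURRENTLY build
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
-- Drop the broken index, then retry the CREATE INDEX CONCURRENTLY
DROP INDEX CONCURRENTLY IF EXISTS idx_users_email;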
Renaming columns seems simple but is dangerous. RENAME COLUMN only changes metadata so it's fast, but the problem is application code. If the column name changes before code deploys, you get immediate errors.
-- Step 1: Add new column
ALTER TABLE users ADD COLUMN full_name TEXT;
-- Step 2: Copy existing data (batched, same subquery pattern as before)
UPDATE users SET full_name = name
WHERE id IN (
    SELECT id FROM users WHERE full_name IS NULL LIMIT 1000
);
// Step 3: Write to both columns in application code
await prisma.user.update({
where: { id },
data: {
name: newName, // Old column
full_name: newName, // New column
},
});
-- Step 4: After deployment, stop reading old column
-- (modify code to read only full_name)
-- Step 5: Stop writing to old column
-- (remove name from code)
-- Step 6: Drop old column
ALTER TABLE users DROP COLUMN name;
This process takes days. No rush. Stability matters more than speed. Like migrating a system by moving cables one by one.
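One detail between steps 3 and 4: rows written after the backfill but before the dual-write deploy may only have the old column filled, so reads during the transition should tolerate that. A small sketch, with the fallback logic being my own assumption rather than a fixed recipe:
// During the transition, prefer the new column and fall back to the old one.
const user = await prisma.user.findUnique({ where: { id } });
const displayName = user?.full_name ?? user?.name;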
Dropping columns follows the same idea, with the destructive step saved for last:
-- Only execute in final step
ALTER TABLE users DROP COLUMN old_column;
DROP COLUMN in PostgreSQL only changes metadata, so it's fast, but it can't be rolled back. So we wait: as long as the column still exists, rolling back the code is enough to resume reading and writing it.
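If the old column's data is worth keeping a recovery path for, one option, a habit of mine rather than anything the tools require, is to snapshot it into a side table right before the drop:
-- Optional safety net: keep the old values before dropping the column
CREATE TABLE users_name_backup AS
SELECT id, name FROM users;
ALTER TABLE users DROP COLUMN name;
-- If the column ever has to come back, re-add it and restore from the backup table.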
My settled workflow with Supabase:
# 1. Write migration locally
supabase migration new add_email_verified
# 2. Edit SQL file (apply zero-downtime strategy)
# supabase/migrations/20260205_add_email_verified.sql
-- Add column as nullable
ALTER TABLE users
ADD COLUMN IF NOT EXISTS email_verified BOOLEAN;
-- Index with CONCURRENTLY
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email_verified
ON users(email_verified)
WHERE email_verified = true;
# 3. Test locally
supabase db reset
# 4. Deploy to staging
supabase db push --db-url $STAGING_DB_URL
# 5. Deploy to production (after a deep breath)
supabase db push --db-url $PRODUCTION_DB_URL
The key is testing in staging first. Clone production data, run the migration, measure how long locks last.
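To actually see that impact, it helps to watch for blocked sessions while the migration runs on staging. The query below is standard PostgreSQL; running it in a second session during the migration is just my own setup:
-- Run in a separate session while the migration executes:
-- any rows returned are queries currently waiting on a lock.
SELECT pid, wait_event_type, wait_event, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';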
Always prepare rollback scripts for when migrations fail:
-- 20260205_add_email_verified_rollback.sql
DROP INDEX CONCURRENTLY IF EXISTS idx_users_email_verified;
ALTER TABLE users DROP COLUMN IF EXISTS email_verified;
Supabase and Prisma don't support automatic rollback. So you prepare manually. I developed a habit of writing the rollback script before executing the migration.
Things to never do in production:
- ALTER TABLE ... ADD COLUMN ... NOT NULL DEFAULT ... (causes locks)
- CREATE INDEX without CONCURRENTLY (blocks writes)
- ALTER TABLE ... RENAME COLUMN (needs code sync)
- DROP COLUMN immediately (can't rollback)

Safe operations:
- ADD COLUMN nullable (metadata only)
- CREATE INDEX CONCURRENTLY (allows concurrent access)

The biggest change from learning zero-downtime migration was my concept of time. I used to think "just change it now", but now I think "let's change it over several days".
Three core principles stuck with me: slow, steady, one step at a time. DB migration is not a sprint, it's mountain climbing. Rush and slip, and everything collapses. Going step by step is ultimately the fastest path.
Now I run through a checklist before every migration. The last item on it seems like a joke, but it's real: no DB migrations on Friday afternoons. Not if you want to keep your weekend intact.