Zero-Downtime Database Migrations — The Patterns That Actually Work
The 4 AM Migration Horror Story
A client's team scheduled a database migration for 4 AM on a Saturday. "Add a NOT NULL column to the orders table. Should take 5 minutes." The orders table had 47 million rows. The ALTER TABLE locked the entire table. The migration ran for 3 hours and 22 minutes. During that time, every order, every checkout, every dashboard query — blocked. $180K in lost revenue.
Zero-downtime migrations aren't a luxury. They're a requirement for any production system processing real transactions.
The Rules of Zero-Downtime Migrations
Rule 1: Never lock a table with active traffic
Rule 2: Every migration must be backward-compatible
Rule 3: Deploy in phases — code first, then schema, then cleanup
Rule 4: Always have a rollback plan that doesn't require another migration
Pattern 1: The Expand-Contract Pattern
The safest approach for schema changes. Three phases:
Phase 1 — EXPAND: Add new column/table (nullable, no constraints)
→ Old code still works, new column is just ignored
Phase 2 — MIGRATE: Backfill data, deploy code that writes to both
→ Both old and new schemas work simultaneously
Phase 3 — CONTRACT: Remove old column/table, add constraints
→ Only after all code uses the new schema
Example: Renaming a column from name to full_name:
-- Phase 1: EXPAND (deploy this migration)
ALTER TABLE customers ADD COLUMN full_name TEXT;
-- Phase 2: MIGRATE (deploy code change + backfill)
-- Code now writes to BOTH columns
UPDATE customers SET full_name = name WHERE full_name IS NULL;
-- Run in batches for large tables (see Pattern 3)
-- Phase 3: CONTRACT (after all code uses full_name)
ALTER TABLE customers DROP COLUMN name;
ALTER TABLE customers ALTER COLUMN full_name SET NOT NULL;
Each phase is a separate deployment. If anything goes wrong, you roll back to the previous phase — no data loss, no downtime.
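Phase 2's dual-write step happens in application code, not in a migration. A minimal sketch of what that write path might look like (the helper and query-builder shape are illustrative, not from the original; any SQL client that accepts parameterized queries would work):

```typescript
// Hypothetical MIGRATE-phase write path: persist to BOTH columns so that
// code still reading the old `name` column keeps working.
interface CustomerInput {
  id: number;
  fullName: string;
}

// Builds the parameterized UPDATE used while both schemas must stay valid.
function buildDualWrite(customer: CustomerInput): { sql: string; params: unknown[] } {
  return {
    sql: "UPDATE customers SET name = $1, full_name = $1 WHERE id = $2",
    params: [customer.fullName, customer.id],
  };
}
```

Once Phase 3 ships and nothing reads `name` anymore, this collapses back to a single-column write.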
Pattern 2: Adding NOT NULL Columns Safely
The migration that caused the 4 AM outage. Here's how to do it safely:
-- ❌ DANGEROUS: on PostgreSQL 10 and earlier (and MySQL without online DDL),
-- this rewrites the entire table under an exclusive lock while it backfills
ALTER TABLE orders ADD COLUMN status TEXT NOT NULL DEFAULT 'pending';
-- ✅ SAFE: Three-step approach
-- Step 1: Add nullable column (metadata-only, just a momentary lock)
ALTER TABLE orders ADD COLUMN status TEXT;
-- Step 2: Set default for new rows (metadata-only, just a momentary lock)
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
-- Step 3: Backfill existing rows in batches
-- (COMMIT inside a DO block needs PostgreSQL 11+ and must run outside
-- an explicit transaction block)
DO $$
DECLARE
  batch_size INT := 10000;
  rows_updated INT := 0;
BEGIN
  LOOP
    UPDATE orders
    SET status = 'pending'
    WHERE id IN (
      SELECT id FROM orders
      WHERE status IS NULL
      LIMIT batch_size
      FOR UPDATE SKIP LOCKED
    );
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    RAISE NOTICE 'Updated % rows', rows_updated;
    COMMIT; -- release row locks after each batch instead of holding them all
    PERFORM pg_sleep(0.1); -- Brief pause to reduce load
  END LOOP;
END $$;
-- Step 4: Add NOT NULL constraint (after backfill completes)
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
The FOR UPDATE SKIP LOCKED clause is critical — it prevents the backfill from blocking concurrent transactions. The per-batch COMMIT matters just as much: without it, the entire DO block runs as one transaction and holds every row lock until the end, defeating the purpose of batching. Note that SET NOT NULL itself takes a brief ACCESS EXCLUSIVE lock and scans the table; on PostgreSQL 12+ you can make it near-instant by first adding a CHECK (status IS NOT NULL) NOT VALID constraint and running VALIDATE CONSTRAINT, which uses a much weaker lock.
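The exit condition in the loop above (stop when a batch touches zero rows) is easy to get wrong, so it is worth sanity-checking the logic in isolation. Here is a dependency-free TypeScript model of the same loop, using an in-memory array as a stand-in for the table (names and shapes are illustrative):

```typescript
// In-memory model of the batched backfill: fill null `status` values
// batch-by-batch until a pass finds no remaining work.
type Row = { id: number; status: string | null };

function backfillBatched(rows: Row[], batchSize: number): number {
  let batches = 0;
  for (;;) {
    const batch = rows.filter((r) => r.status === null).slice(0, batchSize);
    if (batch.length === 0) break; // mirrors EXIT WHEN rows_updated = 0
    for (const r of batch) r.status = "pending";
    batches++;
  }
  return batches;
}
```

Running it a second time performs zero batches, which is exactly the idempotence you want from a backfill script.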
Pattern 3: Large Table Backfills
For tables with millions of rows, batch processing is non-negotiable:
// Assumes a pg-style client, e.g.: import { Pool } from "pg"; const db = new Pool();
async function backfillInBatches(options: {
  table: string;       // trusted identifiers only -- these are interpolated
  setClause: string;   // into the SQL, not parameterized, so never pass
  whereClause: string; // user input here
  batchSize: number;
  delayMs: number;
}) {
  let totalUpdated = 0;
  let batchCount = 0;
  while (true) {
    const result = await db.query(`
      WITH batch AS (
        SELECT id FROM ${options.table}
        WHERE ${options.whereClause}
        LIMIT ${options.batchSize}
        FOR UPDATE SKIP LOCKED
      )
      UPDATE ${options.table}
      SET ${options.setClause}
      WHERE id IN (SELECT id FROM batch)
    `);
    if (result.rowCount === 0) break; // done -- don't log an empty batch
    totalUpdated += result.rowCount;
    batchCount++;
    console.log(`Batch ${batchCount}: ${result.rowCount} rows (${totalUpdated} total)`);
    await sleep(options.delayMs); // Reduce database load between batches
  }
  return { totalUpdated, batchCount };
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Usage
await backfillInBatches({
  table: "orders",
  setClause: "status = 'pending'",
  whereClause: "status IS NULL",
  batchSize: 5000,
  delayMs: 100, // 100ms pause between batches
});
Pattern 4: Creating Indexes Without Locking
Standard CREATE INDEX locks the table for writes. Use CONCURRENTLY:
-- ❌ Blocks all writes to the table while the index builds
CREATE INDEX idx_orders_status ON orders (status);
-- ✅ Builds index without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);
The CONCURRENTLY option takes longer (it scans the table twice) but blocks neither reads nor writes. Two caveats: it can't run inside a transaction block, and if it fails, it leaves behind an invalid index that you must drop before retrying.
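Because a failed CONCURRENTLY build leaves an invalid index behind, retry tooling needs a cleanup step. A sketch of the statements such a tool would run (the function itself is hypothetical; a real runner would execute these through a client like node-postgres, outside any transaction):

```typescript
// Statements for a retry-safe concurrent index build. A failed attempt
// leaves the index flagged invalid in pg_index: check, drop, then rebuild.
function concurrentIndexStatements(index: string, table: string, column: string) {
  return {
    // Returns a row only if a leftover index from a failed build exists.
    checkInvalid:
      `SELECT 1 FROM pg_class c JOIN pg_index i ON i.indexrelid = c.oid ` +
      `WHERE c.relname = '${index}' AND NOT i.indisvalid`,
    // Run only when the check matched; CONCURRENTLY avoids blocking writes.
    dropInvalid: `DROP INDEX CONCURRENTLY IF EXISTS ${index}`,
    // Must run outside a transaction block.
    create: `CREATE INDEX CONCURRENTLY ${index} ON ${table} (${column})`,
  };
}
```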
The Migration Checklist
Before running any migration on production:
Pre-Migration:
□ Tested on a production-sized dataset (not just dev with 100 rows)
□ Measured lock time — any operation > 1 second needs the safe pattern
□ Backfill script tested and idempotent (safe to run multiple times)
□ Rollback plan documented and tested
□ Monitoring alerts set up for query latency and lock waits
During Migration:
□ Run during lowest-traffic window (but don't depend on low traffic)
□ Monitor active locks: SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock'
□ Monitor replication lag if using replicas
□ Have a kill switch ready for long-running queries
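The lock-monitoring and kill-switch items can be scripted together. A sketch of the decision logic (the session shape mirrors pg_stat_activity fields; the threshold and function name are illustrative, and a real runner would pass the returned pids to pg_terminate_backend):

```typescript
// Pick the sessions a migration kill switch should terminate: queries
// that have waited on a lock longer than the allowed threshold.
type Session = { pid: number; waitEventType: string | null; waitMs: number };

function pidsToTerminate(sessions: Session[], maxWaitMs: number): number[] {
  return sessions
    .filter((s) => s.waitEventType === "Lock" && s.waitMs > maxWaitMs)
    .map((s) => s.pid);
}
```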
Post-Migration:
□ Verify data integrity (row counts, null checks, constraint violations)
□ Monitor application error rates for 30 minutes
□ Run ANALYZE on affected tables to update query planner statistics
□ Document the migration for the team
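The integrity checks reduce to a couple of comparisons. A sketch of the evaluation step (function name and report format are illustrative; the inputs would come from simple COUNT queries taken before and after the migration):

```typescript
// Evaluate post-migration checks for a backfill: no NULLs may remain,
// and the table's row count must match the pre-migration snapshot.
function verifyBackfill(nullCount: number, rowsBefore: number, rowsAfter: number): string[] {
  const problems: string[] = [];
  if (nullCount !== 0) problems.push(`${nullCount} rows still NULL`);
  if (rowsBefore !== rowsAfter) problems.push(`row count changed: ${rowsBefore} -> ${rowsAfter}`);
  return problems; // empty array means the migration passed
}
```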
Database migrations don't have to be scary. With the expand-contract pattern, batched backfills, and concurrent index creation, you can modify production schemas with millions of rows while your application keeps serving traffic. The key is patience — three safe deploys beat one dangerous one.