Zero-Downtime Database Migrations — The Patterns That Actually Work
The 4 AM Migration Horror Story
A client's team scheduled a database migration for 4 AM on a Saturday. "Add a NOT NULL column to the orders table. Should take 5 minutes." The orders table had 47 million rows. The ALTER TABLE locked the entire table. The migration ran for 3 hours and 22 minutes. During that time, every order, every checkout, every dashboard query — blocked. $180K in lost revenue.
Zero-downtime migrations aren't a luxury. They're a requirement for any production system processing real transactions.
The Rules of Zero-Downtime Migrations
Rule 1: Never lock a table with active traffic
Rule 2: Every migration must be backward-compatible
Rule 3: Deploy in phases — code first, then schema, then cleanup
Rule 4: Always have a rollback plan that doesn't require another migration
Pattern 1: The Expand-Contract Pattern
The safest approach for schema changes. Three phases:
Phase 1 — EXPAND: Add new column/table (nullable, no constraints)
→ Old code still works, new column is just ignored
Phase 2 — MIGRATE: Backfill data, deploy code that writes to both
→ Both old and new schemas work simultaneously
Phase 3 — CONTRACT: Remove old column/table, add constraints
→ Only after all code uses the new schema
Example: Renaming a column from name to full_name:
-- Phase 1: EXPAND (deploy this migration)
ALTER TABLE customers ADD COLUMN full_name TEXT;
-- Phase 2: MIGRATE (deploy code change + backfill)
-- Code now writes to BOTH columns
UPDATE customers SET full_name = name WHERE full_name IS NULL;
-- Run in batches for large tables (see Pattern 3)
-- Phase 3: CONTRACT (after all code uses full_name)
ALTER TABLE customers DROP COLUMN name;
ALTER TABLE customers ALTER COLUMN full_name SET NOT NULL;
Each phase is a separate deployment. If anything goes wrong, you roll back to the previous phase — no data loss, no downtime.
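Phase 2's dual-write step happens in application code, not in a migration. A minimal sketch of what that write path might look like (the helper and query-builder shape are illustrative, not from the original; any SQL client that accepts parameterized queries would work):

```typescript
// Hypothetical MIGRATE-phase write path: persist to BOTH columns so that
// code still reading the old `name` column keeps working.
interface CustomerInput {
  id: number;
  fullName: string;
}

// Builds the parameterized UPDATE used while both schemas must stay valid.
function buildDualWrite(customer: CustomerInput): { sql: string; params: unknown[] } {
  return {
    sql: "UPDATE customers SET name = $1, full_name = $1 WHERE id = $2",
    params: [customer.fullName, customer.id],
  };
}
```

Once Phase 3 ships and nothing reads `name` anymore, this collapses back to a single-column write.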
Pattern 2: Adding NOT NULL Columns Safely
The migration that caused the 4 AM outage. Here's how to do it safely:
-- ❌ DANGEROUS: on PostgreSQL 10 and earlier (and MySQL without online DDL),
-- this rewrites the entire table under an exclusive lock while it backfills
ALTER TABLE orders ADD COLUMN status TEXT NOT NULL DEFAULT 'pending';
-- ✅ SAFE: Three-step approach
-- Step 1: Add nullable column (metadata-only, just a momentary lock)
ALTER TABLE orders ADD COLUMN status TEXT;
-- Step 2: Set default for new rows (metadata-only, just a momentary lock)
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
-- Step 3: Backfill existing rows in batches
-- (COMMIT inside a DO block needs PostgreSQL 11+ and must run outside
-- an explicit transaction block)
DO $$
DECLARE
  batch_size INT := 10000;
  rows_updated INT := 0;
BEGIN
  LOOP
    UPDATE orders
    SET status = 'pending'
    WHERE id IN (
      SELECT id FROM orders
      WHERE status IS NULL
      LIMIT batch_size
      FOR UPDATE SKIP LOCKED
    );
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    RAISE NOTICE 'Updated % rows', rows_updated;
    COMMIT; -- release row locks after each batch instead of holding them all
    PERFORM pg_sleep(0.1); -- Brief pause to reduce load
  END LOOP;
END $$;
-- Step 4: Add NOT NULL constraint (after backfill completes)
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
The FOR UPDATE SKIP LOCKED clause is critical — it prevents the backfill from blocking concurrent transactions. The per-batch COMMIT matters just as much: without it, the entire DO block runs as one transaction and holds every row lock until the end, defeating the purpose of batching. Note that SET NOT NULL itself takes a brief ACCESS EXCLUSIVE lock and scans the table; on PostgreSQL 12+ you can make it near-instant by first adding a CHECK (status IS NOT NULL) NOT VALID constraint and running VALIDATE CONSTRAINT, which uses a much weaker lock.
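The exit condition in the loop above (stop when a batch touches zero rows) is easy to get wrong, so it is worth sanity-checking the logic in isolation. Here is a dependency-free TypeScript model of the same loop, using an in-memory array as a stand-in for the table (names and shapes are illustrative):

```typescript
// In-memory model of the batched backfill: fill null `status` values
// batch-by-batch until a pass finds no remaining work.
type Row = { id: number; status: string | null };

function backfillBatched(rows: Row[], batchSize: number): number {
  let batches = 0;
  for (;;) {
    const batch = rows.filter((r) => r.status === null).slice(0, batchSize);
    if (batch.length === 0) break; // mirrors EXIT WHEN rows_updated = 0
    for (const r of batch) r.status = "pending";
    batches++;
  }
  return batches;
}
```

Running it a second time performs zero batches, which is exactly the idempotence you want from a backfill script.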
Pattern 3: Large Table Backfills
For tables with millions of rows, batch processing is non-negotiable:
// Assumes a pg-style client, e.g.: import { Pool } from "pg"; const db = new Pool();
async function backfillInBatches(options: {
  table: string;       // trusted identifiers only -- these are interpolated
  setClause: string;   // into the SQL, not parameterized, so never pass
  whereClause: string; // user input here
  batchSize: number;
  delayMs: number;
}) {
  let totalUpdated = 0;
  let batchCount = 0;
  while (true) {
    const result = await db.query(`
      WITH batch AS (
        SELECT id FROM ${options.table}
        WHERE ${options.whereClause}
        LIMIT ${options.batchSize}
        FOR UPDATE SKIP LOCKED
      )
      UPDATE ${options.table}
      SET ${options.setClause}
      WHERE id IN (SELECT id FROM batch)
    `);
    if (result.rowCount === 0) break; // done -- don't log an empty batch
    totalUpdated += result.rowCount;
    batchCount++;
    console.log(`Batch ${batchCount}: ${result.rowCount} rows (${totalUpdated} total)`);
    await sleep(options.delayMs); // Reduce database load between batches
  }
  return { totalUpdated, batchCount };
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Usage
await backfillInBatches({
  table: "orders",
  setClause: "status = 'pending'",
  whereClause: "status IS NULL",
  batchSize: 5000,
  delayMs: 100, // 100ms pause between batches
});
Pattern 4: Creating Indexes Without Locking
Standard CREATE INDEX locks the table for writes. Use CONCURRENTLY:
-- ❌ Blocks all writes to the table while the index builds
CREATE INDEX idx_orders_status ON orders (status);
-- ✅ Builds index without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);
The CONCURRENTLY option takes longer (it scans the table twice) but blocks neither reads nor writes. Two caveats: it can't run inside a transaction block, and if it fails, it leaves behind an invalid index that you must drop before retrying.
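Because a failed CONCURRENTLY build leaves an invalid index behind, retry tooling needs a cleanup step. A sketch of the statements such a tool would run (the function itself is hypothetical; a real runner would execute these through a client like node-postgres, outside any transaction):

```typescript
// Statements for a retry-safe concurrent index build. A failed attempt
// leaves the index flagged invalid in pg_index: check, drop, then rebuild.
function concurrentIndexStatements(index: string, table: string, column: string) {
  return {
    // Returns a row only if a leftover index from a failed build exists.
    checkInvalid:
      `SELECT 1 FROM pg_class c JOIN pg_index i ON i.indexrelid = c.oid ` +
      `WHERE c.relname = '${index}' AND NOT i.indisvalid`,
    // Run only when the check matched; CONCURRENTLY avoids blocking writes.
    dropInvalid: `DROP INDEX CONCURRENTLY IF EXISTS ${index}`,
    // Must run outside a transaction block.
    create: `CREATE INDEX CONCURRENTLY ${index} ON ${table} (${column})`,
  };
}
```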
The Migration Checklist
Before running any migration on production:
Pre-Migration:
□ Tested on a production-sized dataset (not just dev with 100 rows)
□ Measured lock time — any operation > 1 second needs the safe pattern
□ Backfill script tested and idempotent (safe to run multiple times)
□ Rollback plan documented and tested
□ Monitoring alerts set up for query latency and lock waits
During Migration:
□ Run during lowest-traffic window (but don't depend on low traffic)
□ Monitor active locks: SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock'
□ Monitor replication lag if using replicas
□ Have a kill switch ready for long-running queries
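The lock-monitoring and kill-switch items can be scripted together. A sketch of the decision logic (the session shape mirrors pg_stat_activity fields; the threshold and function name are illustrative, and a real runner would pass the returned pids to pg_terminate_backend):

```typescript
// Pick the sessions a migration kill switch should terminate: queries
// that have waited on a lock longer than the allowed threshold.
type Session = { pid: number; waitEventType: string | null; waitMs: number };

function pidsToTerminate(sessions: Session[], maxWaitMs: number): number[] {
  return sessions
    .filter((s) => s.waitEventType === "Lock" && s.waitMs > maxWaitMs)
    .map((s) => s.pid);
}
```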
Post-Migration:
□ Verify data integrity (row counts, null checks, constraint violations)
□ Monitor application error rates for 30 minutes
□ Run ANALYZE on affected tables to update query planner statistics
□ Document the migration for the team
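The integrity checks reduce to a couple of comparisons. A sketch of the evaluation step (function name and report format are illustrative; the inputs would come from simple COUNT queries taken before and after the migration):

```typescript
// Evaluate post-migration checks for a backfill: no NULLs may remain,
// and the table's row count must match the pre-migration snapshot.
function verifyBackfill(nullCount: number, rowsBefore: number, rowsAfter: number): string[] {
  const problems: string[] = [];
  if (nullCount !== 0) problems.push(`${nullCount} rows still NULL`);
  if (rowsBefore !== rowsAfter) problems.push(`row count changed: ${rowsBefore} -> ${rowsAfter}`);
  return problems; // empty array means the migration passed
}
```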
Database migrations don't have to be scary. With the expand-contract pattern, batched backfills, and concurrent index creation, you can modify production schemas with millions of rows while your application keeps serving traffic. The key is patience — three safe deploys beat one dangerous one.