
Database Migrations Without Downtime

January 4, 2026 · ScaledByDesign
database · migrations · deployment · postgres

The Migration That Took Down Production

It's always the same story: a developer runs ALTER TABLE orders ADD COLUMN discount_code VARCHAR(255) on a table with 50 million rows. PostgreSQL needs an ACCESS EXCLUSIVE lock for the ALTER, the lock request queues behind a long-running query, every read and write issued after it queues behind the ALTER, the connection pool exhausts, and the site goes down for 20 minutes while the backlog drains.

The real cost: One client ran this exact migration during business hours on a table with 80M rows. Site down for 23 minutes. Orders lost: 1,247. Revenue impact: $94K. Customer support tickets: 600+. Three enterprise deals stalled because prospects saw the downtime. Total cost: $220K+ for a migration that should have taken zero downtime.

This is preventable. Every time.

Why Migrations Are Dangerous

PostgreSQL (like most databases) needs locks to modify schema. The danger level depends on the operation:

Operation | Lock Type | Danger Level
ADD COLUMN (nullable, no default) | ACCESS EXCLUSIVE (brief) | Low ✅
ADD COLUMN with DEFAULT (PG 11+) | ACCESS EXCLUSIVE (brief) | Low ✅
ADD COLUMN with DEFAULT (PG < 11) | ACCESS EXCLUSIVE (rewrites table) | Critical ❌
DROP COLUMN | ACCESS EXCLUSIVE (brief) | Medium ⚠️
ALTER COLUMN TYPE | ACCESS EXCLUSIVE (rewrites table) | Critical ❌
ADD INDEX | SHARE (blocks writes) | High ⚠️
ADD INDEX CONCURRENTLY | None (mostly) | Low ✅
ADD NOT NULL constraint | ACCESS EXCLUSIVE (scans table) | High ⚠️
ADD FOREIGN KEY | SHARE ROW EXCLUSIVE | High ⚠️
RENAME COLUMN | ACCESS EXCLUSIVE (brief) | Medium ⚠️
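
Even the operations marked "brief" can cause an outage: the ALTER has to wait in line for its ACCESS EXCLUSIVE lock behind any long-running query, and every new query on the table then waits behind the ALTER. A common guard, sketched here with illustrative timeout values, is to cap how long the migration session will wait or run:

-- Bail out quickly instead of stalling the whole table behind the ALTER
SET lock_timeout = '5s';
SET statement_timeout = '15min';
ALTER TABLE orders ADD COLUMN discount_code VARCHAR(255);
-- If it times out, retry later or during a quiet window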

The Safe Migration Playbook

Pattern 1: Add Column (The Right Way)

-- ✅ SAFE: Nullable column, no default
ALTER TABLE orders ADD COLUMN discount_code VARCHAR(255);
 
-- ✅ SAFE on PG 11+: With default (metadata-only change)
ALTER TABLE orders ADD COLUMN status VARCHAR(50) DEFAULT 'pending';
 
-- ❌ DANGEROUS on PG < 11: Rewrites entire table
ALTER TABLE orders ADD COLUMN status VARCHAR(50) DEFAULT 'pending';
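
On PostgreSQL versions before 11, the usual workaround is to split the change so the table is never rewritten; a minimal sketch:

-- 1. Add the column without a default (brief lock, no rewrite)
ALTER TABLE orders ADD COLUMN status VARCHAR(50);
-- 2. Set the default for future rows only (metadata-only change)
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
-- 3. Backfill existing rows in batches (see "The Backfill Strategy" below)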

Pattern 2: Create Index (Always Concurrently)

-- ❌ DANGEROUS: Blocks all writes until complete
CREATE INDEX idx_orders_email ON orders (email);
 
-- ✅ SAFE: Builds index without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_email ON orders (email);
 
-- Note: CONCURRENTLY can't run inside a transaction
-- Your migration tool needs to support this

Important: If a concurrent index build fails, it leaves an invalid index. Always check:

SELECT indexrelid::regclass, indisvalid 
FROM pg_index 
WHERE NOT indisvalid;
-- Drop invalid indexes and retry
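
Cleaning up is a drop-and-retry; DROP INDEX CONCURRENTLY avoids blocking writes while you do it:

-- Remove the invalid index without blocking writes, then rebuild it
DROP INDEX CONCURRENTLY idx_orders_email;
CREATE INDEX CONCURRENTLY idx_orders_email ON orders (email);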

Pattern 3: Change Column Type (The Expand-Contract Pattern)

Never ALTER COLUMN TYPE on a large table. Instead:

Why this matters: ALTER COLUMN TYPE on a 30M row table rewrites the entire table while holding an ACCESS EXCLUSIVE lock. On production hardware, that's 8-15 minutes of total downtime. At $8K/minute in lost GMV, that's $64K-120K in lost revenue. The expand-contract pattern costs zero downtime, just three extra deploy steps spread over a week.

Step 1: Add new column
  ALTER TABLE orders ADD COLUMN amount_v2 BIGINT;

Step 2: Backfill (in batches)
  UPDATE orders SET amount_v2 = amount 
  WHERE id BETWEEN 1 AND 10000;
  -- Repeat in batches of 10k

Step 3: Deploy code that writes to BOTH columns
  -- Application writes to amount AND amount_v2

Step 4: Verify data consistency
  SELECT COUNT(*) FROM orders 
  WHERE amount_v2 IS NULL AND created_at < NOW() - INTERVAL '1 hour';

Step 5: Switch reads to new column
  -- Application reads from amount_v2

Step 6: Stop writing to old column
  -- Application only writes to amount_v2

Step 7: Drop old column (in a later migration)
  ALTER TABLE orders DROP COLUMN amount;

Step 8: Rename new column (optional)
  ALTER TABLE orders RENAME COLUMN amount_v2 TO amount;

Yes, it's more steps. No, your site doesn't go down.
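
The dual writes in Step 3 usually live in application code, but some teams keep the columns in sync with a short-lived trigger at the database layer instead; a hedged sketch (function and trigger names are illustrative, and EXECUTE FUNCTION needs PostgreSQL 11+):

-- Copy writes from the old column into the new one during the transition
CREATE OR REPLACE FUNCTION orders_sync_amount_v2() RETURNS trigger AS $$
BEGIN
  NEW.amount_v2 := NEW.amount;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_sync_amount_v2
  BEFORE INSERT OR UPDATE ON orders
  FOR EACH ROW EXECUTE FUNCTION orders_sync_amount_v2();

-- Drop the trigger once reads and writes have moved to amount_v2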

Pattern 4: Add NOT NULL Constraint

-- ❌ DANGEROUS: Scans entire table while holding lock
ALTER TABLE orders ALTER COLUMN email SET NOT NULL;
 
-- ✅ SAFE: Add as a CHECK constraint with NOT VALID
ALTER TABLE orders ADD CONSTRAINT orders_email_not_null 
  CHECK (email IS NOT NULL) NOT VALID;
 
-- Then validate separately (no exclusive lock needed)
ALTER TABLE orders VALIDATE CONSTRAINT orders_email_not_null;
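
If you want the real NOT NULL rather than the CHECK constraint, PostgreSQL 12 and newer can use the validated constraint to skip the full table scan, so a safe follow-up looks like this (older versions still scan, so keep the CHECK there):

-- PG 12+: the validated CHECK proves there are no NULLs, so no scan
ALTER TABLE orders ALTER COLUMN email SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_email_not_null;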

Pattern 5: Add Foreign Key

-- ❌ DANGEROUS: Validates all existing rows while holding lock
ALTER TABLE orders ADD FOREIGN KEY (customer_id) 
  REFERENCES customers (id);
 
-- ✅ SAFE: Add without validation, then validate separately
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer 
  FOREIGN KEY (customer_id) REFERENCES customers (id) 
  NOT VALID;
 
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer;
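
VALIDATE CONSTRAINT fails (and rolls back) if any orphaned rows exist, so it's worth checking for them first; a minimal sketch:

-- Rows that would make validation fail
SELECT o.id
FROM orders o
LEFT JOIN customers c ON c.id = o.customer_id
WHERE o.customer_id IS NOT NULL AND c.id IS NULL
LIMIT 10;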

The Backfill Strategy

Large data backfills are as dangerous as schema changes if done wrong:

// ❌ DANGEROUS: One massive UPDATE
await db.query(`UPDATE orders SET discount_code = 'none' WHERE discount_code IS NULL`);
// Locks millions of rows, generates massive WAL, kills replication lag
 
Real failure: A team ran a one-shot UPDATE on 45M rows to backfill a new
field. The query ran for 18 minutes, generated 80GB of WAL, and pushed
replication lag to 22 minutes. Read replicas served stale data. Inventory
showed products in stock that were actually sold out. Oversold 340 items.
Cost: $28K in customer credits + support burden.
 
// ✅ SAFE: Batched updates with pauses
async function backfill() {
  const BATCH_SIZE = 5000;
  let lastId = 0;

  while (true) {
    // Keyset batching: update the next BATCH_SIZE rows that still need the
    // backfill, so gaps in the id sequence or already-filled rows can't end
    // the loop early
    const result = await db.query(`
      UPDATE orders
      SET discount_code = 'none'
      WHERE id IN (
        SELECT id FROM orders
        WHERE id > $1 AND discount_code IS NULL
        ORDER BY id
        LIMIT $2
      )
      RETURNING id
    `, [lastId, BATCH_SIZE]);

    if (result.rowCount === 0) break;

    lastId = Math.max(...result.rows.map((row) => row.id));

    // Pause to let replication catch up
    await sleep(100);

    // Check replication lag and back off if replicas fall behind
    const lag = await getReplicationLag();
    if (lag > 5000) {
      console.log(`Replication lag ${lag}ms, pausing...`);
      await sleep(5000);
    }
  }
}
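
getReplicationLag() is left as a helper here; one common approach, assuming streaming replication on PostgreSQL 10 or newer, is to read replay_lag from pg_stat_replication on the primary:

-- Per-replica replay lag in milliseconds, measured on the primary
SELECT application_name,
       COALESCE(EXTRACT(EPOCH FROM replay_lag) * 1000, 0) AS replay_lag_ms
FROM pg_stat_replication;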

The Pre-Migration Checklist

Run before every production migration:

Before:
  [ ] Tested on staging with production-size data
  [ ] Checked lock type for each operation
  [ ] Verified no long-running queries on target tables (see the query after this checklist)
  [ ] Backfill strategy documented for data migrations
  [ ] Rollback plan written and tested
  [ ] Team notified of migration window

During:
  [ ] Monitor: active queries, lock waits, replication lag
  [ ] Set lock_timeout and statement_timeout so a stuck migration cancels instead of blocking traffic
  [ ] Run during low-traffic window if possible

After:
  [ ] Verify schema changes applied correctly
  [ ] Check for invalid indexes
  [ ] Monitor application error rates for 30 minutes
  [ ] Verify replication is caught up
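
The "no long-running queries" check above is just a query against pg_stat_activity; a minimal sketch (the one-minute threshold is an example, not a recommendation):

-- Anything long-running that could block, or be blocked by, the migration
SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '1 minute'
ORDER BY runtime DESC;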

Tooling That Helps

  • pg_stat_activity — see active queries and locks in real-time
  • pg_locks — understand what's waiting for what (see the sketch below)
  • pgroll — automated zero-downtime schema migrations
  • reshape — another zero-downtime migration tool
  • strong_migrations (Ruby) / safe-pg-migrations — catch dangerous migrations in CI
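
As a starting point for the lock tooling above, pg_blocking_pids() (available since PostgreSQL 9.6) answers "who is blocked, and by whom" without decoding pg_locks by hand:

-- Sessions that are currently blocked, and the pids blocking them
SELECT blocked.pid AS blocked_pid,
       pg_blocking_pids(blocked.pid) AS blocking_pids,
       left(blocked.query, 80) AS blocked_query
FROM pg_stat_activity AS blocked
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;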

The Rule

If a migration takes longer than 1 second on your largest table, it needs the expand-contract pattern. No exceptions. The hour or so of extra development time is always worth avoiding the 20 minutes of downtime and the incident retro that follows.

The math is simple:

Unsafe migration:
  - Development time: 15 minutes (write the ALTER statement)
  - Downtime: 15-25 minutes (table locked during migration)
  - Revenue lost at $6K/min: $90K-150K
  - Customer impact: 100% of users
  - Total cost: $90K-150K + reputation damage

Safe migration (expand-contract):
  - Development time: 90 minutes (7 steps over 3 deploys)
  - Downtime: 0 minutes
  - Revenue lost: $0
  - Customer impact: 0%
  - Total cost: 2 hours of engineering time ($200)

ROI: Spend $200 to save $90K-150K. Yet teams skip it to "move fast."

Schema changes are a solved problem. The solution is just slower and more deliberate than most teams want to be. Be deliberate anyway.
