Multi-Region Deployment Without the Headache
The $180K Wake-Up Call
A client called us after us-east-1 went down on a Friday afternoon. Their entire platform — API, dashboard, webhooks, everything — was dead for 4 hours. They lost two enterprise deals that were in final contract review. The prospects saw the outage, Googled "is [company] reliable," found the Hacker News thread, and ghosted.
Total damage: $180K in lost revenue, plus the reputation hit that's still costing them six months later.
Their CTO's first words to us: "We need to go multi-region." Our first question back: "Are you sure?"
Most Teams Go Multi-Region for the Wrong Reasons
Here's the uncomfortable truth: most companies don't need multi-region. They need better single-region resilience. Multi-AZ deployments, proper health checks, and circuit breakers solve 90% of what people think requires multi-region.
There are exactly two legitimate reasons to go multi-region:
- Your users are global and 200ms+ latency is killing conversions
- A single-region outage is an existential risk to your business
If neither of those is true, you're about to spend $50K+/year solving a problem you don't have. We've talked clients out of multi-region more often than we've built it.
But when you do need it, here's how we approach it without losing your mind.
The Three Approaches (Pick the Simplest One That Works)
| Approach | Complexity | Latency Win | Resilience | When to Use |
|---|---|---|---|---|
| Active-Passive | Low | ✗ None | ✓ Good | "We just need failover" |
| Active-Active Reads | Medium | ✓ Reads only | ✓ Better | "Our EU users are complaining" |
| Active-Active Full | High | ✓ Everything | ✓ Best | "We're processing payments globally" |
Most teams should start with Active-Passive and never move past Active-Active Reads. If someone on your team is pushing for full Active-Active, make them explain the conflict resolution strategy. That conversation usually ends the debate.
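To see why that conversation tends to end the debate, here is a minimal sketch of the simplest conflict resolution strategy, last-write-wins, applied to two regions that both accepted a write to the same record. The `VersionedRecord` type, the `mergeLastWriteWins` function, and the balance numbers are hypothetical and only for illustration; they don't come from any particular database.

```typescript
// Hypothetical record shape: each region stamps its own writes with a wall-clock time.
interface VersionedRecord<T> {
  value: T;
  updatedAtMs: number; // timestamp assigned by the region that accepted the write
  region: string;
}

// Last-write-wins: keep the newer timestamp, and break ties by region name
// so that every region deterministically picks the same winner.
function mergeLastWriteWins<T>(a: VersionedRecord<T>, b: VersionedRecord<T>): VersionedRecord<T> {
  if (a.updatedAtMs !== b.updatedAtMs) {
    return a.updatedAtMs > b.updatedAtMs ? a : b;
  }
  return a.region < b.region ? a : b;
}

// Both regions start from a balance of 100 and accept a withdrawal at nearly the same time.
const fromUsEast: VersionedRecord<{ balance: number }> = {
  value: { balance: 100 - 30 },
  updatedAtMs: 1000,
  region: "us-east-1",
};
const fromEuWest: VersionedRecord<{ balance: number }> = {
  value: { balance: 100 - 50 },
  updatedAtMs: 1002,
  region: "eu-west-1",
};

const merged = mergeLastWriteWins(fromUsEast, fromEuWest);
console.log(merged.value.balance); // 50: the 30 withdrawal is silently dropped (correct balance is 20)
```

The merge is a few lines of code, but it loses data whenever two regions touch the same record concurrently. Anything better (merging at the field level, or coordinating writes across regions) is where the complexity and the operational cost come from.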
Active-Active Reads: The Sweet Spot
This is where we land 80% of the time. Reads go to the nearest region, writes go to primary. Simple enough to operate, impactful enough to matter:
// The routing layer is dead simple, so don't overthink this
class DatabaseRouter {
  constructor(
    private primaryConnection: DatabaseConnection,
    private replicas: { region: string; connection: DatabaseConnection }[],
  ) {}
  getConnection(operation: "read" | "write"): DatabaseConnection {
    if (operation === "write") {
      return this.primaryConnection; // Always goes to us-east-1
    }
    // Reads go to the replica in this region, falling back to primary if there isn't one
    const currentRegion = process.env.AWS_REGION;
    const replica = this.replicas.find(r => r.region === currentRegion);
    return replica?.connection ?? this.primaryConnection;
  }
}
// Your application code barely changes
async function getOrder(orderId: string): Promise<Order> {
  const db = router.getConnection("read"); // Local replica: 5ms, not 200ms
  return db.query("SELECT * FROM orders WHERE id = $1", [orderId]);
}
async function createOrder(data: OrderInput): Promise<Order> {
  const db = router.getConnection("write"); // Crosses the ocean, but writes are rare
  return db.query("INSERT INTO orders ...", [data]);
}
The key insight we keep hammering into teams: reads outnumber writes roughly 10:1 in most applications. If you can make reads local, you've solved about 90% of the latency problem without touching the hard consistency stuff.
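The 90% figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes the numbers quoted in this section (10 reads per write, 5ms for a local read, 200ms for anything that crosses regions); `averageLatencyMs` is a hypothetical helper, and real traffic mixes will differ.

```typescript
// Average per-query latency for a given read/write mix.
function averageLatencyMs(readsPerWrite: number, readMs: number, writeMs: number): number {
  return (readsPerWrite * readMs + writeMs) / (readsPerWrite + 1);
}

// Before: a remote user pays the cross-region cost on every query.
const allRemote = averageLatencyMs(10, 200, 200);
// After: reads hit the local replica, only writes cross regions.
const localReads = averageLatencyMs(10, 5, 200);

const reduction = 1 - localReads / allRemote;
console.log(allRemote.toFixed(1));                // "200.0"
console.log(localReads.toFixed(1));               // "22.7"
console.log((reduction * 100).toFixed(0) + "%");  // "89%"
```

Under those assumptions the average drops from 200ms to about 23ms, which is where the "90% of the latency problem" estimate comes from.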
DNS Routing: The Part Everyone Overcomplicates
We've seen teams spend weeks building custom routing logic. Don't. Route 53 latency-based routing or Cloudflare does this out of the box:
api.example.com
→ User in New York → us-east-1 (12ms)
→ User in London → eu-west-1 (8ms)
→ User in Singapore → ap-southeast-1 (15ms)
The health check is the part that actually matters. And most teams get it wrong — they check if the server is running, not if the server can actually serve requests:
// Bad: checks if the process is alive (useless)
app.get("/health", (req, res) => res.json({ status: "ok" }));
// Good: checks if the full stack actually works
app.get("/health", async (req, res) => {
  const checks = await Promise.allSettled([
    db.query("SELECT 1"),
    redis.ping(),
    // fetch() resolves even on 4xx/5xx, so reject explicitly on a non-2xx response
    fetch(EXTERNAL_SERVICE_URL + "/health", {
      signal: AbortSignal.timeout(3000),
    }).then(r => {
      if (!r.ok) throw new Error(`external health returned ${r.status}`);
    }),
  ]);
  const allHealthy = checks.every(c => c.status === "fulfilled");
  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? "healthy" : "degraded",
    region: process.env.AWS_REGION,
    checks: {
      database: checks[0].status === "fulfilled" ? "ok" : "failed",
      cache: checks[1].status === "fulfilled" ? "ok" : "failed",
      external: checks[2].status === "fulfilled" ? "ok" : "failed",
    },
  });
});
If your health check returns 200 when the database is down, your DNS routing will happily send traffic to a region that can't serve it. We've seen this cause cascading failures that were worse than the original outage.
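To make that failure mode concrete, here is a small simulation of what latency-based routing with health checks effectively does: pick the fastest region that reports healthy. This is a sketch of the idea, not Route 53's or Cloudflare's actual algorithm; `RegionStatus`, `pickRegion`, and the latency numbers are made up for illustration.

```typescript
interface RegionStatus {
  region: string;
  latencyMs: number;    // illustrative latency from the user to this region
  healthStatus: number; // HTTP status returned by the region's /health endpoint
}

// Pick the lowest-latency region whose health check returned 200.
function pickRegion(candidates: RegionStatus[]): string | undefined {
  const healthy = candidates.filter(c => c.healthStatus === 200);
  if (healthy.length === 0) return undefined;
  return healthy.reduce((best, c) => (c.latencyMs < best.latencyMs ? c : best)).region;
}

// A London user while the eu-west-1 database is down.
const withDeepCheck: RegionStatus[] = [
  { region: "eu-west-1", latencyMs: 8, healthStatus: 503 }, // deep check reports the failure
  { region: "us-east-1", latencyMs: 80, healthStatus: 200 },
  { region: "ap-southeast-1", latencyMs: 170, healthStatus: 200 },
];
console.log(pickRegion(withDeepCheck)); // "us-east-1"

// Same outage, but the shallow check only knows the process is alive.
const withShallowCheck = withDeepCheck.map(c => ({ ...c, healthStatus: 200 }));
console.log(pickRegion(withShallowCheck)); // "eu-west-1", which cannot serve the request
```

With the deep check, the user pays some extra latency and gets a working response. With the shallow check, the nearest region keeps winning the routing decision for as long as its process stays up.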
Data Replication: Where Dreams Go to Die
This is the part that separates "we went multi-region" from "we went multi-region and it actually works." Here's what we recommend based on dozens of implementations:
| Option | Replication lag | Complexity | Our Take |
|---|---|---|---|
| PostgreSQL streaming replication | 10ms–1s | Low | ✓ Start here. Seriously. |
| Aurora Global Database | < 1s | Low | ✓ If you're on AWS and have the budget |
| CockroachDB | None | High | ✗ Unless you have a team that's run it before |
| Citus | None | High | ✗ Same — great tech, brutal to operate |
We've watched three different teams try to adopt CockroachDB for multi-region. One succeeded (they had a former Cockroach Labs engineer). Two rolled back to Aurora after burning 4+ months. The technology is impressive. The operational overhead is real.
For caching, skip the fancy global replication. Local Redis per region with a 30-second TTL handles 95% of use cases. Accept slightly stale reads. Your users won't notice. Your ops team will thank you.
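The behavior being described is plain time-based expiry. With Redis itself you get it by setting an expiry on write (for example `SET key value EX 30`); the sketch below uses a hypothetical in-memory `TtlCache` class as a stand-in so the expiry logic is visible and runnable without a Redis server.

```typescript
// Minimal in-memory stand-in for a per-region cache with a fixed TTL.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAtMs: number }>();

  // `now` is injectable so the expiry can be demonstrated with a fake clock.
  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAtMs) {
      this.entries.delete(key); // expired: the caller reloads from the database
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAtMs: this.now() + this.ttlMs });
  }
}

// Usage with a fake clock and the 30-second TTL from the text.
let clockMs = 0;
const cache = new TtlCache<string>(30000, () => clockMs);
cache.set("order:42", "pending");

clockMs = 29000;
console.log(cache.get("order:42")); // "pending" (possibly stale, but inside the 30s window)

clockMs = 30000;
console.log(cache.get("order:42")); // undefined (expired, so the next read goes to the replica)
```

The trade-off is explicit: a value can be up to 30 seconds out of date, and in exchange there is no cross-region cache invalidation to build or debug.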
The Read-After-Write Problem (The One That Bites You)
Here's the scenario that catches every team off guard: a user updates their profile in us-east-1, the page refreshes, and the read hits the eu-west-1 replica that hasn't caught up yet. The user sees their old data. They update again. Now you have a mess.
The fix is embarrassingly simple:
class ConsistencyManager {
  // Read from primary for 5 seconds after any write
  private static readonly STICKY_MS = 5000;
  private recentWrites = new Map<string, number>();
  recordWrite(userId: string): void {
    this.recentWrites.set(userId, Date.now());
  }
  shouldReadFromPrimary(userId: string): boolean {
    const lastWrite = this.recentWrites.get(userId);
    if (lastWrite === undefined) return false;
    if (Date.now() - lastWrite < ConsistencyManager.STICKY_MS) return true;
    // Window expired: drop the entry so the map doesn't grow forever
    this.recentWrites.delete(userId);
    return false;
  }
}
After a write, route that user's reads to primary for 5 seconds. That's it. Replication catches up, the user sees fresh data, everyone's happy. We've used this pattern on every multi-region project and it's never been the thing that breaks.
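One caveat: an in-memory map only knows about writes handled by the same process. If several app instances sit behind a load balancer, the next read may land on an instance that never recorded the write. One possible way around that, sketched here under assumptions, is to carry the last-write timestamp in a short-lived cookie so any instance can make the decision. The cookie name `last_write_ms` and both helper functions are hypothetical.

```typescript
const STICKY_WINDOW_MS = 5000;       // same 5-second window as the in-memory version
const COOKIE_NAME = "last_write_ms"; // hypothetical cookie name

// Build the Set-Cookie header value to send after a successful write.
function writeMarkerCookie(nowMs: number): string {
  return `${COOKIE_NAME}=${nowMs}; Max-Age=${STICKY_WINDOW_MS / 1000}; Path=/; HttpOnly`;
}

// Decide from the incoming Cookie header whether this request should read from primary.
function shouldReadFromPrimary(cookieHeader: string | undefined, nowMs: number): boolean {
  if (!cookieHeader) return false;
  for (const part of cookieHeader.split(";")) {
    const [name, value] = part.trim().split("=");
    if (name !== COOKIE_NAME) continue;
    const lastWriteMs = Number(value);
    return isFinite(lastWriteMs) && nowMs - lastWriteMs < STICKY_WINDOW_MS;
  }
  return false;
}

console.log(writeMarkerCookie(10000));
// "last_write_ms=10000; Max-Age=5; Path=/; HttpOnly"
console.log(shouldReadFromPrimary("session=abc; last_write_ms=10000", 12000)); // true (2s after the write)
console.log(shouldReadFromPrimary("session=abc; last_write_ms=10000", 16000)); // false (6s after the write)
console.log(shouldReadFromPrimary(undefined, 16000));                          // false (no cookie at all)
```

The cookie's `Max-Age` matches the window, so the browser drops it on its own and there is no server-side state to clean up. A shared store such as the regional cache would work too; the cookie is simply the option with the fewest moving parts.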
The Real Cost (It's More Than You Think)
Here's the math we walk clients through before they commit:
Active-Passive: +40-60% of your current infra bill
Active-Active Reads: +80-120%
Active-Active Full: +150-200%
Plus the hidden costs nobody mentions:
→ Engineering time to build and test: 4-8 weeks
→ Ongoing operational complexity: +1 senior engineer's attention
→ Incident response is now 2x harder (which region broke?)
→ Every new service needs multi-region consideration
The ROI formula is simple: if one hour of downtime costs more than one month of multi-region infrastructure, do it. For the client who lost $180K? Their multi-region setup costs $6K/month. One prevented outage pays for 2.5 years.
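Spelled out as arithmetic, using only the figures quoted in this article. The helper names are hypothetical, and the per-hour figure assumes the $180K loss is spread evenly across the 4-hour outage, which is a simplification.

```typescript
// Months of multi-region spend covered by avoiding a single outage.
function paybackMonths(outageCostUsd: number, monthlyCostUsd: number): number {
  return outageCostUsd / monthlyCostUsd;
}

// The rule of thumb above: worth it if one hour of downtime costs more
// than one month of multi-region infrastructure.
function isWorthIt(downtimeCostPerHourUsd: number, monthlyCostUsd: number): boolean {
  return downtimeCostPerHourUsd > monthlyCostUsd;
}

const outageCostUsd = 180000;         // lost revenue from the 4-hour outage
const monthlyCostUsd = 6000;          // the client's multi-region bill
const perHourUsd = outageCostUsd / 4; // assumption: loss spread evenly over the outage

console.log(paybackMonths(outageCostUsd, monthlyCostUsd)); // 30 (months, i.e. 2.5 years)
console.log(perHourUsd);                                   // 45000
console.log(isWorthIt(perHourUsd, monthlyCostUsd));        // true (45000 > 6000)
```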
Stop Over-Engineering This
Here's what we tell every client: start with Active-Passive. It takes a week to set up, adds roughly 40-60% to your infra bill, and gives you the resilience story. If your EU users start complaining about latency, add read replicas. That's Active-Active Reads, and it handles 80% of global use cases.
Full Active-Active? We've built it exactly twice in five years. Both times for financial services companies processing payments on multiple continents. If you're not in that category, you almost certainly don't need it.
The best multi-region architecture is the simplest one that solves your actual problem. Not the one that looks impressive in an architecture review.