Observability That Actually Helps You Sleep at Night
Your Monitoring Is Noise
You have Datadog. Or Grafana. Or New Relic. The dashboards look impressive. Nobody looks at them. When something breaks, the team opens Slack and asks "is anyone else seeing this?" — which means your $50k/year monitoring investment is a screensaver.
Why this costs you: We audited a client spending $63K/year on observability tools. In 6 months, they had 14 production incidents. Average time to detection: 8.3 minutes (customers noticed first, not monitoring). Average time to resolution: 47 minutes. Total revenue lost to undetected incidents: $340K. The monitoring was generating 2,400 alerts per week — all ignored.
Observability isn't dashboards. It's the ability to understand what your system is doing when things go wrong — and ideally, before they go wrong.
The Three Pillars (Actually Useful Version)
Pillar 1: Structured Logging
// ❌ Useless log
console.log("Order processed");
// ❌ Slightly better but still useless
console.log(`Order ${orderId} processed for customer ${customerId}`);
// ✅ Structured, searchable, actionable
logger.info("order.processed", {
  orderId,
  customerId,
  total: order.total,
  itemCount: order.items.length,
  paymentMethod: order.paymentMethod,
  processingTimeMs: Date.now() - startTime,
  isFirstOrder: customer.orderCount === 1,
});

The rules:
- Every log has a dot-notation event name (order.processed, payment.failed)
- Every log includes the entity IDs involved
- Every log includes timing information
- Logs are JSON, not strings — so you can query them
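One way to enforce these rules is a thin wrapper around a JSON logger. Here is a sketch using pino; any structured logger works, and the wrapper shape is an assumption rather than a prescribed API:

// logger.ts, a sketch built on pino; the wrapper shape is an assumption, not a required API
import pino from "pino";

const base = pino({ level: process.env.LOG_LEVEL ?? "info" });

// Every call takes a dot-notation event name plus a bag of context fields,
// and emits a single JSON line you can query later.
export const logger = {
  info: (event: string, fields: Record<string, unknown> = {}) => base.info({ event, ...fields }),
  warn: (event: string, fields: Record<string, unknown> = {}) => base.warn({ event, ...fields }),
  error: (event: string, fields: Record<string, unknown> = {}) => base.error({ event, ...fields }),
};

// Usage matches the example above:
//   logger.info("order.processed", { orderId, customerId, processingTimeMs });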
Pillar 2: Metrics That Matter
Stop measuring everything. Measure these:
Business Metrics (the ones that pay the bills):
├── Orders per minute (is the business working?)
├── Revenue per hour (are we making money?)
├── Checkout completion rate (is checkout broken?)
└── Error rate by endpoint (what's failing?)
Infrastructure Metrics (the ones that predict problems):
├── Response time p50/p95/p99
├── Database connection pool utilization
├── Memory usage trend (not current, TREND)
├── Queue depth and processing lag
└── Disk usage and growth rate
THE golden metric:
└── "Can a customer complete a purchase right now?"
If you can only monitor one thing, monitor this.
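Wiring these up is usually a handful of counters, not a platform project. A sketch with prom-client; the metric names, the totalCents field, and the PromQL query are illustrative assumptions:

// metrics.ts, a sketch with prom-client; metric names are illustrative
import client from "prom-client";

// Business metrics: plain counters that Prometheus/Grafana can rate() over time.
export const ordersTotal = new client.Counter({
  name: "orders_total",
  help: "Orders successfully placed",
});

export const revenueCentsTotal = new client.Counter({
  name: "revenue_cents_total",
  help: "Revenue from completed orders, in cents",
});

export const checkoutAttempts = new client.Counter({
  name: "checkout_attempts_total",
  help: "Checkouts started",
});

export const checkoutSuccesses = new client.Counter({
  name: "checkout_successes_total",
  help: "Checkouts completed",
});

export const requestErrors = new client.Counter({
  name: "request_errors_total",
  help: "5xx responses, labeled by endpoint",
  labelNames: ["endpoint"],
});

// In the order handler:
//   ordersTotal.inc();
//   revenueCentsTotal.inc(order.totalCents);
// Checkout completion rate (the closest proxy for the golden metric) in PromQL:
//   sum(rate(checkout_successes_total[5m])) / sum(rate(checkout_attempts_total[5m]))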
Pillar 3: Distributed Tracing
For any request that touches multiple services or takes > 200ms:
// Trace a critical path
// SpanStatusCode comes from @opentelemetry/api; `tracer` is assumed to be your
// app's thin wrapper, where tracer.trace(name, fn) runs fn inside a child span.
import { SpanStatusCode } from "@opentelemetry/api";

const span = tracer.startSpan("checkout.process");
span.setAttributes({
  "checkout.orderId": orderId,
  "checkout.total": total,
  "checkout.itemCount": items.length,
});

try {
  const payment = await tracer.trace("checkout.payment", () =>
    processPayment(order)
  );
  const fulfillment = await tracer.trace("checkout.fulfillment", () =>
    createFulfillment(order)
  );
  span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
  span.setStatus({ code: SpanStatusCode.ERROR });
  span.recordException(error);
  throw error;
} finally {
  span.end();
}

The Alert Strategy That Doesn't Cry Wolf
Tier 1: Page Someone (Immediately)
These wake people up at 3 AM:
alerts:
  - name: checkout_broken
    condition: checkout_success_rate < 90%
    for: 5 minutes
    severity: critical
    action: page on-call
  - name: payment_failures_spike
    condition: payment_failure_rate > 15%
    for: 3 minutes
    severity: critical
    action: page on-call
  - name: site_down
    condition: health_check_failures > 3
    for: 2 minutes
    severity: critical
    action: page on-call

Rules for Tier 1:
- Maximum 3-5 alert types
- Every alert has a runbook linked
- If it pages and doesn't need action, remove it
- Review monthly: if an alert fires and nobody acts, it's noise
The cost of too many alerts: One team had 47 different paging alerts. On-call got woken up 3-5 times per night. After 4 months, 60% of the team refused the on-call rotation and two senior engineers quit, citing burnout. Cost to replace them: $180K in recruiting plus 6 months of reduced velocity.
We reduced alerts to 4 critical types. Pages dropped from 4.2/night to 0.3/night. Zero engineers quit in the following 12 months.
Tier 2: Notify the Team (Business Hours)
alerts:
  - name: response_time_degraded
    condition: p95_response_time > 2s
    for: 15 minutes
    severity: warning
    action: slack #engineering
  - name: error_rate_elevated
    condition: error_rate > 2%
    for: 10 minutes
    severity: warning
    action: slack #engineering
  - name: disk_space_low
    condition: disk_usage > 80%
    for: 30 minutes
    severity: warning
    action: slack #infrastructure

Tier 3: Log for Investigation (No Notification)
Everything else. Visible in dashboards, searchable in logs, but doesn't interrupt anyone.
The Runbook Pattern
Every Tier 1 alert needs a runbook. No exceptions.
## Alert: checkout_broken
**What it means:** Checkout success rate dropped below 90%
### Quick diagnosis (< 2 minutes)
1. Check payment provider status: [status page URL]
2. Check database connections: `SELECT count(*) FROM pg_stat_activity`
3. Check recent deployments: [deployment dashboard URL]
### Common causes
1. **Payment provider outage**
→ Action: Enable backup provider, notify customers
2. **Database connection pool exhausted**
→ Action: Restart app servers, investigate long-running queries
3. **Bad deployment**
→ Action: Rollback last deployment
### Escalation
If not resolved in 15 minutes:
→ Page @engineering-lead
If not resolved in 30 minutes:
→ Page @cto + notify @customer-support

The Observability Stack (Startup Budget)
You don't need to spend $50k/year:
| Component | Budget Option | Premium Option |
|---|---|---|
| Logging | Grafana Loki (free) | Datadog ($$$) |
| Metrics | Prometheus + Grafana (free) | Datadog ($$$) |
| Tracing | Jaeger (free) | Honeycomb ($$) |
| Alerting | Grafana Alerting (free) | PagerDuty ($$) |
| Uptime | UptimeRobot ($7/mo) | Datadog Synthetics ($$$) |
Total budget option: $50-200/month
Total premium option: $2,000-10,000/month
Real ROI calculation: One client switched from Datadog ($6,800/month) to a self-hosted Grafana stack ($180/month), saving $79K annually. Time to implement: 2 weeks. The budget stack detected incidents just as fast; the premium spend was going to features they never used.
Start with the budget option. Upgrade components that become pain points.
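If you go with the Prometheus option, the application side is small. A sketch with prom-client and Express, written as a hypothetical standalone exporter; in practice you would mount the route on your existing app:

// Expose metrics for Prometheus to scrape; the port and path are conventions, not requirements.
import express from "express";
import client from "prom-client";

client.collectDefaultMetrics(); // process-level metrics: memory, event loop lag, GC

const app = express();

app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(9464);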
The Implementation Roadmap
Week 1: The Basics
- Structured logging in your application
- Health check endpoint that tests critical dependencies (a sketch follows this list)
- Uptime monitoring for your site and API
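The Week 1 health check deserves a concrete shape, because an endpoint that always returns 200 proves nothing. A sketch assuming Express, a pg connection pool, and an ioredis client; the module paths are made up for the example:

// health.ts, illustrative only: `pool` (pg) and `redis` (ioredis) are assumed to exist in your app
import { Router } from "express";
import { pool } from "./db";
import { redis } from "./cache";

export const health = Router();

health.get("/healthz", async (_req, res) => {
  const checks: Record<string, boolean> = {};

  try {
    await pool.query("SELECT 1"); // real round trip to the database
    checks.database = true;
  } catch {
    checks.database = false;
  }

  try {
    checks.cache = (await redis.ping()) === "PONG";
  } catch {
    checks.cache = false;
  }

  // Add similar probes for queues, the payment provider's status endpoint, etc.
  const healthy = Object.values(checks).every(Boolean);
  res.status(healthy ? 200 : 503).json({ healthy, checks });
});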
Week 2: Business Metrics
- Order rate, revenue, and checkout completion dashboards
- Error rate by endpoint
- Tier 1 alerts for checkout and payment
Week 3: Infrastructure Metrics
- Response time percentiles
- Database and cache metrics
- Queue depth monitoring
Week 4: Runbooks
- Write a runbook for every Tier 1 alert
- Run a game day: trigger an alert and follow the runbook
- Iterate on what's missing
The Test
Here's how you know your observability works: at 2 AM, a Tier 1 alert fires. The on-call engineer opens the runbook, follows the steps, and resolves the issue in 15 minutes — without waking anyone else up.
The math that matters:
Bad observability (before):
- Mean time to detection (MTTD): 8.3 minutes (customers report it)
- Mean time to resolution (MTTR): 47 minutes
- Incidents per month: 14
- Downtime per month: 658 minutes (11 hours)
- Revenue lost at $15K/hour GMV: $165K/month
Good observability (after):
- MTTD: 1.2 minutes (alerts fire before customers notice)
- MTTR: 12 minutes (runbooks guide resolution)
- Incidents per month: 14 (same issues, better response)
- Downtime per month: 168 minutes (2.8 hours)
- Revenue lost: $42K/month
Improvement: $123K/month in recovered revenue
Investment: $4K in engineering time to fix alerts + runbooks
ROI: 30x in first month
If that's not your reality, your observability needs work. Not more dashboards — better alerts, better runbooks, and better signal-to-noise ratio.