Observability That Actually Helps You Sleep at Night
Your Monitoring Is Noise
You have Datadog. Or Grafana. Or New Relic. The dashboards look impressive. Nobody looks at them. When something breaks, the team opens Slack and asks "is anyone else seeing this?" — which means your $50k/year monitoring investment is a screensaver.
Why this costs you: We audited a client spending $63K/year on observability tools. In 6 months, they had 14 production incidents. Average time to detection: 8.3 minutes (customers noticed first, not monitoring). Average time to resolution: 47 minutes. Total revenue lost to undetected incidents: $340K. The monitoring was generating 2,400 alerts per week — all ignored.
Observability isn't dashboards. It's the ability to understand what your system is doing when things go wrong — and ideally, before they go wrong.
The Three Pillars (Actually Useful Version)
Pillar 1: Structured Logging
// ❌ Useless log
console.log("Order processed");
// ❌ Slightly better but still useless
console.log(`Order ${orderId} processed for customer ${customerId}`);
// ✅ Structured, searchable, actionable
logger.info("order.processed", {
  orderId,
  customerId,
  total: order.total,
  itemCount: order.items.length,
  paymentMethod: order.paymentMethod,
  processingTimeMs: Date.now() - startTime,
  isFirstOrder: customer.orderCount === 1,
});

The rules:
- Every log has a dot-notation event name (order.processed, payment.failed)
- Every log includes the entity IDs involved
- Every log includes timing information
- Logs are JSON, not strings — so you can query them
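One way to enforce these rules is a thin wrapper around a JSON logger. Here is a sketch using pino; any structured logger works, and the wrapper shape is an assumption rather than a prescribed API:

// logger.ts, a sketch built on pino; the wrapper shape is an assumption, not a required API
import pino from "pino";

const base = pino({ level: process.env.LOG_LEVEL ?? "info" });

// Every call takes a dot-notation event name plus a bag of context fields,
// and emits a single JSON line you can query later.
export const logger = {
  info: (event: string, fields: Record<string, unknown> = {}) => base.info({ event, ...fields }),
  warn: (event: string, fields: Record<string, unknown> = {}) => base.warn({ event, ...fields }),
  error: (event: string, fields: Record<string, unknown> = {}) => base.error({ event, ...fields }),
};

// Usage matches the example above:
//   logger.info("order.processed", { orderId, customerId, processingTimeMs });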
Pillar 2: Metrics That Matter
Stop measuring everything. Measure these:
Business Metrics (the ones that pay the bills):
├── Orders per minute (is the business working?)
├── Revenue per hour (are we making money?)
├── Checkout completion rate (is checkout broken?)
└── Error rate by endpoint (what's failing?)
Infrastructure Metrics (the ones that predict problems):
├── Response time p50/p95/p99
├── Database connection pool utilization
├── Memory usage trend (not current, TREND)
├── Queue depth and processing lag
└── Disk usage and growth rate
THE golden metric:
└── "Can a customer complete a purchase right now?"
If you can only monitor one thing, monitor this.
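Wiring these up is usually a handful of counters, not a platform project. A sketch with prom-client; the metric names, the totalCents field, and the PromQL query are illustrative assumptions:

// metrics.ts, a sketch with prom-client; metric names are illustrative
import client from "prom-client";

// Business metrics: plain counters that Prometheus/Grafana can rate() over time.
export const ordersTotal = new client.Counter({
  name: "orders_total",
  help: "Orders successfully placed",
});

export const revenueCentsTotal = new client.Counter({
  name: "revenue_cents_total",
  help: "Revenue from completed orders, in cents",
});

export const checkoutAttempts = new client.Counter({
  name: "checkout_attempts_total",
  help: "Checkouts started",
});

export const checkoutSuccesses = new client.Counter({
  name: "checkout_successes_total",
  help: "Checkouts completed",
});

export const requestErrors = new client.Counter({
  name: "request_errors_total",
  help: "5xx responses, labeled by endpoint",
  labelNames: ["endpoint"],
});

// In the order handler:
//   ordersTotal.inc();
//   revenueCentsTotal.inc(order.totalCents);
// Checkout completion rate (the closest proxy for the golden metric) in PromQL:
//   sum(rate(checkout_successes_total[5m])) / sum(rate(checkout_attempts_total[5m]))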
Pillar 3: Distributed Tracing
For any request that touches multiple services or takes > 200ms:
// Trace a critical path
// SpanStatusCode comes from @opentelemetry/api; `tracer` is assumed to be your
// app's thin wrapper, where tracer.trace(name, fn) runs fn inside a child span.
import { SpanStatusCode } from "@opentelemetry/api";

const span = tracer.startSpan("checkout.process");
span.setAttributes({
  "checkout.orderId": orderId,
  "checkout.total": total,
  "checkout.itemCount": items.length,
});

try {
  const payment = await tracer.trace("checkout.payment", () =>
    processPayment(order)
  );
  const fulfillment = await tracer.trace("checkout.fulfillment", () =>
    createFulfillment(order)
  );
  span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
  span.setStatus({ code: SpanStatusCode.ERROR });
  span.recordException(error);
  throw error;
} finally {
  span.end();
}

The Alert Strategy That Doesn't Cry Wolf
Tier 1: Page Someone (Immediately)
These wake people up at 3 AM:
alerts:
  - name: checkout_broken
    condition: checkout_success_rate < 90%
    for: 5 minutes
    severity: critical
    action: page on-call
  - name: payment_failures_spike
    condition: payment_failure_rate > 15%
    for: 3 minutes
    severity: critical
    action: page on-call
  - name: site_down
    condition: health_check_failures > 3
    for: 2 minutes
    severity: critical
    action: page on-call

Rules for Tier 1:
- Maximum 3-5 alert types
- Every alert has a runbook linked
- If it pages and doesn't need action, remove it
- Review monthly: if an alert fires and nobody acts, it's noise
The cost of too many alerts: One team had 47 different paging alerts. On-call got woken up 3-5 times per night. After 4 months, 60% of the team refused the on-call rotation and two senior engineers quit, citing burnout. Cost to replace them: $180K in recruiting plus 6 months of reduced velocity.
We reduced alerts to 4 critical types. Pages dropped from 4.2/night to 0.3/night. Zero engineers quit in the following 12 months.
Tier 2: Notify the Team (Business Hours)
alerts:
  - name: response_time_degraded
    condition: p95_response_time > 2s
    for: 15 minutes
    severity: warning
    action: slack #engineering
  - name: error_rate_elevated
    condition: error_rate > 2%
    for: 10 minutes
    severity: warning
    action: slack #engineering
  - name: disk_space_low
    condition: disk_usage > 80%
    for: 30 minutes
    severity: warning
    action: slack #infrastructure

Tier 3: Log for Investigation (No Notification)
Everything else. Visible in dashboards, searchable in logs, but doesn't interrupt anyone.
The Runbook Pattern
Every Tier 1 alert needs a runbook. No exceptions.
## Alert: checkout_broken
**What it means:** Checkout success rate dropped below 90%
### Quick diagnosis (< 2 minutes)
1. Check payment provider status: [status page URL]
2. Check database connections: `SELECT count(*) FROM pg_stat_activity`
3. Check recent deployments: [deployment dashboard URL]
### Common causes
1. **Payment provider outage**
→ Action: Enable backup provider, notify customers
2. **Database connection pool exhausted**
→ Action: Restart app servers, investigate long-running queries
3. **Bad deployment**
→ Action: Rollback last deployment
### Escalation
If not resolved in 15 minutes:
→ Page @engineering-lead
If not resolved in 30 minutes:
→ Page @cto + notify @customer-support

The Observability Stack (Startup Budget)
You don't need to spend $50k/year:
| Component | Budget Option | Premium Option |
|---|---|---|
| Logging | Grafana Loki (free) | Datadog ($$$) |
| Metrics | Prometheus + Grafana (free) | Datadog ($$$) |
| Tracing | Jaeger (free) | Honeycomb ($$) |
| Alerting | Grafana Alerting (free) | PagerDuty ($$) |
| Uptime | UptimeRobot ($7/mo) | Datadog Synthetics ($$$) |
Total budget option: $50-200/month
Total premium option: $2,000-10,000/month
Real ROI calculation: One client switched from Datadog ($6,800/month) to a self-hosted Grafana stack ($180/month), saving $79K annually. Time to implement: 2 weeks. The budget stack detected incidents just as fast; the premium spend was going to features they never used.
Start with the budget option. Upgrade components that become pain points.
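If you go with the Prometheus option, the application side is small. A sketch with prom-client and Express, written as a hypothetical standalone exporter; in practice you would mount the route on your existing app:

// Expose metrics for Prometheus to scrape; the port and path are conventions, not requirements.
import express from "express";
import client from "prom-client";

client.collectDefaultMetrics(); // process-level metrics: memory, event loop lag, GC

const app = express();

app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(9464);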
The Implementation Roadmap
Week 1: The Basics
- Structured logging in your application
- Health check endpoint that tests critical dependencies (a sketch follows this list)
- Uptime monitoring for your site and API
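The Week 1 health check deserves a concrete shape, because an endpoint that always returns 200 proves nothing. A sketch assuming Express, a pg connection pool, and an ioredis client; the module paths are made up for the example:

// health.ts, illustrative only: `pool` (pg) and `redis` (ioredis) are assumed to exist in your app
import { Router } from "express";
import { pool } from "./db";
import { redis } from "./cache";

export const health = Router();

health.get("/healthz", async (_req, res) => {
  const checks: Record<string, boolean> = {};

  try {
    await pool.query("SELECT 1"); // real round trip to the database
    checks.database = true;
  } catch {
    checks.database = false;
  }

  try {
    checks.cache = (await redis.ping()) === "PONG";
  } catch {
    checks.cache = false;
  }

  // Add similar probes for queues, the payment provider's status endpoint, etc.
  const healthy = Object.values(checks).every(Boolean);
  res.status(healthy ? 200 : 503).json({ healthy, checks });
});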
Week 2: Business Metrics
- Order rate, revenue, and checkout completion dashboards
- Error rate by endpoint
- Tier 1 alerts for checkout and payment
Week 3: Infrastructure Metrics
- Response time percentiles
- Database and cache metrics
- Queue depth monitoring
Week 4: Runbooks
- Write a runbook for every Tier 1 alert
- Run a game day: trigger an alert and follow the runbook
- Iterate on what's missing
The Test
Here's how you know your observability works: at 2 AM, a Tier 1 alert fires. The on-call engineer opens the runbook, follows the steps, and resolves the issue in 15 minutes — without waking anyone else up.
The math that matters:
Bad observability (before):
- Mean time to detection (MTTD): 8.3 minutes (customers report it)
- Mean time to resolution (MTTR): 47 minutes
- Incidents per month: 14
- Downtime per month: 658 minutes (11 hours)
- Revenue lost at $15K/hour GMV: $165K/month
Good observability (after):
- MTTD: 1.2 minutes (alerts fire before customers notice)
- MTTR: 12 minutes (runbooks guide resolution)
- Incidents per month: 14 (same issues, better response)
- Downtime per month: 168 minutes (2.8 hours)
- Revenue lost: $42K/month
Improvement: $123K/month in recovered revenue
Investment: $4K in engineering time to fix alerts + runbooks
ROI: 30x in first month
If that's not your reality, your observability needs work. Not more dashboards — better alerts, better runbooks, and better signal-to-noise ratio.