Alerting & Incident Response
The Beautiful Dashboard That Nobody Watches
Your monitoring setup is gorgeous. Grafana dashboards with perfect visualizations. Metrics flowing in from every service. Logs are centralized and searchable. You’ve done everything right.
Then 2 AM arrives. Your database fills up to 99% capacity. The application starts queueing requests. Response times triple. Customers start complaining on Twitter: “Is your service down? I can’t place my order.”
Your monitoring dashboard would have shown all of this if anyone had been watching. But nobody was. A security camera pointed at an empty room is useless. What you need is alerting: automated systems that detect problems and notify the right person, at the right time, with enough context to act.
But here’s the trap: bad alerting is worse than no alerting. Imagine 500 alert emails a day, 95% of them noise. Engineers stop reading them. The one critical alert gets lost in the inbox. This is “alert fatigue,” and it’s deadly.
The goal of alerting is simple but elusive: actionable alerts that wake the right person at the right time with enough context to start solving the problem.
What Makes a Good Alert?
Not every problem deserves an alert. Let’s clarify the distinctions:
Actionable: The alert should tell you what action to take. “Database is 95% full” is actionable (run the cleanup job, increase capacity, investigate large queries). “CPU is 78%” is not actionable (high CPU might not matter if error rates are normal). This is the heart of the symptoms-vs.-causes distinction: alert on conditions that demand a specific response, not on raw resource metrics that may or may not matter.
Urgent: Some problems need immediate attention (payment processing is down), others can wait until business hours (test environment metrics anomaly). If you alert on non-urgent things, people ignore them.
Contextual: The alert should include enough information to start investigating: which service? which endpoint? how many errors? affected how many users?
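To make “contextual” concrete, here is a minimal sketch in Python of the fields an alert payload might carry; the field names, values, and URL are illustrative, not taken from any particular alerting tool.

from dataclasses import dataclass

@dataclass
class Alert:
    severity: str        # "critical", "warning", or "info"
    service: str         # which service is affected
    summary: str         # one-line, human-readable symptom
    metric_value: float  # the measurement that tripped the threshold
    threshold: float     # the threshold it crossed
    affected_users: int  # rough blast radius, if known
    runbook_url: str     # where to start investigating

alert = Alert(
    severity="critical",
    service="payment-service",
    summary="Payment error rate above 5% for 5 minutes",
    metric_value=0.072,
    threshold=0.05,
    affected_users=1200,
    runbook_url="https://wiki.example.com/runbooks/payment-error-rate",
)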
Let’s categorize alerts by severity:
- CRITICAL (P1): Pages someone immediately. Revenue impact, customer-facing outage, data loss risk. Example: “payment processing error rate above 5%.”
- WARNING (P2): Should be investigated today. Degradation that isn’t breaking things yet, but trending badly. Example: “database query latency p99 above 2 seconds.”
- INFO (P3): FYI, for dashboards. Interesting events but not urgent. Example: “10 pods restarted in the last hour.”
Most teams make the mistake of alerting on too many things, expecting the on-call person to triage. This causes alert fatigue. Instead, be ruthless: only alert on things that genuinely need human intervention right now.
SLO-Based Alerting: Alert When It Matters
The most advanced teams don’t alert on metrics; they alert on SLOs (Service Level Objectives).
An SLO is a commitment: “99.9% of requests will succeed within 200 ms, measured monthly.” This implies an error budget: 0.1% of requests can fail, or equivalently, about 43 minutes of downtime per month (0.1% of a 30-day month is 43.2 minutes).
Naive alerting: “Error rate above 0.1%” — this fires on every brief blip above the threshold, yet a sustained slow burn just under it can quietly eat the whole budget while customers suffer.
Burn rate alerting is smarter. It asks: “At the current rate of burning through our error budget, when will we run out?” If we have 100 errors left in our budget for the month (30 days), and we’re currently burning 10 errors per minute, we’ll run out in 10 minutes. That’s an urgent alert. If we’re burning 1 error per hour, we can wait for business hours.
SLO: 99.9% success rate
Error budget per month: 0.1% of requests
If we have 1M requests/day (~30M requests/month): 30,000 errors allowed per month
Burn rate alert rules:
- Short window (5 minute): If we burn 5% of budget per 5 min → alert
(Extrapolates to emptying budget in ~100 minutes)
- Long window (1 hour): If we burn 1% of budget per hour → alert
(Extrapolates to emptying budget in ~100 hours)
This catches both fast catastrophes (5 minute alert) and slow degradation (1 hour alert), without spurious alerts for tiny blips.
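Here is a minimal sketch in Python of the two-window rule above, assuming a 99.9% SLO and roughly 1M requests/day; the error counts would normally come from your metrics backend and are passed in directly here.

SLO_TARGET = 0.999
MONTHLY_REQUESTS = 30_000_000                       # ~1M requests/day * 30 days
ERROR_BUDGET = (1 - SLO_TARGET) * MONTHLY_REQUESTS  # 30,000 errors per month

def budget_consumed(errors_in_window: int) -> float:
    """Fraction of the monthly error budget consumed by this window's errors."""
    return errors_in_window / ERROR_BUDGET

def should_page(errors_last_5m: int, errors_last_1h: int) -> bool:
    fast_burn = budget_consumed(errors_last_5m) > 0.05  # would empty the budget in ~100 minutes
    slow_burn = budget_consumed(errors_last_1h) > 0.01  # would empty the budget in ~100 hours
    return fast_burn or slow_burn

# Example: 2,000 errors in the last 5 minutes is ~6.7% of the budget, so page.
print(should_page(errors_last_5m=2_000, errors_last_1h=2_500))  # True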
The Incident Response Lifecycle
When an alert fires, a sequence of events unfolds. Good organizations have practiced and refined this:
Detection
↓
Triage (Is it real? How severe?)
↓
Mitigation (Stop the bleeding)
↓
Resolution (Fix the root cause)
↓
Post-Mortem (Learn for next time)
Detection: The monitoring system notices the problem. This should be fast (minutes, not hours).
Triage: Is the alert real or a fluke? How many users are affected? Is it a complete outage or partial degradation? This usually takes 1-5 minutes for practiced teams.
Mitigation: Stop the bleeding immediately. This might be reverting a bad deploy, scaling up instances, rerouting traffic, or disabling a broken feature flag. The goal is to restore service to users ASAP, not necessarily to understand why it happened.
Resolution: Investigate the root cause and fix it properly. This is slower (hours, sometimes days) but prevents recurrence.
Post-Mortem: After the dust settles, the team meets to reconstruct the timeline, identify the root cause, and agree on action items to prevent it next time.
Alert Routing and On-Call
When an alert fires, who gets notified? This is handled by alert routing systems like PagerDuty, Grafana OnCall, or OpsGenie.
Typically, there’s a rotating on-call schedule. Monday–Friday, the backend team rotates who’s on-call. Weekends and nights might have a different schedule. The alert goes to the person currently on-call for that service.
Most teams use escalation policies:
- Alert fires at 2 AM.
- Page the primary on-call engineer.
- If they don’t acknowledge within 5 minutes (maybe they’re in the shower), page the secondary.
- If no one acks after 10 minutes, page the engineering manager.
- If still no response after 15 minutes, wake up the VP of Engineering (this rarely happens; it signals something catastrophic).
Good on-call tools offer one-click actions: acknowledge the alert (stop paging me, I’m on it), snooze for 30 minutes (I see it and I’m working on it, don’t re-page), resolve the alert (it’s fixed).
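The escalation chain itself is just data plus a clock. A minimal sketch in Python, using the delays from the example above (the role names are placeholders):

ESCALATION_POLICY = [
    (0,  "primary on-call"),
    (5,  "secondary on-call"),
    (10, "engineering manager"),
    (15, "VP of Engineering"),
]

def who_to_page(minutes_since_alert: float, acknowledged: bool) -> list[str]:
    """Everyone who should have been paged by now, if nobody has acknowledged."""
    if acknowledged:
        return []
    return [role for delay, role in ESCALATION_POLICY if minutes_since_alert >= delay]

print(who_to_page(12, acknowledged=False))
# ['primary on-call', 'secondary on-call', 'engineering manager']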
Runbooks: Your Incident Playbook
An alert with no context is useless. “Error rate spiking” — now what? Instead, attach runbooks to alerts: pre-written instructions for what to check and how to respond.
A runbook for “Payment Processing Error Rate Above 5%” might look like:
ALERT: Payment Processing Error Rate Exceeds 5%

CONTEXT:
- Check recent deployments in #deployments Slack channel
- Check the payment service logs:
  - Are there database connection errors?
  - Are there timeouts from the payment gateway?
  - Are there authentication failures?

MITIGATION (in priority order):
1. Check if a deploy happened in the last 30 minutes
   - If yes, consider rolling it back immediately
   - Command: kubectl rollout undo deployment/payment-service
2. Check database connections
   - Command: SELECT COUNT(*) FROM pg_stat_activity
   - If above 200, we're hitting the connection limit (max is 250)
   - Action: Scale up RDS to a larger instance type
3. Check payment gateway status
   - Visit: https://status.stripe.com/
   - If they're having issues, it will show there
   - Action: Post to #incidents "Stripe having issues"
4. If none of the above:
   - Join the #incidents Slack channel
   - Page the payment team lead
   - Escalation: VP Engineering if payment is still broken after 15 min

ESCALATION:
- If you can't reach anyone in 5 minutes, page up
- Incident commander: Call <on-call manager>
- Critical: Page the CEO (only if we're losing money this second)

FOLLOW-UP:
- After incident, create post-mortem ticket
- Assign to payment team
Great runbooks save 5-10 minutes on every incident. Over time, they accumulate into a body of tribal knowledge that new engineers can leverage.
Pro Tip: Keep runbooks next to your alerts. In Grafana, add a link to the runbook in the alert annotation. When the alert fires, the engineer sees the runbook right there, not buried in a wiki somewhere.
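For example, Prometheus- and Grafana-style alert rules attach extra information as annotations, and a runbook_url annotation is a common convention. The sketch below expresses one such rule as a Python dict; the metric names and URL are hypothetical.

payment_error_alert = {
    "alert": "PaymentErrorRateHigh",
    "expr": "rate(payment_errors_total[5m]) / rate(payment_requests_total[5m]) > 0.05",
    "for": "5m",
    "labels": {"severity": "critical"},
    "annotations": {
        "summary": "Payment processing error rate above 5%",
        "runbook_url": "https://wiki.example.com/runbooks/payment-error-rate",
    },
}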
Incident Severity Classification
Not all incidents are equal. The industry standard is SEV-0 through SEV-4:
- SEV-0 (Critical): Complete outage, many users affected, revenue impact. CEO should be informed. All hands on deck. Page everyone.
- SEV-1 (Urgent): Significant impact, some users affected, major degradation. Page on-call team, engineering leads. Expect updates every 15 minutes.
- SEV-2 (Major): Moderate impact, limited users affected. Page relevant service team. Updates every hour.
- SEV-3 (Minor): Small impact, few users affected, mostly annoying. Create ticket, notify team, investigate when you have capacity.
- SEV-4 (Informational): No user impact, just tracking it. Ticket only.
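As a sketch, the severity policy above can be encoded so that paging and update cadence follow automatically from the triage decision. The structure and key names here are illustrative in Python; the values are the example values from the list above.

SEVERITY_POLICY = {
    "SEV-0": {"page": "everyone",              "updates_minutes": None, "ticket": True},  # all hands on deck
    "SEV-1": {"page": "on-call + eng leads",   "updates_minutes": 15,   "ticket": True},
    "SEV-2": {"page": "owning service team",   "updates_minutes": 60,   "ticket": True},
    "SEV-3": {"page": None,                    "updates_minutes": None, "ticket": True},
    "SEV-4": {"page": None,                    "updates_minutes": None, "ticket": True},
}

def respond(severity: str) -> dict:
    """Look up who gets paged and how often stakeholders are updated."""
    return SEVERITY_POLICY[severity]

print(respond("SEV-1"))
# {'page': 'on-call + eng leads', 'updates_minutes': 15, 'ticket': True}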
The triage process (first 5 minutes) determines the severity. This informs how many people get paged, how quickly they’re expected to respond, and when escalation happens.
The Incident Commander Role
In SEV-0 incidents, a senior engineer becomes the “Incident Commander” (IC). Their job is not to fix the problem (let specialists do that) but to orchestrate:
- Keep a timeline of events (for the post-mortem)
- Coordinate between teams (database, backend, frontend, etc.)
- Update stakeholders and customers (via status page)
- Make decisions if there are conflicts
- Escalate if needed
The IC is not the person deepest in the problem; they’re above the fray, organizing. Good IC training is invaluable for big incidents.
Post-Mortems: Learning From Failure
After an incident is resolved, schedule a post-mortem meeting (within 24 hours while details are fresh). The goals:
- Reconstruct the timeline: What happened? 2:00 AM — deployment. 2:03 AM — error rate spikes. 2:05 AM — alert fires. 2:10 AM — engineer acknowledges the alert. 2:15 AM — root cause identified (N+1 query). 2:45 AM — fix deployed. 3:00 AM — error rate back to normal.
- Find the root cause: Not “we deployed bad code” (symptom) but “we didn’t run the new query against production-size data before deploying” (root cause).
- Be blameless: The goal is to learn, not to blame individuals. “Why did the engineer deploy without testing?” → “Because our testing process doesn’t catch N+1 queries. How do we fix that?”
- Action items: Concrete things to prevent recurrence. “Add a query performance check to the CI pipeline.” “Increase deployment tests for database queries.”
Follow the “5 Whys” technique:
Problem: Payment processing failed
Why? Error rate spiked to 50%
Why? Database query was slow
Why? We added a loop that queries the database per-item
Why? The engineer didn't test against production data volume
Why? Our staging environment is 10x smaller than production
Root cause: Staging environment doesn't match production, so issues don't surface before deployment
Action: Sync staging and production data weekly, or use a production clone for critical tests.
Good post-mortems create institutional memory. New engineers read past post-mortems and avoid the same mistakes.
Did You Know? Some teams have monthly “Chaos Engineering GameDays” where they intentionally break things in production (in a controlled way) to test alerting, incident response, and runbooks. It’s like a fire drill for your infrastructure. These reveal surprising gaps: alerts that don’t fire, runbooks with wrong commands, on-call engineers who aren’t actually contactable.
When Alerts Go Wrong
Alerting is hard. Common failures:
Alert fatigue: Too many alerts, most false positives. Engineers stop reading. The critical alert gets lost.
Cascading false positives: Service A health-checks Service B. Service B is briefly slow. Service A alerts. Service C health-checks Service A. Service C alerts. Suddenly you have 100 alerts for a 30-second hiccup in Service B. Design health checks and alerts so each service reports on its own health, not its dependencies’ (see Scenario 3 below).
Alerts that trigger the problem they measure: Monitoring is expensive. If your monitoring itself causes high CPU/memory, you might alert on the problem your monitoring created. Use passive monitoring where possible.
Alerts that can’t be fixed immediately: If your alert requires a 2-hour manual procedure, paging someone at 2 AM is cruel. Use runbooks to enable fast mitigation, or don’t alert (downgrade to SEV-3).
Trade-offs in Alerting
Sensitivity vs. Specificity: More sensitive alerts catch more real problems but produce more false positives. Less sensitive alerts miss some problems but fire less often, so each page is more trustworthy. There’s no perfect threshold; find the balance for your team.
Alert noise vs. on-call burden: Fewer alerts = happier engineers, but more incidents might slip through. More alerts = better detection, but increased on-call fatigue.
Alerting tool cost: PagerDuty, Grafana OnCall, etc., charge per incident or per user. Free tools have fewer features.
Engineer well-being: On-call is stressful. 2 AM pages damage sleep and morale. Invest in good runbooks, automation, and reasonable rotation periods. Most teams aim for about one week of on-call per month per engineer, with a second engineer as secondary backup.
Key Takeaways
- Alerting without monitoring is useless; monitoring without alerting is a security camera nobody watches. You need both.
- Alert on symptoms (error rate), not causes (CPU usage). High CPU without user impact doesn’t deserve an alert.
- Alert on SLOs and burn rate, not arbitrary thresholds. This ties alerts to business impact and prevents alert fatigue.
- Runbooks are your incident playbook. They save 5-10 minutes per incident and let junior engineers handle SEV-2s independently.
- The incident response lifecycle — detect, triage, mitigate, resolve, post-mortem — turns incidents into learning opportunities. Blameless post-mortems build trust and prevent recurrence.
- Test your alerts with GameDays. An alert that never fires might be broken. Chaos engineering catches these.
Practice Scenarios
Scenario 1: The 2 AM Payment Outage
At 1:57 AM, your payment service error rate crosses 5%. The CRITICAL alert fires. The on-call engineer’s phone buzzes. She logs in and sees the runbook: “Check recent deployments, check database connections, check payment gateway status.”
Deployment 30 minutes ago. She checks the payment service logs:
error_code="STRIPE_AUTHENTICATION_FAILED"
She checks Slack #deployments: “Deploy changed Stripe API key (updated in secrets manager).” But it looks like there’s a mismatch—the code is reading the old key from cache. She immediately escalates to the Stripe integration owner: “New secret didn’t get reloaded. Rollback or restart payment pods?”
They restart the payment pods. Within 2 minutes, the error rate drops back to 0.1%. Full recovery in 12 minutes from the initial alert. The post-mortem later produces one action item: “Add a health check that validates Stripe authentication on startup.”
Scenario 2: The Alert That Cried Wolf
Your team gets paged on “API latency p99 above 2 seconds.” They check and it’s… fine. Error rate is normal, users are happy. The on-call engineer silences the alert and goes back to sleep. Three days later, the same alert fires during an actual degradation. Now the engineer is skeptical. “Probably another false positive.” Doesn’t investigate. Customers get upset.
Lesson: Alert thresholds matter. Adjust the p99 threshold from 2 seconds to 3 seconds (was too sensitive). Or add logic: “only alert if p99 is above 2 seconds AND error rate is above 1%.” This prevents false positives.
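A minimal sketch in Python of that compound condition, using the thresholds from this scenario:

def should_page(p99_latency_seconds: float, error_rate: float) -> bool:
    # Page only when the service is both slow AND actually failing requests.
    return p99_latency_seconds > 2.0 and error_rate > 0.01

print(should_page(p99_latency_seconds=2.4, error_rate=0.002))  # False: slow but healthy
print(should_page(p99_latency_seconds=2.4, error_rate=0.03))   # True: slow and failing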
Scenario 3: The Cascade
Service A’s health check fails. Service A alerts. But Service A is actually fine; Service B (its dependency) is slow. Now the alerting system gets 50 alerts from all services that depend on B. The on-call engineer’s phone is buzzing constantly. They’ve lost situational awareness.
Better approach: Service A’s health check should report its own health separately from its dependencies’ status. It shouldn’t alert on dependency slowness; the dependency’s own alert should fire instead.
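A minimal sketch in Python of a health payload that keeps those two signals apart; the argument names and dependency labels are illustrative.

def health_check(own_checks_ok: bool, dependency_status: dict) -> dict:
    return {
        # Alert on this field only: it reflects the service's own health.
        "status": "ok" if own_checks_ok else "failing",
        # Dependency status is reported for dashboards and debugging, but a slow
        # dependency should trip that dependency's own alert, not this one.
        "dependencies": dependency_status,
    }

print(health_check(own_checks_ok=True, dependency_status={"service-b": "slow"}))
# {'status': 'ok', 'dependencies': {'service-b': 'slow'}}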
Next, we need to make sure services are actually alive and can accept traffic. This is where health checks come in.