KPIs to Monitor
The Overwhelm: Too Many Dashboards, Too Few Answers
You’ve set up your observability stack. Your Prometheus server is collecting 10,000 metrics. You’ve built 50 Grafana dashboards. Developers have created custom dashboards for their teams. Operations has dashboards for infrastructure. The security team has their own. On-call engineers have nightstand dashboards to check on their phones at 2 AM.
Then something breaks. Your pager goes off. You have four seconds to know which of those 50 dashboards tells you what’s actually wrong.
This is why we need KPIs — Key Performance Indicators. These are the metrics that matter. Not the 10,000 you’re collecting, but the 10-15 that actually tell you whether your system is healthy.
Google’s SRE team solved this with the Four Golden Signals. This simple framework tells you 80% of what you need to know about any service. After that, you drill deeper. But if you’re only monitoring these four things, you’re in pretty good shape.
The Four Golden Signals
Signal 1: Latency
How long do requests take to complete?
But here’s the catch: latency is meaningless if you don’t distinguish between successful and failed requests. A request that fails immediately with a 500 clocks in at 50ms; a request that hangs for 30 seconds before timing out clocks in at 30,000ms. Both are failures, but mixing them with your successful requests corrupts the signal in opposite directions: fast errors make a broken service look quick, and slow errors inflate your latency numbers.
Always measure latency for successful requests separately from failures.
More importantly, percentiles matter. The average request might take 100ms, but your users care about their individual request. If you’re averaging, 99 requests might complete in 80ms while one takes 10 seconds, and the average works out to roughly 179ms, a number that completely hides the user who just waited ten seconds.
Track these percentiles:
- p50 (median): 50% of requests are faster than this. This is the typical user experience.
- p95: 95% of requests are faster than this. Some of your users are experiencing this latency.
- p99: 99% of requests are faster than this. One in every 100 requests is slower; your unluckiest users live here.
The p99 latency tells the truth about your system. It isn’t a smooth, marketing-friendly number; it’s the reality your slowest users experience.
GET /api/orders HTTP/1.1
200 OK in 85ms <- typical
200 OK in 3500ms <- slow; requests like this are what set your p99
200 OK in 8000ms <- completed, but only after the client had already timed out and retried
500 in 50ms <- error
Latency metrics:
success_latency_p50: 85ms (good)
success_latency_p99: 3500ms (yikes! why so high?)
error_latency_p50: 50ms (errors are fast)
error_rate: 0.1% (acceptable, but investigate)
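If you want to see how that separation looks in code, here’s a minimal sketch using the Python prometheus_client library. The metric name, bucket boundaries, and the assumption that your handler returns an object with a status_code attribute are all illustrative, not something your stack prescribes:

```python
import time
from prometheus_client import Histogram, start_http_server

# Request duration, labeled by outcome so success and error latency are
# never mixed into one distribution. Bucket boundaries are illustrative,
# covering ~10ms fast paths up to a 30-second timeout.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["endpoint", "outcome"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30),
)

def timed_request(endpoint, handler):
    """Run a request handler, recording its duration under the right outcome.

    Assumes `handler()` returns an object with a `status_code` attribute;
    5xx responses and raised exceptions count as errors, everything else
    (including 4xx client errors) counts as a served request.
    """
    start = time.perf_counter()
    outcome = "error"
    try:
        response = handler()
        if response.status_code < 500:
            outcome = "success"
        return response
    finally:
        REQUEST_LATENCY.labels(endpoint=endpoint, outcome=outcome).observe(
            time.perf_counter() - start
        )

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
```

With the outcome label in place, histogram_quantile over the success series gives you exactly the success-only p50/p95/p99 recommended above.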
Signal 2: Traffic
How much demand is your system handling?
Traffic is the rate of requests hitting your service. Measure it as requests per second, but also break it down by endpoint:
api.orders.post: 500 req/s
api.orders.get: 2000 req/s
api.health: 50 req/s (liveness probes)
Traffic baseline is critical. You need to know your normal traffic pattern:
- Peak hour traffic: 5000 req/s
- Off-peak (2-4 AM): 500 req/s
- Weekly patterns: Friday afternoon spikes vs Sunday dips
Sudden traffic changes signal problems: a 10x traffic spike might be a DDoS, a viral tweet sending unexpected traffic, or a buggy client retrying infinitely.
Traffic metrics also help with capacity planning. If you’re growing 20% month-over-month, you need to increase capacity before you hit the ceiling.
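To make that concrete, here’s a quick back-of-the-envelope sketch in Python; the 12,000 req/s ceiling is an assumed number for illustration:

```python
import math

peak_rps = 5000          # current peak traffic, from the baseline above
capacity_rps = 12000     # assumed ceiling the current fleet can serve
monthly_growth = 0.20    # 20% month-over-month growth

# Months until peak traffic reaches capacity: solve peak * 1.2^n >= capacity
months_left = math.log(capacity_rps / peak_rps) / math.log(1 + monthly_growth)
print(f"~{months_left:.1f} months of headroom")  # ~4.8 months
```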
Signal 3: Errors
What percentage of requests are failing?
Error rate is usually the most important KPI. A 0.1% error rate on 10,000 requests/second means 10 requests per second are failing. That’s 864,000 failed requests per day. Is that acceptable?
It depends on your SLA. For a critical payment system, probably not. For a weather API, maybe. But you need to know your tolerance and alert when you exceed it.
Track error types separately:
- 5xx errors (your fault): database connection errors, unhandled exceptions, service crashes
- 4xx errors (client fault): malformed requests, authentication failures, missing resources
- Timeout errors: requests that took over 30 seconds and gave up
- External API errors: third-party services failed
error_rate_5xx: 0.05% <- you should fix this
error_rate_4xx: 0.2% <- expected, legitimate client errors
error_rate_timeout: 0.02% <- could indicate saturation
external_api_errors: 0.1% <- Stripe is having issues
The 4xx errors are usually healthy noise (invalid requests, auth failures). The 5xx errors and timeouts are the canaries in your coal mine.
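Here’s a toy sketch of that classification in Python. In production these counts would come from labeled counters in your metrics system, but the bucketing logic is the same; the sample data is invented to reproduce the rates above:

```python
from collections import Counter

# Hypothetical one-minute sample of request outcomes: HTTP status codes,
# plus sentinels for client-side timeouts and upstream (third-party) failures.
outcomes = (["200"] * 9963 + ["404"] * 20 + ["500"] * 5
            + ["timeout"] * 2 + ["upstream_error"] * 10)

def classify(outcome: str) -> str:
    """Bucket each request into the error classes tracked above."""
    if outcome == "timeout":
        return "timeout"
    if outcome == "upstream_error":
        return "external_api"
    if outcome.startswith("5"):
        return "5xx"
    if outcome.startswith("4"):
        return "4xx"
    return "ok"

counts = Counter(classify(o) for o in outcomes)
total = len(outcomes)
for bucket in ("5xx", "4xx", "timeout", "external_api"):
    print(f"error_rate_{bucket}: {counts[bucket] / total:.3%}")
```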
Signal 4: Saturation
How full is your system?
Saturation is the resource constraint that’s about to become a bottleneck. For a web server, it’s CPU, memory, and active connections. For a database, it’s connection pool usage, replication lag, and query queue depth. For a cache, it’s hit ratio and eviction rate.
Saturation is the hardest signal to define because it’s different for every component:
Web Server Saturation:
- CPU usage: above 70% means you’re approaching the limit
- Memory: above 80% leaves no buffer for spikes
- Active connections: if your pool size is 100 and you’re at 95, you’re one spike away from rejecting requests
Database Saturation:
- Connection pool usage: above 80% is concerning
- Query queue depth: how many queries are waiting to execute?
- Replication lag: how far replicas are behind the primary; heavy write load pushes them further behind
- Disk usage: above 80% on database disk leaves no room for growth
Cache Saturation:
- Hit ratio: below 80% means too many misses are falling through to slower, more expensive backend operations
- Eviction rate: how often is the cache evicting old data? High evictions mean the cache is too small
Why is saturation critical? Because saturation predicts degradation. If your database connection pool is 90% full, the next spike will cause rejections and timeouts. You want to see saturation climbing and add capacity before it causes errors.
db_connection_pool_usage: 45% <- healthy
db_connection_pool_usage: 75% <- concerning, monitor closely
db_connection_pool_usage: 92% <- ALERT! Add connections now
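Here’s a minimal sketch of how you might expose and interpret that one saturation number with prometheus_client; the gauge name and the 70%/90% thresholds are illustrative choices, not a standard:

```python
from prometheus_client import Gauge

# Fraction of database connections currently in use (illustrative metric name).
POOL_USAGE = Gauge("db_connection_pool_usage_ratio",
                   "Fraction of DB connections in use")

def record_pool_usage(in_use: int, pool_size: int) -> str:
    """Export pool usage and classify it against example thresholds."""
    usage = in_use / pool_size
    POOL_USAGE.set(usage)
    if usage >= 0.90:
        return "critical"   # one spike away from rejecting requests
    if usage >= 0.70:
        return "warning"    # concerning, watch closely
    return "healthy"

print(record_pool_usage(45, 100))   # healthy
print(record_pool_usage(92, 100))   # critical
```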
Percentiles: Why Averages Lie
Let me illustrate with a concrete example. You’re monitoring request latency for your checkout API:
Request latencies over 1 minute:
100 requests at 50ms
100 requests at 75ms
100 requests at 90ms
100 requests at 105ms
100 requests at 120ms
1 request at 30,000ms (database hung)
Average: (100*50 + 100*75 + 100*90 + 100*105 + 100*120 + 1*30000) / 501
= (5000 + 7500 + 9000 + 10500 + 12000 + 30000) / 501
= 74,000 / 501
= 147.7ms
p50: 90ms (the middle request)
p95: 120ms (95% of requests are at or below this)
p99: 120ms (99% of requests are at or below this)
p99.9: 30,000ms (that single terrible request)
The average of 147.7ms is misleading. It’s being pulled up by one terrible outlier. But the p99 of 120ms actually tells you: “For 99 out of 100 requests, latency is 120ms or less.” That’s more useful.
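You can reproduce those numbers yourself. A short Python sketch using the nearest-rank definition of a percentile; interpolating implementations will report slightly different tail values:

```python
import math

# The one-minute sample from above: 100 requests at each latency, plus one outlier.
latencies = sorted([50] * 100 + [75] * 100 + [90] * 100
                   + [105] * 100 + [120] * 100 + [30000])

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    rank = math.ceil(p / 100 * len(values))
    return values[rank - 1]

print(f"avg:   {sum(latencies) / len(latencies):.1f} ms")  # 147.7, the misleading average
print(f"p50:   {percentile(latencies, 50)} ms")            # 90
print(f"p95:   {percentile(latencies, 95)} ms")            # 120
print(f"p99:   {percentile(latencies, 99)} ms")            # 120
print(f"p99.9: {percentile(latencies, 99.9)} ms")          # 30000
```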
For SLAs, you’ll commit to percentiles: “99% of requests complete within 200ms.” Not “average latency is 100ms.”
SLIs, SLOs, SLAs Revisited
These terms matter more now that you understand what metrics to track:
SLI (Service Level Indicator) = the actual metric you measure
success_rate_sli = successful_requests / total_requests = 99.5%
latency_sli = requests_under_200ms / total_requests = 98%
availability_sli = service_uptime / total_time = 99.95%
SLO (Service Level Objective) = the target you set
success_rate_slo: 99% (target: at least 99% of requests succeed)
latency_slo: 95% of requests under 200ms (target: p95 latency <= 200ms)
availability_slo: 99.9% (target: 43 minutes of downtime per month)
SLA (Service Level Agreement) = the contract promise with consequences
"If availability drops below 99.9%, customers get a 10% refund."
For your own services, you care about SLIs and SLOs. The SLA is what you commit to customers.
Here’s the critical part: keep a buffer between what you can achieve, the SLO you hold yourself to, and the SLA you sign. If you can consistently hit 99.95% availability, commit to 99.9% in the SLA and treat 99.95% as your internal SLO. Breaching the internal SLO then acts as an early warning well before you violate the contract. If you promise customers exactly what you can achieve, the first bad month puts you straight into SLA violation with no warning at all.
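To see the buffer in minutes, here’s a small sketch that converts availability targets into a monthly error budget, assuming a 30-day month:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes, assuming a 30-day month

def allowed_downtime_minutes(availability_target: float) -> float:
    """Error budget: the downtime you can spend and still meet the target."""
    return (1 - availability_target) * MINUTES_PER_MONTH

print(allowed_downtime_minutes(0.999))    # 43.2 min/month  (the 99.9% SLA)
print(allowed_downtime_minutes(0.9995))   # 21.6 min/month  (the stricter internal SLO)
# The ~22-minute gap is your early warning: burn through the internal budget
# and you still have room before the contractual one is gone.
```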
Methods: RED and USE
We introduced RED and USE earlier; here’s how to apply them when selecting KPIs:
RED Method (Request-Driven Services)
Rate: Requests per second, broken down by endpoint and method.
Prometheus query:
rate(http_requests_total[5m])
Errors: Error percentage by status code.
Prometheus query:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Duration: Latency percentiles.
Prometheus query:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
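If you want these numbers outside Grafana, say for a weekly report, the same expressions work against Prometheus’s HTTP query API. A rough sketch, assuming a Prometheus server reachable at localhost:9090:

```python
import requests

PROMETHEUS = "http://localhost:9090"  # assumed address of your Prometheus server

RED_QUERIES = {
    "rate": 'sum(rate(http_requests_total[5m]))',
    "errors": 'sum(rate(http_requests_total{status=~"5.."}[5m]))'
              ' / sum(rate(http_requests_total[5m]))',
    "duration_p95": 'histogram_quantile(0.95,'
                    ' rate(http_request_duration_seconds_bucket[5m]))',
}

for name, query in RED_QUERIES.items():
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        print(name, series["metric"], series["value"][1])
```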
USE Method (Infrastructure/Resources)
Utilization: What percent of the resource is in use?
cpu_usage: 65%
memory_usage: 72%
disk_usage: 54%
connection_pool_usage: 80%
Saturation: How much is queued waiting for the resource?
cpu_runqueue: 2.5 (processes waiting for CPU)
memory_swap_usage: 1% (sustained swapping means memory pressure and, eventually, disk thrashing)
disk_queue_depth: 5 (disk I/O operations waiting)
thread_pool_queue_depth: 150 (tasks waiting for threads)
Errors: How many errors from the resource?
network_packet_drops: 0.01%
disk_io_errors: 0
out_of_memory_kills: 0
connection_timeouts: 2
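For a single host, the utilization half of USE is easy to sample directly. A quick sketch with the psutil library; in practice node_exporter or an equivalent agent would ship these to Prometheus, this just shows where the numbers come from:

```python
import psutil

# Point-in-time utilization of the basic host resources.
print(f"cpu_usage: {psutil.cpu_percent(interval=1):.0f}%")
print(f"memory_usage: {psutil.virtual_memory().percent:.0f}%")
print(f"disk_usage: {psutil.disk_usage('/').percent:.0f}%")

# A rough saturation signal: 1-minute load average relative to CPU count.
load1, _, _ = psutil.getloadavg()
print(f"runqueue_per_cpu: {load1 / psutil.cpu_count():.2f}")
```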
Building Your Dashboard
A simple, effective dashboard for any service looks like this:
| Signal | Metric | Widget Type | Threshold |
|---|---|---|---|
| Latency | p50 latency | Line graph | N/A (for trending) |
| Latency | p95 latency | Line graph | Alert if over 300ms |
| Latency | p99 latency | Line graph | Alert if over 1000ms |
| Traffic | Requests/sec | Line graph | N/A (for trending) |
| Errors | Error rate | Line graph | Alert if over 1% |
| Errors | 5xx rate | Line graph | Alert if over 0.1% |
| Saturation | CPU usage | Gauge | Alert if over 80% |
| Saturation | Memory usage | Gauge | Alert if over 85% |
| Saturation | Connection pool | Gauge | Alert if over 90% |
That’s it. Nine metrics that tell you 95% of what you need to know.
Everything else is drill-down. If the error rate is high, you’ll look at logs. If latency is bad, you’ll trace requests. If CPU is high, you’ll profile processes. But the dashboard above catches 95% of real problems.
The Alert Fatigue Problem
Here’s a war story: a company installed 150 alerts. The on-call engineer received so many notifications that they stopped reading them. When a critical database failure happened, the alert was buried in 50 other false alarms. The engineer didn’t see it until customers started complaining.
More alerts don’t make you safer. Better alerts do.
Here’s the heuristic:
- Critical alerts: Only fire when you need to wake someone up. “Database is down.” “Error rate above 5%.” “Disk full.”
- Warning alerts: Page during business hours but don’t wake anyone. “P99 latency is 1.5x normal.” “CPU trending high.”
- Info alerts: Logged but no notification. “Deployment succeeded.” “Cache was flushed.”
Calculate the alert’s value as: (cost of missing the problem) - (cost of false alarms). If false alarms are more expensive than the problem they’re meant to catch, disable the alert.
Alert: "CPU above 70%"
Cost of missing the problem: sustained 80%+ CPU could degrade or crash the service
Estimated cost: ~$500 (a few customers impacted before anyone notices)
Cost of false alarms:
- Engineer context-switches (15 min)
- Adds to alert fatigue (harder to notice real problems)
Cost: 1 false alarm per week * 15 min * salary + lost focus
= $200/month
Decision: keep it. The ~$500 cost of missing outweighs the ~$200/month in false alarms.
But if false alarms become daily, the math flips; disable or retune the alert.
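The same arithmetic as a tiny reusable function; the dollar figures are the illustrative ones from the example, not a standard:

```python
def alert_is_worth_keeping(cost_of_missing: float,
                           false_alarms_per_month: float,
                           cost_per_false_alarm: float) -> bool:
    """Keep an alert only if missing the problem costs more than the noise it generates."""
    return cost_of_missing > false_alarms_per_month * cost_per_false_alarm

# The CPU alert above: ~$500 at stake, ~4 false alarms/month at ~$50 each.
print(alert_is_worth_keeping(500, 4, 50))    # True  -> keep it
# If false alarms become daily (~30/month), the math flips.
print(alert_is_worth_keeping(500, 30, 50))   # False -> disable or retune
```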
Business Metrics
The Four Golden Signals are technical metrics. But a real observability strategy includes business metrics:
checkout_completion_rate: 94% (% of customers who start checkout and complete it)
signup_rate: 1000/hour (new users per hour)
payment_success_rate: 99.2% (% of payment attempts that succeed)
refund_request_rate: 0.5% (% of completed orders that are refunded)
revenue_per_minute: $12,000 (business outcome, not technical metric)
When you tie technical metrics to business outcomes, you get better prioritization. A 1% error rate on a low-traffic internal endpoint barely matters. The same 1% error rate in checkout, at $12,000 of revenue per minute, costs you $120 for every minute it persists.
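That last claim is just arithmetic, assuming each failed checkout attempt is lost rather than retried:

```python
revenue_per_minute = 12_000     # from the business metrics above
checkout_error_rate = 0.01      # a 1% error rate in checkout

# Assumes every failed checkout is revenue lost, not recovered by a retry.
lost_per_minute = revenue_per_minute * checkout_error_rate
print(f"${lost_per_minute:,.0f}/minute")                 # $120/minute
print(f"${lost_per_minute * 60 * 24:,.0f}/day if it persists")  # $172,800/day
```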
Key Takeaways
- The Four Golden Signals (Latency, Traffic, Errors, Saturation) are the foundation of any monitoring strategy. Master these, and you’ll catch 95% of real problems.
- Percentiles matter more than averages. Always monitor p50, p95, and p99. The p99 tells you the truth about your system’s performance.
- Saturation is predictive. Rising saturation warns you before it causes failures.
- Fewer, better alerts beat more alerts. Alert fatigue is a real problem. Ruthlessly prune alerts that don’t matter.
- Tie technical metrics to business outcomes. A technical improvement is only valuable if it impacts the customer experience or revenue.
- Keep your internal SLO stricter than your SLA. Breaching the SLO then warns you before you breach the contract.
Practice Scenarios
Scenario 1: You manage a video streaming service. The Four Golden Signals dashboard you inherited is alerting on “average video startup time over 2 seconds.” You notice that 98% of users have startup times under 1 second, but 2% wait 10+ seconds (users on slow connections). Your pager went off last night due to a 3-second average (driven by one slow user). Should you keep this alert? What would you change?
Scenario 2: Your e-commerce platform’s checkout API is being monitored. Currently, you’re alerting if error rate goes over 2%. However, when you cross-reference with the billing system, you realize a 0.5% error rate in checkout costs the company $50,000/day in lost revenue. What SLO should you set? What alert should you configure?
Scenario 3: You’re launching a new feature (live notifications) in your app. Design a dashboard with business metrics and technical metrics that would help you assess: (1) is the feature working reliably? (2) is it providing value to users? (3) when should you scale or optimize it?
Next: We’ve identified what metrics matter and how to alert on them. But metrics show you that something is wrong. Distributed tracing shows you where it’s wrong. In the next section, we’ll explore how to instrument services for tracing and use traces to diagnose complex failures in microservices architectures.