System Design Fundamentals

Uptime & Availability Metrics

The CEO’s Question

Your CEO walks into the engineering standup and asks: “What’s our uptime?” You confidently answer: “99.9%.” She leans back and asks: “Is that good enough?”

You pause. The truth is, 99.9% sounds excellent until you translate it into real time. It’s roughly 8.7 hours of downtime per year. For many businesses, that might be acceptable. For others, it’s a disaster. But here’s what really matters: understanding what 99.9% actually means for your system, your users, and your business decisions.

The difference between 99.9% and 99.99% seems trivial: just 0.09 percentage points. Yet it represents the difference between 8.7 hours and 52 minutes of annual downtime. That’s the power of understanding availability metrics. This foundation determines how much you’ll invest in redundancy, how your architecture evolves, and ultimately, whether your users can trust you.

The “Nines” Framework

Let’s make availability concrete. The industry uses the “nines” shorthand to describe availability levels:

Availability | Common Name | Annual Downtime | Monthly Downtime | Weekly Downtime
99%          | Two nines   | 3.65 days       | 7.2 hours        | 1.68 hours
99.9%        | Three nines | 8.76 hours      | 43.2 minutes     | 10.08 minutes
99.99%       | Four nines  | 52.6 minutes    | 4.32 minutes     | 1.01 minutes
99.999%      | Five nines  | 5.26 minutes    | 25.9 seconds     | 6.05 seconds
99.9999%     | Six nines   | 31.5 seconds    | 2.59 seconds     | 0.6 seconds

Notice how downtime drops by a factor of ten with each additional nine. The relative reduction is the same at every step (90%), but the absolute time saved shrinks: going from two nines to three nines eliminates nearly 79 hours of annual downtime, while going from four nines to five nines eliminates only about 47 minutes, and each step costs dramatically more to engineer. This is why five nines (99.999%) is often considered the practical maximum for most systems. Six nines requires such extraordinary engineering that the cost becomes prohibitive.

Pro Tip: When discussing uptime with non-technical stakeholders, always convert percentages to actual downtime. “99.9% uptime” means nothing to a business person. “Less than 9 hours of annual downtime” makes the commitment real.
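
To make that conversion painless, here is a minimal Python sketch; the helper name allowed_downtime and the 365-day year, 30-day month, and 7-day week it assumes are my own choices for illustration:

```python
from datetime import timedelta

def allowed_downtime(availability_pct: float) -> dict[str, timedelta]:
    """Convert an availability percentage into allowed downtime per period."""
    unavailability = 1 - availability_pct / 100
    periods = {
        "year": timedelta(days=365),
        "month": timedelta(days=30),
        "week": timedelta(weeks=1),
    }
    return {name: span * unavailability for name, span in periods.items()}

for nines in (99.0, 99.9, 99.99, 99.999):
    budgets = allowed_downtime(nines)
    print(f"{nines}% -> {budgets['year']} per year, {budgets['month']} per month")
```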

Key Metrics: MTBF, MTTR, and MTTF

To truly understand availability, you need three related metrics:

  • Mean Time Between Failures (MTBF): The average time between when one failure occurs and when the next one happens. For a reliable system, MTBF is large (good). For an unreliable system, MTBF is small (bad). Measured in hours, days, or years.

  • Mean Time To Recovery (MTTR): How long it takes to restore service after a failure is detected. Includes detection time, diagnosis time, and fix/failover time. You want MTTR to be small. Measured in seconds or minutes.

  • Mean Time To Failure (MTTF): How long a new component will run before its first failure. Similar to MTBF but specifically for new systems. Used more in manufacturing; less common in software.

These metrics combine into the fundamental availability formula:

Availability = MTBF / (MTBF + MTTR)

Let’s work through an example. Imagine a database that experiences a failure every 720 hours (30 days) on average, and each failure takes 15 minutes to detect and fix:

Availability = 720 / (720 + 0.25) = 720 / 720.25 ≈ 0.999653 ≈ 99.97%

Notice that you can improve availability either by increasing MTBF (making failures less frequent through better engineering) or by decreasing MTTR (making recovery faster through better automation and monitoring).
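
As a quick sanity check on that arithmetic, here is a small Python sketch of the formula above; the function and variable names are illustrative:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# The database example: a failure every 720 hours, 15 minutes (0.25 h) to recover.
a = availability(mtbf_hours=720, mttr_hours=0.25)
print(f"{a:.6f} -> {a * 100:.2f}%")  # 0.999653 -> 99.97%
```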

Breaking Down MTTR: The Importance of Detection

Here’s a secret that operations teams know: failures don’t hurt you at the moment they happen; they hurt you for as long as they go undetected. A database instance can fail silently for 30 minutes while your monitoring is asleep. Those 30 minutes count against your downtime window, even though the failure itself took only 2 seconds.

This is why detection speed is critical. Detection involves:

  1. Synthetic monitoring — automated tests that continuously probe your system’s health (ping, check API endpoints, verify database connectivity)
  2. Real user monitoring (RUM) — tracking actual user experience through browser/app instrumentation
  3. Alerting thresholds — detecting anomalies (response time spike, error rate increase, CPU utilization jump)

The faster you detect a failure, the sooner recovery can begin and the smaller your MTTR becomes.
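
For illustration, here is a minimal sketch of a synthetic probe, assuming a hypothetical /health endpoint and using only the Python standard library; a real setup would hand this job to a monitoring platform rather than a bare loop:

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "https://example.com/health"  # hypothetical endpoint
PROBE_INTERVAL_SECONDS = 30

def probe_once(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint responds with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

while True:
    if not probe_once(HEALTH_URL):
        # In practice this would page someone or open an incident,
        # not just print to stdout.
        print("ALERT: health check failed")
    time.sleep(PROBE_INTERVAL_SECONDS)
```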

SLAs, SLOs, and SLIs: The Promise Framework

Building on Chapter 2’s concepts, let’s see how these connect to availability:

  • SLI (Service Level Indicator): The measurement—“the fraction of requests that complete in under 200ms”
  • SLO (Service Level Objective): The target—“We commit to maintaining 99.9% availability”
  • SLA (Service Level Agreement): The contract—“If we drop below 99.9%, customers get a service credit”

For availability specifically, the SLI is usually “percentage of requests served successfully” or “percentage of time the system is responsive.” The SLO is your internal commitment. The SLA is your business commitment.
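
As a concrete illustration, here is a small sketch (the sample latencies are made up) that computes a latency SLI of this kind from a list of request durations:

```python
def latency_sli(durations_ms: list[float], threshold_ms: float = 200.0) -> float:
    """Fraction of requests that completed within the latency threshold."""
    if not durations_ms:
        return 1.0  # no traffic in the window: treat it as meeting the objective
    within = sum(1 for d in durations_ms if d <= threshold_ms)
    return within / len(durations_ms)

# Hypothetical sample of request latencies in milliseconds.
sample = [120, 95, 210, 180, 300, 150, 140, 130, 90, 170]
print(f"SLI: {latency_sli(sample):.1%} of requests under 200 ms")  # SLI: 80.0% ...
```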

Did you know: Google, Amazon, and Microsoft all publish SLAs for their cloud services and issue service credits when they miss their availability commitments. Tiers and configurations that carry stronger availability guarantees generally cost more, because delivering that extra reliability requires more engineering investment.

Composite Availability: The Chain Is Only as Strong as Its Weakest Link

Most systems aren’t single components—they’re chains of dependencies. To calculate the availability of the entire system, you need to understand how to combine component availabilities.

Serial Dependencies (Everything Must Work)

If your architecture is Web Server → Application Server → Database, all three must be available for the system to function. Each failure point compounds:

System Availability = Availability(Web) × Availability(App) × Availability(DB)

Example: If each component has 99.9% availability (0.999):

System Availability = 0.999 × 0.999 × 0.999 = 0.997 ≈ 99.7%

Notice how the system availability (99.7%, about 26 hours of annual downtime) is worse than any individual component (99.9%). This is why long serial dependency chains are dangerous: every link you add drags the whole system down.
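
A minimal sketch of the serial calculation, with an assumed helper name:

```python
from math import prod

def serial_availability(components: list[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(components)

print(serial_availability([0.999, 0.999, 0.999]))  # ~0.997
```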

Parallel Dependencies (Redundancy Helps)

With redundancy, you can improve availability. If you have two databases and the system can function with either one available:

System Availability = 1 - (1 - Availability(DB1)) × (1 - Availability(DB2))

With two 99.9% available databases:

System Availability = 1 - (1 - 0.999) × (1 - 0.999) = 1 - 0.001 × 0.001 = 1 - 0.000001 ≈ 99.9999%

That’s a 1,000x reduction in expected downtime just by adding one redundant database (assuming the two databases fail independently). This is why every critical component should have a backup.
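
And a matching sketch for redundant replicas, again assuming their failures are independent:

```python
def parallel_availability(replicas: list[float]) -> float:
    """Availability when any single one of the independent replicas suffices."""
    probability_all_down = 1.0
    for a in replicas:
        probability_all_down *= (1 - a)
    return 1 - probability_all_down

print(parallel_availability([0.999, 0.999]))  # ~0.999999
```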

Error Budgets: Spend Your Failures Wisely

Google SRE introduced the error budget concept: if your SLO is 99.9% availability, you’re allowed a failure budget of 0.1%. Over a year, that’s roughly 8.7 hours you can “spend” on failures, maintenance, deployments, or experiments.

The key insight: You should use your error budget, not save it.

If you have a 0.1% error budget and by year’s end you’ve used only 0.05% of it, you over-engineered your system: you spent resources on reliability you didn’t need. On the flip side, if you burn through the budget by month six, you can’t take on any more risk for the rest of the year—no risky deployments, no infrastructure experiments, no pushing features.
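
As a rough illustration, here is a sketch of how a team might track how much of the budget is left; the 30-day window and the downtime figure are made-up assumptions:

```python
def error_budget_remaining(slo: float, window_minutes: float,
                           downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent in the current window."""
    budget_minutes = (1 - slo) * window_minutes
    return max(0.0, 1 - downtime_minutes / budget_minutes)

# 99.9% SLO over a 30-day window: the budget is about 43.2 minutes.
remaining = error_budget_remaining(slo=0.999,
                                   window_minutes=30 * 24 * 60,
                                   downtime_minutes=12)
print(f"{remaining:.0%} of the error budget left")  # ~72% ...
```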

Error budget policy answers: “What do we do when we burn through our budget?” Options include:

  1. Stop all risky changes until the budget resets
  2. Invest in reducing MTTR through better monitoring and automation
  3. Invest in increasing MTBF through deeper testing and quality processes
  4. Renegotiate the SLO if it’s genuinely unachievable

Measuring Availability in Practice

How do you actually measure whether you hit your SLO? Three approaches:

1. Request-based measurement: “What percentage of user requests succeeded?” Most common for web applications. A single successful HTTP request counts as one unit of availability.

2. Time-based measurement: “Was the system available during this time window?” Common for batch systems or infrastructure. The system either works or doesn’t for the entire minute/hour.

3. User-based measurement: “How many users could access the system?” Accounts for geographic issues—maybe users in Europe can’t reach your data center, while US users are fine.

Most teams use request-based because it’s easiest to instrument and aligns with how users experience your system.
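
Here is a minimal request-based sketch; the log record format is invented for illustration, and availability is taken as the fraction of requests that did not fail with a server error:

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    """One entry from a hypothetical access log."""
    status_code: int

def request_availability(records: list[RequestRecord]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not records:
        return 1.0
    ok = sum(1 for r in records if r.status_code < 500)
    return ok / len(records)

log = [RequestRecord(200)] * 997 + [RequestRecord(503)] * 3
print(f"{request_availability(log):.3%}")  # 99.700%
```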

Planned vs. Unplanned Downtime

Here’s a controversial question: if you bring down the system for a planned maintenance window, does that count toward your downtime? Different organizations answer differently:

  • Strict SLO: Planned or unplanned, it’s downtime. You must schedule maintenance during low-traffic windows or use blue-green deployments to avoid SLO impact.
  • Flexible SLO: Planned downtime doesn’t count if announced in advance. You get “maintenance windows” where availability targets don’t apply.

The strict approach is more honest but requires more sophisticated deployment techniques. The flexible approach is more practical but can hide poor engineering practices.
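
For illustration, a sketch (the function name and window accounting are assumptions) showing how the two policies score the same month differently:

```python
def time_based_availability(total_minutes: float, unplanned_down: float,
                            planned_down: float, exclude_planned: bool) -> float:
    """Time-based availability, optionally excluding announced maintenance windows."""
    downtime = unplanned_down + (0 if exclude_planned else planned_down)
    denominator = total_minutes - (planned_down if exclude_planned else 0)
    return 1 - downtime / denominator

month = 30 * 24 * 60  # minutes in a 30-day month
strict = time_based_availability(month, unplanned_down=10, planned_down=60,
                                 exclude_planned=False)
flexible = time_based_availability(month, unplanned_down=10, planned_down=60,
                                   exclude_planned=True)
print(f"strict: {strict:.4%}, flexible: {flexible:.4%}")
```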

Key Takeaways

  • The “nines” are concrete—99.9% means roughly 8.7 hours of annual downtime, not a vague concept
  • Availability = MTBF / (MTBF + MTTR)—improve it by failing less often or recovering faster
  • Composite availability multiplies component availabilities in serial chains (dragging the total down) and multiplies failure probabilities in parallel, redundant setups (pushing the total up)
  • Error budgets are meant to be spent, guiding where to invest your reliability resources
  • Measuring availability requires choosing a model (request-based, time-based, or user-based)
  • The gap between four nines and five nines is enormous in engineering cost; understand your business’s true needs before over-engineering

In the next section, we’ll look at how to identify and eliminate the single points of failure that prevent systems from reaching high availability targets.