System Design Fundamentals

SLOs and SLAs

When Promises Become Measurable

Picture this: Your team launches a new payment processing service. The product manager proudly tells investors, “Our system will be super reliable!” But what does “super reliable” mean? 99% uptime? 99.9%? Should the API respond to a payment request in under 100ms, or is 500ms acceptable? Six months later, customers are upset: different stakeholders were promised different things, and nobody actually wrote down what any of those promises were.

This is where Service Level Objectives (SLOs) and Service Level Agreements (SLAs) come in. They transform vague promises into measurable, enforceable commitments. They answer critical questions: What exactly are we promising? How do we measure success? What happens if we fail? And crucially for system design: what does this promise mean for our architecture, costs, and engineering effort?

SLOs and SLAs are the bridge between business requirements (from Chapter 2.1) and technical design decisions. They tell you exactly which metrics matter and how good they need to be. Without them, you’re guessing. With them, you can make principled architectural trade-offs.

The Hierarchy: SLIs, SLOs, and SLAs

Let’s start with three interconnected concepts that sound similar but mean very different things:

Service Level Indicators (SLIs) are what you measure. They’re the raw metrics about how your system behaves. Examples: How long did that request take? Did it succeed or fail? How many users can your system serve right now? An SLI is a number you can observe, record, and trend over time. The most common SLIs are latency (how fast), error rate (how reliable), and throughput (how much).

Service Level Objectives (SLOs) are what you target. They’re the goals you set for your SLIs. An SLO is a number you’re committing to internally. For instance: “95% of requests will complete in under 200ms” or “error rate will be below 0.1%.” SLOs are promises you make to your team and your product organization. They’re ambitious but realistic—achievable with good engineering but not trivial.

Service Level Agreements (SLAs) are what you contract. They’re the external promises you make to customers, often with financial penalties if you break them. An SLA typically has a looser target than the SLO (to give you breathing room) and includes consequences. For instance: “99.5% uptime per month, or we’ll issue a service credit.” SLAs are legally binding; SLOs are not.

Here’s the relationship:

graph LR
    A["SLI<br/>(Measure)"] -->|"Set targets for"| B["SLO<br/>(Internal Goal)"]
    B -->|"Communicate as"| C["SLA<br/>(External Promise)"]
    C -->|"Backed by"| D["Error Budget<br/>(Reliability Currency)"]
    style A fill:#e1f5ff
    style B fill:#fff3e0
    style C fill:#fce4ec
    style D fill:#f3e5f5

The key insight: Your SLO should be stricter than your SLA. If your SLA promises 99.5% uptime, your SLO might be 99.9%. This buffer—the difference between your internal target and your external promise—is your safety margin.
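
To make the hierarchy concrete, here’s a tiny illustrative sketch in Python (the metric and the numbers are invented for this example) expressing one availability metric at all three levels:

# Hypothetical example: one availability metric at all three levels.
availability = {
    "sli": "fraction of health checks passing over a rolling 30 days",  # what we measure
    "slo": 0.999,   # internal target: 99.9%
    "sla": 0.995,   # contractual promise: 99.5%, with service credits if missed
}

# The SLO must be stricter than the SLA, or there is no safety margin.
assert availability["slo"] > availability["sla"]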

Error Budgets: Spending Reliability Like Currency

Here’s a radical idea: you shouldn’t always be 100% reliable, and you shouldn’t try to be.

If you promise 99.9% uptime, you’re really promising a 99.9% success rate. Put another way, you have a budget of 0.1% downtime (or failures). Over a 30-day month (43,200 minutes), 0.1% is about 43 minutes of allowed downtime. That’s your error budget.

Think of it like money in a bank account. You can:

  • Spend it on shipping features faster (with less testing)
  • Spend it on deploying risky updates
  • Spend it on infrastructure experiments
  • Save it for emergencies (traffic spikes, security patches)

An error budget quantifies the relationship between reliability and velocity. It tells engineers: “You can take calculated risks here because we have budget,” but also “We can’t afford to take risks there because our budget is low.”

Here’s the math for a few common targets:

SLO     | Downtime per month | Downtime per year
99%     | 7.2 hours          | 3.6 days
99.5%   | 3.6 hours          | 1.8 days
99.9%   | 43 minutes         | 8.7 hours
99.99%  | 4.3 minutes        | 52 minutes
99.999% | 26 seconds         | 5 minutes

Each additional 9 is exponentially more expensive to achieve. This is why defining SLOs is a business and technical decision, not just an engineering one.
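
As a quick sanity check, here’s a small Python sketch (assuming a 30-day month and a 365-day year) that reproduces the table above:

MINUTES_PER_MONTH = 30 * 24 * 60    # 43,200 (assumes a 30-day month)
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600

def allowed_downtime_minutes(slo: float, period_minutes: int) -> float:
    """Error budget for a given availability SLO, in minutes of downtime."""
    return (1.0 - slo) * period_minutes

for slo in (0.99, 0.995, 0.999, 0.9999, 0.99999):
    monthly = allowed_downtime_minutes(slo, MINUTES_PER_MONTH)
    yearly = allowed_downtime_minutes(slo, MINUTES_PER_YEAR)
    print(f"{slo * 100:g}%: {monthly:.1f} min/month, {yearly / 60:.1f} h/year")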

Choosing Your SLIs: What Matters Most

Not all metrics are SLIs. An SLI should capture something users actually care about. Let’s categorize common ones:

Latency (Speed): How fast does your system respond? You probably care about percentiles, not averages.

  • p50 (median): 50% of requests are faster than this
  • p95: 95% of requests are faster than this
  • p99: 99% of requests are faster than this

Why percentiles? Because an average is misleading. If 99 requests take 10ms and 1 request takes 10 seconds, the average is about 110ms, which looks fine on a dashboard. But users don’t experience averages; someone’s request was the one that took 10 seconds. Percentile targets prevent outliers from hiding in averages.
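
Here’s a toy simulation in Python showing the same effect: a 2% slow tail leaves the average and median looking healthy, while p99 tells the real story. (Real monitoring systems compute percentiles with streaming approximations, but the idea is identical.)

import random

random.seed(0)
# Simulate 1,000 requests: most take ~10 ms, but 2% hit a slow path near 1,000 ms.
latencies_ms = [
    random.gauss(10, 2) if random.random() > 0.02 else random.gauss(1_000, 100)
    for _ in range(1_000)
]

def percentile(samples, p):
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

print(f"average: {sum(latencies_ms) / len(latencies_ms):.1f} ms")  # ~30 ms: looks healthy
print(f"p50: {percentile(latencies_ms, 50):.1f} ms")               # ~10 ms: the typical user
print(f"p95: {percentile(latencies_ms, 95):.1f} ms")               # still fast
print(f"p99: {percentile(latencies_ms, 99):.1f} ms")               # ~1,000 ms: the tail only shows up here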

Error Rate: What percentage of requests fail or return incorrect results? Includes HTTP 5xx errors, timeouts, and incorrect responses. A typical SLI: “less than 0.1% of requests return a 5xx error.”

Availability: Is your service up and reachable? Often measured as uptime percentage. A typical SLI: “99.9% of health checks pass.”

Throughput: How many requests per second can you handle? Less commonly part of an SLO (since it depends on your infrastructure), but important for capacity planning.

Pro tip: Start with latency and error rate. These are the SLIs users notice first. Add others (availability, throughput) once your monitoring is solid.
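
Error rate and availability usually come straight from simple counters; here’s a minimal sketch (all numbers are made up):

# Counters accumulated over a measurement window (say, 30 days).
total_requests = 1_203_441
failed_requests = 842            # 5xx responses, timeouts, incorrect results
health_checks_total = 43_200     # one health check per minute for 30 days
health_checks_passed = 43_174

error_rate = failed_requests / total_requests
availability = health_checks_passed / health_checks_total

print(f"error rate: {error_rate:.4%}")      # ~0.07%, compared against the error-rate SLO
print(f"availability: {availability:.3%}")  # ~99.94%, compared against the availability SLO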

Setting Realistic SLOs

Here’s where many teams go wrong: they pick numbers that sound good without understanding the cost.

To set a realistic SLO:

  1. Measure current behavior. Run your system for a month and record actual latency percentiles and error rates. Your SLO should be slightly better than what you’re currently achieving—challenging but not impossible.

  2. Understand user expectations. Talk to customers. What latency is “fast enough”? What uptime do they require? A payment processor needs higher reliability than a weather app.

  3. Calculate the cost. Each additional 9 in availability requires redundancy, monitoring, and faster incident response. Getting to 99.9% might cost 2x as much as 99%. Getting to 99.99% might cost 10x. Is the business willing to pay?

  4. Set SLO > SLA. If you promise customers 99.5%, set your internal SLO to 99.8% or 99.9%. This buffer lets you catch problems before they hit customers (see the sketch after this list).

  5. Review and adjust. After three months, check whether your SLOs are realistic. If you’re consistently exceeding them, tighten them. If you’re constantly missing them, either improve the system or relax the targets.
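
Putting steps 1 and 4 together, here’s the kind of sanity check you might script when proposing targets (the numbers and the “at least half your current downtime” rule of thumb are assumptions for illustration):

# Hypothetical measured behavior over the last quarter.
measured_availability = 0.9989        # 99.89%
measured_p99_latency_ms = 520

# Proposed targets: slightly better than current behavior, looser for the SLA.
slo_availability, sla_availability = 0.999, 0.995
slo_p99_ms, sla_p99_ms = 500, 1_000

# Step 4: the internal SLO must be stricter than the external SLA.
assert slo_availability > sla_availability
assert slo_p99_ms < sla_p99_ms

# Step 1: the SLO should sit near current behavior -- ambitious, not fantasy.
assert (1 - slo_availability) >= (1 - measured_availability) / 2, "SLO far beyond current reality"
assert slo_p99_ms >= measured_p99_latency_ms * 0.8, "latency SLO far tighter than current p99"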

Monitoring and Dashboards

Measuring SLIs requires two things: (1) instrumentation in your code, and (2) a way to aggregate and alert on those metrics.

Here’s a minimal example in Python (process_payment and metrics stand in for your own handler and metrics client):

import time

def handle_payment(request):
    """Instrument a request: record latency and success/error SLIs."""
    start = time.perf_counter()              # monotonic clock for measuring durations
    try:
        response = process_payment(request)  # your business logic
        metrics.record_success()             # success counter feeds the error-rate SLI
        return response
    except Exception:
        metrics.record_error()               # failure counter feeds the error-rate SLI
        raise
    finally:
        # Record latency for successes and failures alike, so percentiles reflect reality.
        latency_ms = (time.perf_counter() - start) * 1000
        metrics.record_latency(latency_ms)

The metrics are then aggregated by a monitoring system (Prometheus, Datadog, CloudWatch, etc.) and exposed on a dashboard. A typical SLO dashboard shows:

  • Latency percentiles (p50, p95, p99) trending over time
  • Error rate as a percentage
  • Availability uptime percentage
  • Error budget burn rate (how fast you’re using your budget; see the sketch after the diagram below)
  • Alerts when you’re approaching SLO violations

Here’s a Mermaid diagram of how this works:

graph LR
    A["User Request"] -->|"Timed & Monitored"| B["Application Code"]
    B -->|"Records metrics"| C["Metrics Collector"]
    C -->|"Aggregates"| D["Time Series DB<br/>Prometheus/CloudWatch"]
    D -->|"Visualizes"| E["Dashboard"]
    E -->|"Triggers"| F["Alerts"]
    style C fill:#c8e6c9
    style D fill:#bbdefb
    style E fill:#fff9c4
    style F fill:#ffccbc
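
Here’s a minimal sketch of the burn-rate arithmetic behind that dashboard item (your monitoring system does this continuously; the numbers are illustrative):

# 99.9% availability SLO over a 30-day window -> 43.2 minutes of error budget.
MINUTES_PER_MONTH = 30 * 24 * 60
slo = 0.999
budget_minutes = (1 - slo) * MINUTES_PER_MONTH   # 43.2

# Illustrative month so far: 10 days elapsed, 20 minutes of downtime spent.
days_elapsed = 10
downtime_minutes = 20

burn_rate = (downtime_minutes / budget_minutes) / (days_elapsed / 30)
print(f"burn rate: {burn_rate:.2f}x")  # >1.0 means the budget runs out before month end

if burn_rate > 1.0:
    print("slow down: freeze risky deploys and prioritize reliability work")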

Did you know? Google Cloud publishes SLAs for its services, with availability targets that vary by product and tier; most sit somewhere between 99.9% and 99.99%. These numbers aren’t random; they’re chosen based on each product’s criticality and the cost of achieving the target.

Real-World Example: A Payment Processing API

Let’s define SLOs for a payment processor (think Stripe, Square, or PayPal):

Our SLIs:

  • Latency p99: Time to process a payment request end-to-end
  • Error rate: Percentage of payment requests that fail
  • Availability: Percentage of time the API endpoint is reachable

Our SLOs:

  • p99 latency ≤ 500ms (customers expect fast confirmation)
  • Error rate ≤ 0.05% (failures are costly; they cause customer complaints and potential refunds)
  • Availability ≥ 99.95% (payments are critical; downtime directly loses revenue)

Our SLAs (what we promise customers):

  • p99 latency ≤ 1 second (more generous than SLO)
  • Error rate ≤ 0.1% (more generous)
  • Availability ≥ 99.9% (more generous), with 10% service credit per 0.1% below target

This hierarchy means:

  • If we hit our SLOs, we’re well protected against SLA breaches
  • Our internal error budget for availability is 100% - 99.95% = 0.05%, or about 21.6 minutes per month; the gap between the 99.95% SLO and the 99.9% SLA gives us roughly another 21.6 minutes of buffer before we breach the contract (see the check after this list)
  • If latency drifts but stays under 500ms, customers are happy (even if our SLA allows 1s)
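
A quick check of those numbers, assuming a 30-day month:

MINUTES_PER_MONTH = 30 * 24 * 60                 # 43,200

slo_budget = (1 - 0.9995) * MINUTES_PER_MONTH    # internal error budget
sla_budget = (1 - 0.999) * MINUTES_PER_MONTH     # total downtime before the SLA is breached

print(f"SLO error budget: {slo_budget:.1f} min/month")                             # ~21.6
print(f"extra buffer before SLA breach: {sla_budget - slo_budget:.1f} min/month")  # another ~21.6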

Trade-offs and Tensions

High SLOs are expensive. To achieve 99.99% uptime, you need:

  • Geographic redundancy (multiple data centers)
  • Automated failover (so incidents don’t require human response)
  • Extensive monitoring and alerting
  • Careful deployment practices
  • On-call engineers 24/7

Achieving 99% uptime is much simpler: a single data center, manual failover, and basic monitoring.

SLOs sometimes conflict with other goals. Your security team wants aggressive rate-limiting; your SLO requires low error rates. Your team wants to experiment with new algorithms; your SLO penalizes latency regressions. Your business wants cheap infrastructure; your SLO requires redundancy. You’ll need to negotiate these trade-offs.

Common mistakes:

  • Setting SLOs without measuring first. You promise 99.99% when you’re currently at 98%.
  • Not communicating SLOs to engineers. They optimize for features, not reliability, if they don’t know what SLO they need to hit.
  • SLA that’s stricter than SLO. Now you have no safety margin and will constantly breach contracts.
  • Ignoring error budget burn. If you’re burning your monthly budget in the first week, you’ll have a bad month.
  • Forgetting about dependent systems. If your API depends on a database with 99.95% availability, your maximum possible availability is 99.95%, no matter how hard you try; serial dependencies multiply (see the sketch after this list).
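
A sketch of why that last point holds: for services a request must pass through in series, availabilities multiply (assuming independent failures), so the chain can never beat its weakest link. The numbers below are illustrative:

# Illustrative availabilities of services a request passes through in series.
dependencies = {
    "load balancer": 0.9999,
    "api service":   0.9995,
    "database":      0.9995,
}

composed = 1.0
for availability in dependencies.values():
    composed *= availability   # serial chain, failures assumed independent

print(f"best-case end-to-end availability: {composed:.4%}")  # ~99.89%, below every individual target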

Key Takeaways

  • SLIs are measurements (latency p99, error rate), SLOs are targets (< 500ms, < 0.1%), and SLAs are contracts (with financial penalties).
  • Error budgets quantify reliability as currency. A 99.9% SLO gives you ~43 minutes of downtime per month to spend on deployments, experiments, or incidents.
  • Percentile latencies matter more than averages. p99 latency tells you about your worst users; p50 might hide the problem.
  • Set SLO > SLA to maintain a safety buffer. If you promise customers 99.5%, target 99.9% internally.
  • SLOs drive architectural decisions. Higher reliability targets require redundancy, which adds cost and complexity.
  • Measure first, promise second. Define SLOs based on current behavior and business needs, not on what sounds impressive.

Practice Scenarios

Scenario 1: The Startup Problem

You’re building a notification service. Your CEO says, “We need 99.99% uptime because notifications are critical.” Your current infrastructure is a single server running on AWS. What do you do?

  • Should you promise 99.99%? Why or why not?
  • What’s the minimum infrastructure needed to hit 99.99%?
  • What SLO would you actually set, and why?

Scenario 2: The Error Budget Decision

Your search service has a 99.9% SLO (error rate < 0.1%). This month, you’re at 0.08% error rate, and you have budget remaining. Your team wants to:

  • Deploy a new ranking algorithm (risky, might increase errors by 0.02%)
  • Upgrade the database (risky, might cause brief downtime)
  • Add more countries to your service (stable, no risk)

Which can you do, and in what order?

Scenario 3: The Dependency Chain

Your payment API depends on a credit card processor (98% availability), a fraud detection service (99.5% availability), and your own database (99.95% availability). What’s your maximum possible availability, and why?

What’s Next: Capacity Estimation

Now that you know what you need to promise (SLOs) and how to measure it (SLIs), the next question is: how much infrastructure do you need? That’s capacity estimation—calculating compute, storage, and bandwidth to meet your SLOs at expected scale. We’ll see how error budgets, SLOs, and scale all work together to determine your architecture.