System Design Fundamentals

Metrics, Logs & Traces

The 3 AM Page and the Three Pillars of Observability

It’s 3 AM. Your pager goes off. Users are reporting that the checkout page is loading in 8 seconds instead of the usual 2 seconds. Your system has 15 microservices running in Kubernetes. The API gateway? Looks fine. The order service? Also fine. The payment service? Seems normal. You’re standing in the dark, restarting services randomly, hoping something sticks.

This is the observability nightmare.

With proper observability — specifically, the three pillars of metrics, logs, and traces — you can pinpoint the exact service, the exact request, and the exact database query causing the issue in under five minutes. You’ll see that the product service is suddenly making 10x more calls to the database, even though the code hasn’t changed. You’ll trace a single request and watch it hang for 6 seconds waiting for a cache to respond.

This chapter builds on the availability and reliability patterns we discussed earlier. We can’t design resilient systems without being able to observe them. Observability is how you turn a pager at 3 AM into a quick fix instead of a crisis.

What Is Observability? It’s Not Just Monitoring

Let’s start with a critical distinction: monitoring is not observability.

Monitoring is checking if known metrics stay within expected ranges. You monitor CPU, you monitor error rates, you monitor request latency. Monitoring tells you when something is wrong.

Observability is the ability to understand your system’s internal state based solely on examining external outputs. It’s the property of a system that allows you to ask arbitrary questions about what happened without having to instrument the code beforehand.

Here’s the difference: with monitoring, you ask “Is request latency above the threshold?” With observability, you ask “Why did request latency spike to 8 seconds for this specific user in Asia?”

The Three Pillars: Metrics, Logs, and Traces

Think of these as complementary tools, each answering different questions:

Metrics answer “WHAT?” — they’re numeric measurements sampled over time. Counters (total requests ever), gauges (current CPU usage), histograms (distribution of request latencies). Metrics give you the big picture: “This service is handling 5,000 requests per second, 0.1% are erroring, and the p99 latency is 250ms.”

Logs answer “WHY?” — they’re discrete events with context, typically from application code. When something goes wrong, logs contain the details: “NullPointerException in PaymentProcessor.charge() at line 42. CustomerID: 12345, Amount: $99.99.” Logs are the evidence.

Traces answer “WHERE?” — they’re the journey of a single request across your entire system. A request enters your API gateway, bounces through 5 microservices, hits 3 databases, calls 2 external APIs, and you can see the exact timing of every hop. A trace shows you that the product service spent 4 seconds doing something, even though each individual service’s metrics looked fine.

Here’s the magic: together, they form a complete picture. Metrics alert you that something is wrong. Logs explain why it’s wrong. Traces show you exactly where in your architecture the problem lives.

A Hospital Analogy

Imagine you’re a doctor visiting a hospital patient:

Metrics are like the vital signs monitor: heart rate, blood pressure, oxygen saturation, temperature. Continuous numeric readings that give you the patient’s overall status at a glance.

Logs are like the nurse’s notes in the patient’s chart: “Patient reported chest pain at 2:15 PM. Complaining of shortness of breath. Administered oxygen. Heart rate climbed from 80 to 110 BPM.”

Traces are like the full medical history chain: the patient came through the ER, was triaged, had an EKG, went to radiology for a chest X-ray, saw the cardiologist, had an ultrasound, and is now in recovery. You can see the full journey and understand what led to the current state.

No single view is complete without the others. The vital signs monitor tells you the patient’s heart rate is elevated, but you don’t know why. The nurse’s notes tell you the patient complained of pain, but you don’t know if they had a heart attack or anxiety. The full history chain shows you the complete picture.

Metrics: The Continuous Signal

Let’s dig deeper into metrics. In production systems, you’re collecting measurements constantly. The challenge is doing it efficiently without drowning in data.

The RED Method for Services

The RED method, popularized by Tom Wilkie and closely related to Google SRE’s Four Golden Signals, focuses on three signals for request-driven services:

  • Rate: How many requests per second? Count successful and failed requests separately.
  • Errors: What percentage are failing? Track both HTTP errors (5xx, 4xx) and application errors (timeout, internal exceptions).
  • Duration: How long do they take? Not just the average — the 50th, 95th, and 99th percentile latency matter more than the mean.

Why percentiles? Imagine 198 requests complete in 10ms, but 2 requests take 5 seconds. The average is about 60ms, which sounds fine. But those two users are getting a miserable experience. The p99 latency of 5 seconds is the honest metric.
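
As a concrete illustration, here is a minimal sketch of RED-style instrumentation using the Python prometheus_client library. The metric names, label values, and the handle_checkout and process functions are illustrative, not from any particular codebase:

import time
from prometheus_client import Counter, Histogram

# Rate and Errors: one counter, labeled by outcome
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["service", "method", "status"],
)

# Duration: a histogram so p50/p95/p99 can be derived later
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
    ["service", "method"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def handle_checkout(request):          # hypothetical request handler
    start = time.monotonic()
    status = "200"
    try:
        return process(request)        # hypothetical business logic
    except Exception:
        status = "500"
        raise
    finally:
        REQUESTS.labels("order-api", request.method, status).inc()
        LATENCY.labels("order-api", request.method).observe(time.monotonic() - start)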

The USE Method for Resources

For infrastructure resources such as hosts, databases, and connection pools, Brendan Gregg’s USE method applies:

  • Utilization: What percentage of the resource is being used? CPU, memory, disk, network, connection pools.
  • Saturation: How many jobs are waiting for the resource? Queue depth, context switches, page faults.
  • Errors: How many errors is the resource producing? Packet drops, disk errors, timeout exceptions.
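
A minimal sketch of USE-style gauges with prometheus_client, assuming a hypothetical database connection pool object that exposes these counts:

from prometheus_client import Counter, Gauge

POOL_UTILIZATION = Gauge(
    "db_pool_connections_in_use", "Connections currently checked out", ["pool"]
)
POOL_SATURATION = Gauge(
    "db_pool_waiting_requests", "Requests queued waiting for a connection", ["pool"]
)
POOL_ERRORS = Counter(
    "db_pool_checkout_errors_total", "Failed connection checkouts", ["pool"]
)

def report_pool_stats(pool):           # hypothetical pool object with these attributes
    POOL_UTILIZATION.labels(pool.name).set(pool.in_use)
    POOL_SATURATION.labels(pool.name).set(pool.wait_queue_length)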

Metric Types

Counters only go up (or reset). Total HTTP requests served, total errors, total bytes sent. Useful for rates (requests_per_second = counter_diff / time_interval).

Gauges are point-in-time values. Current CPU usage, active database connections, queue size. Gauges can go up and down.

Histograms track distributions. Instead of storing every latency value, they bucket observations: “1,000 requests under 10ms, 5,000 requests 10-50ms, 2,000 requests 50-100ms.” Histograms let you calculate percentiles.

Time-Series Databases

Metrics are stored in time-series databases like Prometheus, InfluxDB, or Datadog. They’re optimized for write-heavy workloads (millions of data points per second) and efficient queries over time ranges.

# Example Prometheus metric format
http_requests_total{service="order-api", method="POST", status="200"} 15627
http_request_duration_seconds_bucket{service="order-api", le="0.1"} 1000
http_request_duration_seconds_bucket{service="order-api", le="0.5"} 12500
http_request_duration_seconds_bucket{service="order-api", le="1.0"} 15000
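
From series like these, the RED numbers fall out of standard PromQL queries. A minimal sketch, using the illustrative label values above:

# Requests per second over the last 5 minutes
rate(http_requests_total{service="order-api"}[5m])

# Error ratio: 5xx responses divided by all responses
sum(rate(http_requests_total{service="order-api", status=~"5.."}[5m]))
  / sum(rate(http_requests_total{service="order-api"}[5m]))

# p99 latency derived from histogram buckets
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="order-api"}[5m])))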

Logs: The Detective Work

Logs are where developers and operators do detective work. But raw logs are chaos. A single request to a web server can generate dozens of log lines across multiple services. Without structure, you’re grepping through terabytes of text.

Structured Logging

Modern systems use structured logging (JSON logs) where each log entry is a machine-readable record:

{
  "timestamp": "2024-02-13T03:15:42.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "user_id": "user_12345",
  "message": "Payment processing failed",
  "error": "ConnectTimeout",
  "duration_ms": 5000,
  "payment_processor": "stripe",
  "amount": 9999,
  "currency": "USD"
}

Notice the trace_id and span_id fields. These link this log entry to the trace system. This is crucial for correlation.
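
One way to get those fields into every log line is to pull them from the active OpenTelemetry span at log time. A minimal sketch using the standard logging module and a hand-rolled JSON formatter (in practice you would more likely reach for a library such as structlog or python-json-logger):

import json
import logging
from opentelemetry import trace

class JsonFormatter(logging.Formatter):
    def format(self, record):
        ctx = trace.get_current_span().get_span_context()
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "payment-service",      # illustrative service name
            "message": record.getMessage(),
        }
        if ctx.is_valid:
            # OpenTelemetry IDs are integers; render them as hex strings
            entry["trace_id"] = format(ctx.trace_id, "032x")
            entry["span_id"] = format(ctx.span_id, "016x")
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)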

Log Levels

  • DEBUG: Detailed information, typically used during development. In production, you usually disable these.
  • INFO: Confirmation that things are working as expected. “User 12345 logged in from IP 192.168.1.1.”
  • WARN: Something unexpected happened, but the system recovered. “Retry 3 of 5 for database connection. Backing off.”
  • ERROR: Something failed, but the system is still running. “Payment processor timeout. Transaction will be retried.”
  • FATAL: The system cannot continue. “Database connection pool exhausted. Shutting down.”

Log Storage and Querying

Systems like Elasticsearch, Grafana Loki, and Splunk store and index logs for querying. You need to be able to quickly answer questions like: “Show me all errors from the payment service in the last hour” or “Find all requests with trace_id 4bf92f3577b34da6a3ce929d0e0e4736.”
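
In Loki’s LogQL, for example, those two questions look roughly like this, assuming the JSON field names from the example above (the time range is set by the query window, not the expression). The first returns ERROR-level entries from the payment service; the second returns every line belonging to one trace:

{service="payment-service"} | json | level="ERROR"

{service="payment-service"} | json | trace_id="4bf92f3577b34da6a3ce929d0e0e4736"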

Traces: Following the Request

Distributed tracing is the superpower that brings everything together. In a microservices architecture, a single user request might touch 5-10 services. Without tracing, correlating events across all of them is manual and painful.

Spans and Trace Context

A span represents a single operation: “call the payment API,” “query the user database,” “render the checkout page.” Each span has:

  • Operation name: “PaymentService.charge”
  • Start time and duration: When it started and how long it took
  • Trace ID: Identifier for the entire request journey
  • Parent span ID: Which span triggered this one
  • Tags: Key-value metadata like user_id=12345, status=200
  • Logs: Timestamped events within the span

A trace is a collection of related spans forming a tree. The root span is the initial request, and child spans are operations triggered by the root.

Trace Context Propagation

The magic is that trace IDs must flow across service boundaries. When service A calls service B, it includes the trace ID in a request header, following the W3C Trace Context standard:

GET /api/orders/12345 HTTP/1.1
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor_data=value

Service B extracts this header, creates a span for “fetch order,” and includes the same trace ID. When B calls service C, B becomes the parent. The entire journey is linked.
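
With OpenTelemetry you rarely build these headers by hand; instrumentation libraries inject and extract them for you. Still, a minimal sketch of doing it explicitly, assuming a plain dict of outgoing headers, the requests library for the HTTP call, and a hypothetical order-service URL:

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Service A: inject the current trace context into the outgoing headers
def call_order_service(order_id):
    headers = {}
    inject(headers)  # adds the traceparent (and tracestate) headers
    return requests.get(f"http://order-service/api/orders/{order_id}", headers=headers)

# Service B: extract the context and continue the same trace
def handle_request(incoming_headers):
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("fetch_order", context=ctx):
        ...  # spans created here share the same trace ID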

OpenTelemetry: The Standard

OpenTelemetry is the open standard for instrumentation. Instead of vendor lock-in (where switching from Datadog to New Relic requires rewriting instrumentation), OpenTelemetry provides a vendor-neutral API.

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup: export spans over OTLP/gRPC to a local collector
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317")))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Instrument your code (payment_service, user_id, and amount come from your application)
with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("user_id", "12345")
    span.set_attribute("amount", 9999)
    try:
        result = payment_service.charge(user_id, amount)
        span.set_attribute("status", "success")
    except Exception as e:
        span.record_exception(e)
        span.set_attribute("status", "error")
        raise

The Observability Stack

Here’s how these pieces fit together in a modern observability platform:

graph LR
    A["Application<br/>(OpenTelemetry SDK)"] -->|OTLP| B["Collector"]
    B -->|Metrics| C["Prometheus"]
    B -->|Logs| D["Loki"]
    B -->|Traces| E["Tempo"]
    C -->|Query| F["Grafana"]
    D -->|Query| F
    E -->|Query| F
    F -->|Visualize| G["Dashboard"]

  1. Instrumentation: Your application uses OpenTelemetry SDK to emit metrics, logs, and traces.
  2. Collection: The OpenTelemetry Collector receives, processes, and batches this data.
  3. Storage: Metrics go to Prometheus, logs to Loki, traces to Tempo.
  4. Visualization: Grafana queries all three backends and shows unified dashboards.

This separation of concerns means you can replace any component. Swap Prometheus for InfluxDB, Loki for Elasticsearch, Tempo for Jaeger — the application code stays the same.
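
For reference, here is a skeleton of what the Collector configuration for such a pipeline might look like. Exporter names and endpoints vary by Collector distribution and version, so treat this as the shape of the config rather than something to drop in as-is:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      exporters: [loki]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]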

Correlation: Bringing It All Together

Here’s where the magic happens. You’re looking at a Grafana dashboard and notice that error rate spiked at 3:15 AM:

Metric Alert: error_rate > 5% at 2024-02-13T03:15:42Z

You click on the spike, which gives you the time window. You switch to Loki and query error logs from that same window (the time range comes from the query window, not the LogQL expression):

{service="payment-service"} | json | level="ERROR"

You get back 500 error logs. The first one contains:

{"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "error": "ConnectTimeout"}

You switch to Tempo and search for trace ID 4bf92f3577b34da6a3ce929d0e0e4736. You see a waterfall:

Root Span: user_checkout (7000ms)
  ├─ order_service.create (500ms)
  │   └─ database.insert (100ms)
  ├─ payment_service.charge (6000ms) <- SLOW!
  │   └─ stripe_api.charge (5900ms) <- TIMEOUT!
  └─ inventory_service.reserve (400ms)

The payment service called Stripe, and Stripe was timing out. Three systems (metrics, logs, traces) working together, and you’ve pinpointed the root cause in under 60 seconds.

The Trade-offs

Storage Costs: Logs are verbose. A typical application might generate 1KB of logs per request. At 10,000 requests/second, that’s 10MB per second, or roughly 860GB per day. You need aggressive retention policies and compression.

Cardinality Explosion: If you tag metrics with every user ID, you’ll create millions of unique metric series. Prometheus will grind to a halt. You need to be selective with high-cardinality dimensions.
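
To make the cardinality point concrete, here is a hypothetical before-and-after with prometheus_client. The first counter creates one time series per user; the second keeps cardinality bounded and pushes per-user detail into logs and traces instead:

from prometheus_client import Counter

# Anti-pattern: a label whose values are unbounded (one series per user)
errors_by_user = Counter(
    "checkout_errors_by_user_total", "Checkout errors per user", ["user_id"]
)

# Better: bounded label values; find individual users via logs and traces
errors_by_endpoint = Counter(
    "checkout_errors_total", "Checkout errors", ["endpoint", "error_type"]
)

errors_by_endpoint.labels(endpoint="/api/checkout", error_type="timeout").inc()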

Trace Sampling: You can’t afford to trace every request in a high-traffic system. Sampling strategies are crucial. Head-based sampling (decide at the start) is simple but might miss the slow requests. Tail-based sampling (decide after the trace is complete) captures interesting traces but requires buffering.
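
Head-based sampling is usually a one-line choice in the SDK. A sketch with the OpenTelemetry Python SDK, sampling roughly 10% of new traces while respecting the decision already made by an upstream service:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample ~10% of root traces; follow the parent's decision for propagated ones
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))

Tail-based sampling, by contrast, is typically implemented outside the SDK, for example in the OpenTelemetry Collector, because it needs to see the whole trace before deciding.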

Build vs Buy: You can assemble your observability stack from open-source tools (Prometheus, Loki, Tempo, Grafana) with little or no licensing cost but significant engineering and operational effort. Or you can use SaaS platforms (Datadog, New Relic, Elastic) at higher cost but with far less operational burden.

Key Takeaways

  • Observability is the ability to understand your system’s internal state from external outputs. It’s not just monitoring.
  • Metrics (WHAT), logs (WHY), and traces (WHERE) are complementary and most powerful when used together.
  • Use the RED method for services and the USE method for infrastructure.
  • Structured, correlated logging with trace IDs is essential.
  • OpenTelemetry is the standard for instrumentation and decouples from vendors.
  • The hardest part of observability isn’t the technology; it’s deciding which signals to collect and which ones deserve an alert.

Practice Scenarios

Scenario 1: You have a microservices system with an API gateway, user service, order service, and payment service. A customer reports that creating an order takes 30 seconds sometimes but is instant other times. Design an observability strategy to diagnose the issue. What metrics would you collect? How would you structure logs? How would tracing help?

Scenario 2: Your Prometheus database is consuming 500GB of storage per day because you’re tracking metrics with user ID as a dimension (high cardinality). Devise a sampling or aggregation strategy to reduce storage while still being able to debug user-specific issues when they report problems.

Scenario 3: You’re building a new notification service that sends emails, SMS, and push notifications. Design the instrumentation using OpenTelemetry. What spans would you create? What attributes would you attach to each span?


Next: In the next section, we’ll focus on which KPIs actually matter and how to design alerting systems that catch real problems without spamming your on-call engineers with false alarms.