Appendix A

Monitoring and Observability Tools

You can’t run a production system blind. Observability is how you understand what your system is actually doing—where it’s failing, where it’s slow, where resources are bottlenecked. The difference between monitoring and observability is subtle but important: monitoring tells you that something is wrong; observability lets you understand why. This reference covers the most widely used tools across the three pillars of observability—metrics, logs, and traces—plus the unified platforms that combine them.

The Three Pillars of Observability

Metrics: Quantitative measurements over time (CPU usage, request latency, error rates). Time-series data. Best for dashboards, alerting, and trend analysis.

Logs: Discrete events and detailed context (request ID, user ID, stack traces). Text-based. Best for debugging and understanding what happened.

Traces: End-to-end request flow across services (where did the request go, which services touched it, how long in each). Best for understanding distributed systems behavior.

Effective observability uses all three together. You see a spike in latency (metrics alert), drill into traces to see which service is slow, and examine logs from that service to understand why.

Metrics: Collection and Visualization

Prometheus

Prometheus is the de facto standard metrics tool in the Kubernetes ecosystem and increasingly everywhere else. It’s open-source, free, and battle-tested at massive scale.

How it works:

  • Pull-based collection: Prometheus scrapes HTTP endpoints (targets) on a schedule
  • Targets expose metrics in a simple text format (see the sketch after this list)
  • Metrics are stored as time-series data locally
  • PromQL is a powerful query language for metrics
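
As a concrete sketch of the pull model, here is a minimal Python service that exposes metrics with the official prometheus_client library; the metric names and port are illustrative assumptions, not prescribed by this chapter:

from prometheus_client import Counter, Histogram, start_http_server
import random
import time

# Illustrative metrics; the names here are assumptions for the example
REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

if __name__ == "__main__":
    # Expose /metrics in the Prometheus text format; Prometheus scrapes this endpoint
    start_http_server(8000)
    while True:
        with LATENCY.time():  # records how long the simulated work takes
            time.sleep(random.random() / 10)
        REQUESTS.inc()  # count one handled request

A PromQL query such as rate(app_requests_total[5m]) then turns that counter into a per-second request rate for dashboards or alerts.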

Key strengths:

  • Simple to deploy; single binary with no external dependencies
  • Native Kubernetes support (service discovery via labels)
  • Excellent alerting with Alertmanager (rule-based, flexible routing; see the rule sketch after this list)
  • Rich ecosystem: exporters for everything (Node Exporter for OS metrics, MySQL Exporter, etc.)
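
To make the alerting workflow concrete, here is a minimal sketch of a Prometheus rule file that Alertmanager would pick up and route; the metric name and thresholds are assumptions:

groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of requests return 5xx for 10 minutes (assumed metric name)
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for 10 minutes"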

Limitations:

  • Single-server storage (though remote storage options exist)
  • Not ideal for long-term retention (typically 15 days default)
  • Cardinality explosion risk with too many label combinations
  • No built-in authentication or multi-tenancy

When to use Prometheus:

  • Running Kubernetes
  • Building observability from scratch
  • Want open-source with strong community
  • Don’t mind managing your own infrastructure

Grafana

Grafana is the visualization layer. It doesn’t collect metrics itself; it queries datasources (Prometheus, InfluxDB, Loki, Datadog, etc.) and renders beautiful dashboards.

Key features:

  • Dashboards: Create custom visualizations with variables, templates, and drill-down capability
  • Alerting: Define alert rules on any metric query
  • Annotation: Mark events on graphs (deployments, incidents)
  • User management and authentication
  • Plugin ecosystem for custom panels and datasources
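
Datasources (and dashboards) can also be provisioned as code, which keeps Grafana reproducible across environments. A minimal datasource provisioning file might look like this sketch; the file path and Prometheus URL are assumptions:

# e.g. /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true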

When to use Grafana:

  • You’re using Prometheus or Loki (or any other datasource)
  • Need professional dashboarding without massive cost
  • Open-source preference

Pro tip: Use Grafana’s alert notification channels to send notifications to PagerDuty, Slack, email, webhooks, etc. For teams that already live in Grafana, this keeps dashboards and alert routing in one place instead of maintaining a separate Alertmanager configuration.

Datadog

Datadog is a commercial SaaS platform that covers metrics, logs, traces, APM, and infrastructure monitoring in one place. It’s expensive at scale but popular because it “just works” without operational overhead.

Key strengths:

  • All-in-one platform (no tool integration needed)
  • Excellent out-of-the-box dashboards and alerting
  • Strong APM for understanding application performance
  • Easy onboarding (send data via agents, quick integration; see the DogStatsD sketch after this list)
  • 24/7 support
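
As an example of how data gets in, a custom metric sent from Python through the locally running Datadog Agent (DogStatsD) might look like this sketch; the metric names and tags are assumptions:

from datadog import initialize, statsd

# The Datadog Agent's DogStatsD listener defaults to localhost:8125
initialize(statsd_host="localhost", statsd_port=8125)

# Count an event and record a timing, tagged for slicing in dashboards (illustrative names)
statsd.increment("checkout.completed", tags=["service:web", "env:prod"])
statsd.histogram("checkout.duration_seconds", 0.42, tags=["service:web", "env:prod"])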

Key tradeoffs:

  • Per-host or per-metric pricing scales quickly
  • Vendor lock-in (exporting your data is difficult)
  • Can get expensive for large-scale deployments (tens of millions of metrics)

When to use Datadog:

  • Team has budget and values simplicity
  • Need end-to-end platform without integration work
  • Willing to trade cost for operational convenience
  • Already using other Datadog products

AWS CloudWatch

If your workload lives entirely on AWS, CloudWatch provides native monitoring integrated with all AWS services.

Key features:

  • Automatic metrics from EC2, RDS, Lambda, S3, and 200+ AWS services
  • Log aggregation (CloudWatch Logs)
  • Custom metrics via API (see the sketch after this list)
  • Dashboards, alarms, and anomaly detection
  • No additional agent needed for AWS services
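
Custom metrics go in through the PutMetricData API. A minimal boto3 sketch; the namespace, metric name, and dimensions are assumptions:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Publish a single custom datapoint (names and values are illustrative)
cloudwatch.put_metric_data(
    Namespace="MyApp",
    MetricData=[
        {
            "MetricName": "CheckoutLatency",
            "Dimensions": [{"Name": "Environment", "Value": "prod"}],
            "Value": 0.42,
            "Unit": "Seconds",
        }
    ],
)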

Tradeoffs:

  • AWS-only (not multi-cloud)
  • Less sophisticated query language than PromQL
  • Pricing based on ingestion and queries (can be expensive)

When to use CloudWatch:

  • Entirely AWS-based
  • Want simplicity with no third-party tools
  • Need tight integration with AWS services
  • Don’t need observability for non-AWS infrastructure

Logging: Collection and Aggregation

ELK Stack (Elasticsearch, Logstash, Kibana)

The classic logging architecture, popular for over a decade.

Components:

  • Elasticsearch: Distributed search and analytics engine. Indexes logs for full-text search.
  • Logstash: Server-side data pipeline that collects, parses, and transforms logs, then forwards them to Elasticsearch.
  • Kibana: Web interface for searching, visualizing, and analyzing logs.

Strengths:

  • Mature, battle-tested, enormous community
  • Powerful full-text search (find any log containing “error message”)
  • Rich visualization and dashboard capabilities
  • Multi-tenancy and security features in paid version

Limitations:

  • Resource-heavy (Elasticsearch requires significant RAM and storage)
  • Complex to deploy and maintain
  • Indexing is resource-intensive on every node; long-term retention is expensive
  • Operational overhead increases with scale

When to use ELK:

  • Need powerful full-text search of logs
  • Have dedicated ops team to manage it
  • Long-term log retention required
  • Already invested in Elasticsearch

Pro tip: Use Beats (lightweight log shippers) instead of Logstash agents for lower resource consumption.
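
A minimal Filebeat configuration for that setup might look like the following sketch; the log paths and Elasticsearch host are assumptions:

# filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.log

output.elasticsearch:
  hosts: ["localhost:9200"]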

Grafana Loki

Loki is a newer logging solution designed for Kubernetes and Grafana. Instead of indexing entire logs, it indexes labels (like pod name, namespace) and stores log lines compressed.

Key characteristics:

  • Label-based indexing: You query by labels, not full-text search
  • Much cheaper than Elasticsearch: 10-50x less storage and compute
  • LogQL: Query language similar to PromQL (great if you know Prometheus)
  • Tight Grafana integration: Logs alongside metrics on same dashboard
  • Simple deployment: Runs in Kubernetes via Helm

Limitations:

  • No full-text index; you filter within label-selected streams, so you must know the right labels
  • Newer than ELK (smaller ecosystem)
  • Limited pattern analysis compared to Elasticsearch

When to use Loki:

  • Running Kubernetes
  • Already using Grafana for metrics
  • Cost is a primary concern
  • Logs are accessed by labels (service, environment, pod)

Example Loki query:

{job="api-server", level="error"} | json | status_code=500

Fluentd and Fluent Bit

Log collectors and shippers. They parse logs, filter, and forward to backends (Elasticsearch, S3, Splunk, Loki, etc.).

Fluentd:

  • Ruby-based, extensible with plugins
  • Complex log pipelines (multi-stage processing)
  • Higher resource consumption

Fluent Bit:

  • Written in C, minimal footprint
  • Designed for edge and containerized environments
  • Subset of Fluentd functionality

When to use:

  • Building custom log pipelines
  • Need lightweight log collection at scale
  • Collecting logs from multiple sources (containers, applications, infrastructure)

Tracing: End-to-End Request Flow

Traces show you the complete journey of a request through your system. Essential for understanding distributed systems behavior and diagnosing latency issues.

Jaeger

Jaeger is an open-source distributed tracing system (CNCF project) designed for microservices.

Key components:

  • Jaeger Agent: Lightweight process collecting spans from applications
  • Jaeger Collector: Receives traces and stores them
  • Jaeger Query: Web UI for searching and visualizing traces
  • Storage: Cassandra or Elasticsearch backend (or in-memory for dev)
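
For local development, all of these components ship in a single all-in-one image. A Docker Compose sketch follows; the ports shown are the common defaults, but treat them as assumptions for your Jaeger version:

version: '3'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"    # Jaeger Query web UI
      - "6831:6831/udp"  # agent port for spans sent over compact Thrift/UDP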

Strengths:

  • CNCF project (production-grade)
  • Native Kubernetes support
  • Multiple language SDKs
  • Excellent visualization of distributed flows
  • Sampling strategies to reduce storage

When to use Jaeger:

  • Running microservices on Kubernetes
  • Need open-source tracing with CNCF backing
  • Have infrastructure to run it (or use managed options)

Zipkin

One of the earliest distributed tracing systems. Similar to Jaeger but with longer history and simpler architecture.

When to use Zipkin:

  • Already using it
  • Prefer established projects
  • Smaller team (easier to operate than Jaeger)

OpenTelemetry

OpenTelemetry (OTel) is a vendor-neutral instrumentation framework created by merging OpenTracing and OpenCensus. It’s becoming the standard for application instrumentation across all languages.

What it provides:

  • Language SDKs for all major languages (Python, Java, Go, JavaScript, etc.)
  • Automatic instrumentation for popular frameworks (Flask, Spring, Express, etc.)
  • Export to any backend (Jaeger, Zipkin, Datadog, New Relic, AWS X-Ray, etc.)

Key advantage:

  • You instrument once, export to any backend. Switch backends without rewriting code.

When to use OpenTelemetry:

  • Starting a new project (future-proof choice)
  • Want flexibility on which backend to use
  • Need consistent instrumentation across multiple languages
  • Concerned about vendor lock-in

Simple Python example:

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure the global tracer provider to batch spans and ship them
# to a Jaeger agent running locally
jaeger_exporter = JaegerExporter(agent_host_name="localhost")
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)

# Wrap a unit of work in a span and attach attributes for later filtering
with tracer.start_as_current_span("my_operation") as span:
    span.set_attribute("user_id", 123)
    # Your code here
Pro tip: Use automatic instrumentation whenever possible. Most languages have OTel instrumentation packages for popular frameworks that require only importing and minimal configuration.

Unified Observability Platforms

Datadog

Already covered in metrics, but worth repeating: Datadog is the most complete all-in-one platform for metrics, logs, traces, APM, and infrastructure monitoring. Single pane of glass, excellent support, but expensive.

Grafana Cloud

Hosted version of Prometheus, Loki, and Tempo (tracing). Managed by Grafana Labs. Includes hosted Alertmanager and support.

When to use Grafana Cloud:

  • Like the open-source tools but don’t want to operate them
  • Want cost savings compared to Datadog
  • Already using Grafana and want managed infrastructure

New Relic

Full-stack observability platform. Free tier available. Competitive with Datadog but often cheaper at large scale.

Key features:

  • Metrics, logs, traces, APM all included
  • Free tier with good limits
  • Strong APM and infrastructure monitoring
  • Good support

Splunk

Enterprise-grade observability and security analytics. Most expensive option but most powerful for complex queries and compliance-heavy organizations.

When to use Splunk:

  • Enterprise organization with budget
  • Need advanced analytics and compliance reporting
  • Log volume in petabytes per day

Incident Management and On-Call

PagerDuty

Industry standard for on-call management and incident response.

Key features:

  • On-call scheduling and rotations
  • Alert routing and escalation
  • Incident tracking and postmortem workflows
  • Integrates with monitoring tools (Prometheus, Datadog, CloudWatch, etc.)

When to use:

  • Production system with SLAs
  • Multiple on-call engineers
  • Need structured incident response

Grafana OnCall

Open-source alternative to PagerDuty. Integrated with Grafana alerts.

When to use:

  • Like Grafana ecosystem
  • Cost-conscious
  • Simple on-call requirements

OpsGenie (Atlassian)

Alert and on-call management. Integrates well with Jira.

When to use:

  • Already deep in Atlassian ecosystem
  • Need Jira integration for incident tracking

Comparison Matrix

Tool                 | Pillar              | Open Source | SaaS        | Best For
Prometheus           | Metrics             | Yes         | No          | Kubernetes, self-managed metrics
Grafana              | Metrics/Logs/Traces | Yes         | Yes (Cloud) | Visualization, multi-datasource dashboards
Datadog              | All                 | No          | Yes         | All-in-one, simplicity, APM
AWS CloudWatch       | All                 | No          | Native      | AWS-only environments
ELK Stack            | Logs                | Yes         | No          | Full-text search, long-term retention
Loki                 | Logs                | Yes         | Yes (Cloud) | Kubernetes, cost-effective, label queries
Fluentd/Fluent Bit   | Log Collection      | Yes         | No          | Custom log pipelines, edge collection
Jaeger               | Traces              | Yes         | No          | Kubernetes, microservices, open-source
Zipkin               | Traces              | Yes         | No          | Established systems, simple setup
OpenTelemetry        | Instrumentation     | Yes         | No          | Backend-agnostic instrumentation
Splunk               | All                 | No          | Yes         | Enterprise, compliance, petabyte scale
PagerDuty            | Incident Management | No          | Yes         | Production on-call, SLA-driven

Cost-conscious open-source stack:

  • Prometheus for metrics
  • Grafana for visualization and dashboards
  • Loki for logs
  • Jaeger for traces
  • OpenTelemetry for instrumentation
  • Grafana OnCall for incident management
  • Total cost: free to license; you pay for compute, storage, and the engineering time to operate it

All-in-one commercial stack:

  • Datadog for everything
  • Pros: Single vendor, easy setup, excellent support
  • Cons: Expensive at scale

Hybrid approach:

  • Prometheus + Grafana for internal metrics and dashboards
  • Datadog for logs, traces, and APM (deeper analysis)
  • Use OpenTelemetry to instrument (allows future flexibility)

Getting Started

Prometheus + Grafana quickstart (Docker Compose):

version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana-storage:/var/lib/grafana

volumes:
  grafana-storage:
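
The compose file above mounts a local prometheus.yml. A minimal configuration that scrapes Prometheus itself looks like this; add your own applications under scrape_configs:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]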

Instrument a Python app with OpenTelemetry:

pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-jaeger
pip install opentelemetry-instrumentation-flask opentelemetry-instrumentation-requests

from flask import Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure the global tracer provider and export spans to a local Jaeger agent
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(JaegerExporter(agent_host_name="localhost"))
)

app = Flask(__name__)

# Auto-instrument incoming Flask requests and outgoing requests-library calls
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

Key Takeaways

  • Start with the three pillars: Metrics (what), logs (details), traces (where). All three together give you observability.
  • Prometheus + Grafana + Loki + Jaeger is a solid, cost-effective, open-source foundation for any system
  • Datadog is the most complete commercial platform; choose based on budget and operational capacity
  • Use OpenTelemetry for instrumentation: It decouples your code from backend choice
  • Instrument early: Adding observability later is painful; bake it in from the beginning
  • Set up alerting: Metrics are useless without alerting. Define meaningful thresholds and route to on-call engineers via PagerDuty or similar

These tools form the practical toolkit for understanding the systems you build. Monitor what matters, log enough to debug, trace critical flows, and alert on conditions that require human response. The systems described throughout this book require this observability foundation to run reliably in production.

See Appendix B for deeper resources on implementing observability.