Appendix A

Monitoring and Observability Tools

You can’t run a production system blind. Observability is how you understand what your system is actually doing—where it’s failing, where it’s slow, where resources are bottlenecked. The difference between monitoring and observability is subtle but important: monitoring tells you that something is wrong; observability lets you understand why. This reference covers the most widely used tools across the three pillars of observability—metrics, logs, and traces—plus the unified platforms that combine them.

The Three Pillars of Observability

Metrics: Quantitative measurements over time (CPU usage, request latency, error rates). Time-series data. Best for dashboards, alerting, and trend analysis.

Logs: Discrete events and detailed context (request ID, user ID, stack traces). Text-based. Best for debugging and understanding what happened.

Traces: End-to-end request flow across services (where did the request go, which services touched it, how long in each). Best for understanding distributed systems behavior.

Effective observability uses all three together. You see a spike in latency (metrics alert), drill into traces to see which service is slow, and examine logs from that service to understand why.

Metrics: Collection and Visualization

Prometheus

Prometheus is the de facto standard metrics tool in the Kubernetes ecosystem and increasingly everywhere else. It’s open-source, free, and battle-tested at massive scale.

How it works:

  • Pull-based collection: Prometheus scrapes HTTP endpoints (targets) on a schedule
  • Targets expose metrics in a simple text format (see the sketch after this list)
  • Metrics are stored as time-series data locally
  • PromQL is a powerful query language for metrics
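
As a concrete sketch of the pull model, here is a minimal Python service that exposes metrics with the official prometheus_client library; the metric names and port are illustrative assumptions, not prescribed by this chapter:

from prometheus_client import Counter, Histogram, start_http_server
import random
import time

# Illustrative metrics; the names here are assumptions for the example
REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

if __name__ == "__main__":
    # Expose /metrics in the Prometheus text format; Prometheus scrapes this endpoint
    start_http_server(8000)
    while True:
        with LATENCY.time():  # records how long the simulated work takes
            time.sleep(random.random() / 10)
        REQUESTS.inc()  # count one handled request

A PromQL query such as rate(app_requests_total[5m]) then turns that counter into a per-second request rate for dashboards or alerts.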

Key strengths:

  • Simple to deploy; single binary with no external dependencies
  • Native Kubernetes support (service discovery via labels)
  • Excellent alerting with Alertmanager (rule-based, flexible routing; see the rule sketch after this list)
  • Rich ecosystem: exporters for everything (Node Exporter for OS metrics, MySQL Exporter, etc.)
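
To make the alerting workflow concrete, here is a minimal sketch of a Prometheus rule file that Alertmanager would pick up and route; the metric name and thresholds are assumptions:

groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of requests return 5xx for 10 minutes (assumed metric name)
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for 10 minutes"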

Limitations:

  • Single-server storage (though remote storage options exist)
  • Not ideal for long-term retention (typically 15 days default)
  • Cardinality explosion risk with too many label combinations
  • No built-in authentication or multi-tenancy

When to use Prometheus:

  • Running Kubernetes
  • Building observability from scratch
  • Want open-source with strong community
  • Don’t mind managing your own infrastructure

Grafana

Grafana is the visualization layer. It doesn’t collect metrics itself; it queries datasources (Prometheus, InfluxDB, Loki, Datadog, etc.) and renders beautiful dashboards.

Key features:

  • Dashboards: Create custom visualizations with variables, templates, and drill-down capability
  • Alerting: Define alert rules on any metric query
  • Annotation: Mark events on graphs (deployments, incidents)
  • User management and authentication
  • Plugin ecosystem for custom panels and datasources
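
Datasources (and dashboards) can also be provisioned as code, which keeps Grafana reproducible across environments. A minimal datasource provisioning file might look like this sketch; the file path and Prometheus URL are assumptions:

# e.g. /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true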

When to use Grafana:

  • You’re using Prometheus or Loki (or any other datasource)
  • Need professional dashboarding without massive cost
  • Open-source preference

Pro tip: Use Grafana’s alert notification channels to send notifications to PagerDuty, Slack, email, webhooks, etc. For teams that already live in Grafana, this keeps dashboards and alert routing in one place instead of maintaining a separate Alertmanager configuration.

Datadog

Datadog is a commercial SaaS platform that covers metrics, logs, traces, APM, and infrastructure monitoring in one place. It’s expensive at scale but popular because it “just works” without operational overhead.

Key strengths:

  • All-in-one platform (no tool integration needed)
  • Excellent out-of-the-box dashboards and alerting
  • Strong APM for understanding application performance
  • Easy onboarding (send data via agents, quick integration; see the DogStatsD sketch after this list)
  • 24/7 support
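
As an example of how data gets in, a custom metric sent from Python through the locally running Datadog Agent (DogStatsD) might look like this sketch; the metric names and tags are assumptions:

from datadog import initialize, statsd

# The Datadog Agent's DogStatsD listener defaults to localhost:8125
initialize(statsd_host="localhost", statsd_port=8125)

# Count an event and record a timing, tagged for slicing in dashboards (illustrative names)
statsd.increment("checkout.completed", tags=["service:web", "env:prod"])
statsd.histogram("checkout.duration_seconds", 0.42, tags=["service:web", "env:prod"])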

Key tradeoffs:

  • Per-host or per-metric pricing scales quickly
  • Vendor lock-in (exporting your data is difficult)
  • Can get expensive for large-scale deployments (tens of millions of metrics)

When to use Datadog:

  • Team has budget and values simplicity
  • Need end-to-end platform without integration work
  • Willing to trade cost for operational convenience
  • Already using other Datadog products

AWS CloudWatch

If your workload lives entirely on AWS, CloudWatch provides native monitoring integrated with all AWS services.

Key features:

  • Automatic metrics from EC2, RDS, Lambda, S3, and 200+ AWS services
  • Log aggregation (CloudWatch Logs)
  • Custom metrics via API (see the sketch after this list)
  • Dashboards, alarms, and anomaly detection
  • No additional agent needed for AWS services
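
Custom metrics go in through the PutMetricData API. A minimal boto3 sketch; the namespace, metric name, and dimensions are assumptions:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Publish a single custom datapoint (names and values are illustrative)
cloudwatch.put_metric_data(
    Namespace="MyApp",
    MetricData=[
        {
            "MetricName": "CheckoutLatency",
            "Dimensions": [{"Name": "Environment", "Value": "prod"}],
            "Value": 0.42,
            "Unit": "Seconds",
        }
    ],
)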

Tradeoffs:

  • AWS-only (not multi-cloud)
  • Less sophisticated query language than PromQL
  • Pricing based on ingestion and queries (can be expensive)

When to use CloudWatch:

  • Entirely AWS-based
  • Want simplicity with no third-party tools
  • Need tight integration with AWS services
  • Don’t need observability for non-AWS infrastructure

Logging: Collection and Aggregation

ELK Stack (Elasticsearch, Logstash, Kibana)

The classic logging architecture, popular for over a decade.

Components:

  • Elasticsearch: Distributed search and analytics engine. Indexes logs for full-text search.
  • Logstash: Server-side data pipeline that collects, parses, and transforms logs, then forwards them to Elasticsearch.
  • Kibana: Web interface for searching, visualizing, and analyzing logs.

Strengths:

  • Mature, battle-tested, enormous community
  • Powerful full-text search (find any log containing “error message”)
  • Rich visualization and dashboard capabilities
  • Multi-tenancy and security features in paid version

Limitations:

  • Resource-heavy (Elasticsearch requires significant RAM and storage)
  • Complex to deploy and maintain
  • Indexing is resource-intensive on every node; long-term retention is expensive
  • Operational overhead increases with scale

When to use ELK:

  • Need powerful full-text search of logs
  • Have dedicated ops team to manage it
  • Long-term log retention required
  • Already invested in Elasticsearch

Pro tip: Use Beats (lightweight log shippers) instead of Logstash agents for lower resource consumption.
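
A minimal Filebeat configuration for that setup might look like the following sketch; the log paths and Elasticsearch host are assumptions:

# filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.log

output.elasticsearch:
  hosts: ["localhost:9200"]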

Grafana Loki

Loki is a newer logging solution designed for Kubernetes and Grafana. Instead of indexing entire logs, it indexes labels (like pod name, namespace) and stores log lines compressed.

Key characteristics:

  • Label-based indexing: You query by labels, not full-text search
  • Much cheaper than Elasticsearch: 10-50x less storage and compute
  • LogQL: Query language similar to PromQL (great if you know Prometheus)
  • Tight Grafana integration: Logs alongside metrics on same dashboard
  • Simple deployment: Runs in Kubernetes via Helm

Limitations:

  • No full-text index; you filter within label-selected streams, so you must know the right labels
  • Newer than ELK (smaller ecosystem)
  • Limited pattern analysis compared to Elasticsearch

When to use Loki:

  • Running Kubernetes
  • Already using Grafana for metrics
  • Cost is a primary concern
  • Logs are accessed by labels (service, environment, pod)

Example Loki query:

{job="api-server", level="error"} | json | status_code=500

Fluentd and Fluent Bit

Log collectors and shippers. They parse logs, filter, and forward to backends (Elasticsearch, S3, Splunk, Loki, etc.).

Fluentd:

  • Ruby-based, extensible with plugins
  • Complex log pipelines (multi-stage processing)
  • Higher resource consumption

Fluent Bit:

  • Written in C, minimal footprint
  • Designed for edge and containerized environments
  • Subset of Fluentd functionality

When to use:

  • Building custom log pipelines
  • Need lightweight log collection at scale
  • Collecting logs from multiple sources (containers, applications, infrastructure)

Tracing: End-to-End Request Flow

Traces show you the complete journey of a request through your system. Essential for understanding distributed systems behavior and diagnosing latency issues.

Jaeger

Jaeger is an open-source distributed tracing system (CNCF project) designed for microservices.

Key components:

  • Jaeger Agent: Lightweight process collecting spans from applications
  • Jaeger Collector: Receives traces and stores them
  • Jaeger Query: Web UI for searching and visualizing traces
  • Storage: Cassandra or Elasticsearch backend (or in-memory for dev)
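
For local development, all of these components ship in a single all-in-one image. A Docker Compose sketch follows; the ports shown are the common defaults, but treat them as assumptions for your Jaeger version:

version: '3'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"    # Jaeger Query web UI
      - "6831:6831/udp"  # agent port for spans sent over compact Thrift/UDP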

Strengths:

  • CNCF project (production-grade)
  • Native Kubernetes support
  • Multiple language SDKs
  • Excellent visualization of distributed flows
  • Sampling strategies to reduce storage

When to use Jaeger:

  • Running microservices on Kubernetes
  • Need open-source tracing with CNCF backing
  • Have infrastructure to run it (or use managed options)

Zipkin

One of the earliest distributed tracing systems. Similar to Jaeger but with longer history and simpler architecture.

When to use Zipkin:

  • Already using it
  • Prefer established projects
  • Smaller team (easier to operate than Jaeger)

OpenTelemetry

OpenTelemetry (OTel) is a vendor-neutral instrumentation framework created by merging OpenTracing and OpenCensus. It’s becoming the standard for application instrumentation across all languages.

What it provides:

  • Language SDKs for all major languages (Python, Java, Go, JavaScript, etc.)
  • Automatic instrumentation for popular frameworks (Flask, Spring, Express, etc.)
  • Export to any backend (Jaeger, Zipkin, Datadog, New Relic, AWS X-Ray, etc.)

Key advantage:

  • You instrument once, export to any backend. Switch backends without rewriting code.

When to use OpenTelemetry:

  • Starting a new project (future-proof choice)
  • Want flexibility on which backend to use
  • Need consistent instrumentation across multiple languages
  • Concerned about vendor lock-in

Simple Python example:

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure the global tracer provider to batch spans and ship them
# to a Jaeger agent running locally
jaeger_exporter = JaegerExporter(agent_host_name="localhost")
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)

# Wrap a unit of work in a span and attach attributes for later filtering
with tracer.start_as_current_span("my_operation") as span:
    span.set_attribute("user_id", 123)
    # Your code here
Pro tip: Use automatic instrumentation whenever possible. Most languages have OTel instrumentation packages for popular frameworks that require only importing and minimal configuration.

Unified Observability Platforms

Datadog

Already covered in metrics, but worth repeating: Datadog is the most complete all-in-one platform for metrics, logs, traces, APM, and infrastructure monitoring. Single pane of glass, excellent support, but expensive.

Grafana Cloud

Hosted version of Prometheus, Loki, and Tempo (tracing). Managed by Grafana Labs. Includes hosted Alertmanager and support.

When to use Grafana Cloud:

  • Like the open-source tools but don’t want to operate them
  • Want cost savings compared to Datadog
  • Already using Grafana and want managed infrastructure

New Relic

Full-stack observability platform. Free tier available. Competitive with Datadog but often cheaper at large scale.

Key features:

  • Metrics, logs, traces, APM all included
  • Free tier with good limits
  • Strong APM and infrastructure monitoring
  • Good support

Splunk

Enterprise-grade observability and security analytics. Most expensive option but most powerful for complex queries and compliance-heavy organizations.

When to use Splunk:

  • Enterprise organization with budget
  • Need advanced analytics and compliance reporting
  • Log volume in petabytes per day

Incident Management and On-Call

PagerDuty

Industry standard for on-call management and incident response.

Key features:

  • On-call scheduling and rotations
  • Alert routing and escalation
  • Incident tracking and postmortem workflows
  • Integrates with monitoring tools (Prometheus, Datadog, CloudWatch, etc.)

When to use:

  • Production system with SLAs
  • Multiple on-call engineers
  • Need structured incident response

Grafana OnCall

Open-source alternative to PagerDuty. Integrated with Grafana alerts.

When to use:

  • Like Grafana ecosystem
  • Cost-conscious
  • Simple on-call requirements

OpsGenie (Atlassian)

Alert and on-call management. Integrates well with Jira.

When to use:

  • Already deep in Atlassian ecosystem
  • Need Jira integration for incident tracking

Comparison Matrix

Tool                 | Pillar              | Open Source | SaaS        | Best For
Prometheus           | Metrics             | Yes         | No          | Kubernetes, self-managed metrics
Grafana              | Metrics/Logs/Traces | Yes         | Yes (Cloud) | Visualization, multi-datasource dashboards
Datadog              | All                 | No          | Yes         | All-in-one, simplicity, APM
AWS CloudWatch       | All                 | No          | Native      | AWS-only environments
ELK Stack            | Logs                | Yes         | No          | Full-text search, long-term retention
Loki                 | Logs                | Yes         | Yes (Cloud) | Kubernetes, cost-effective, label queries
Fluentd/Fluent Bit   | Log Collection      | Yes         | No          | Custom log pipelines, edge collection
Jaeger               | Traces              | Yes         | No          | Kubernetes, microservices, open-source
Zipkin               | Traces              | Yes         | No          | Established systems, simple setup
OpenTelemetry        | Instrumentation     | Yes         | No          | Backend-agnostic instrumentation
Splunk               | All                 | No          | Yes         | Enterprise, compliance, petabyte scale
PagerDuty            | Incident Management | No          | Yes         | Production on-call, SLA-driven

Cost-conscious open-source stack:

  • Prometheus for metrics
  • Grafana for visualization and dashboards
  • Loki for logs
  • Jaeger for traces
  • OpenTelemetry for instrumentation
  • Grafana OnCall for incident management
  • Total cost: free to license; you pay for compute, storage, and the engineering time to operate it

All-in-one commercial stack:

  • Datadog for everything
  • Pros: Single vendor, easy setup, excellent support
  • Cons: Expensive at scale

Hybrid approach:

  • Prometheus + Grafana for internal metrics and dashboards
  • Datadog for logs, traces, and APM (deeper analysis)
  • Use OpenTelemetry to instrument (allows future flexibility)

Getting Started

Prometheus + Grafana quickstart (Docker Compose):

version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana-storage:/var/lib/grafana

volumes:
  grafana-storage:
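
The compose file above mounts a local prometheus.yml. A minimal configuration that scrapes Prometheus itself looks like this; add your own applications under scrape_configs:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]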

Instrument a Python app with OpenTelemetry:

pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-jaeger
pip install opentelemetry-instrumentation-flask opentelemetry-instrumentation-requests

from flask import Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure the global tracer provider and export spans to a local Jaeger agent
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(JaegerExporter(agent_host_name="localhost"))
)

app = Flask(__name__)

# Auto-instrument incoming Flask requests and outgoing requests-library calls
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

Key Takeaways

  • Start with the three pillars: Metrics (what), logs (details), traces (where). All three together give you observability.
  • Prometheus + Grafana + Loki + Jaeger is a solid, cost-effective, open-source foundation for any system
  • Datadog is the most complete commercial platform; choose based on budget and operational capacity
  • Use OpenTelemetry for instrumentation: It decouples your code from backend choice
  • Instrument early: Adding observability later is painful; bake it in from the beginning
  • Set up alerting: Metrics are useless without alerting. Define meaningful thresholds and route to on-call engineers via PagerDuty or similar

These tools form the practical toolkit for understanding the systems you build. Monitor what matters, log enough to debug, trace critical flows, and alert on conditions that require human response. The systems described throughout this book require this observability foundation to run reliably in production.

See Appendix B for deeper resources on implementing observability.