System Design Fundamentals

Distributed Tracing

The Mystery of the Missing 8 Seconds

A customer calls support: “Loading my order history is taking 8 seconds. It’s never been this slow before.”

You jump into your observability system. You check the metrics. The API gateway? Average latency 50ms, p99 100ms. The order service? Average 80ms, p99 150ms. The product service? Average 120ms, p99 300ms. The inventory service? Average 90ms, p99 200ms. All four services are running fine.

But somehow, the user’s request is taking 8 seconds.

This is the microservices latency puzzle. Each service looks fine individually, but the request is mysteriously slow. Is it network latency between services? Is one service calling another service multiple times? Is there contention in a shared database?

Without distributed tracing, you’re left guessing. You might restart services randomly, hoping something sticks. Or you might add more servers, throwing money at the problem without solving it.

With distributed tracing, you’d see a waterfall diagram:

User Request: /api/user/12345/orders (8000ms total)
├─ API Gateway (10ms)
├─ Order Service: list_orders (3000ms)
│  └─ Database: SELECT * FROM orders WHERE user_id = ? (2900ms)
├─ Product Service: enrich_products (4800ms) ← starts while Order Service is running
│  ├─ Product Service: fetch_product (100ms) × 50 calls  (1000ms total)
│  ├─ Inventory Service: check_stock (100ms) × 50 calls  (3000ms total)
│  └─ Internal loop waiting (800ms) ← mystery solved!
└─ Response assembly (200ms)

You’d immediately see that the product service is making 50 separate calls to the inventory service, one per product. That’s the bottleneck. Not the order database. Not the API gateway. The product service has a classic N+1 call pattern.

Distributed tracing is the detective tool that finds these hidden bottlenecks. It’s observability’s most powerful weapon.

What Is a Trace?

A trace is a complete record of a single request’s journey through your system. It answers: “This user clicked a button at 3:15 AM. What happened between the click and when they saw the result?”

Spans: The Building Blocks

A span represents a single operation: one unit of work. It has:

  • Operation name: “list_orders”, “fetch_product”, “query_database”
  • Trace ID: Unique identifier linking all spans for this request
  • Span ID: Unique identifier for this particular span
  • Parent Span ID: Which span triggered this one (forms the tree)
  • Start time: When the operation began
  • Duration: How long the operation took
  • Attributes (tags): Metadata like user_id=12345, product_count=50, status=success
  • Events: Timestamped log entries within the span (e.g., “Cache miss at 1ms”)
  • Status: success, failed, or unknown

A simple span might look like:

Span: "database_query"
├─ Trace ID: 4bf92f3577b34da6a3ce929d0e0e4736
├─ Span ID: 00f067aa0ba902b7
├─ Parent Span ID: b7ad6b7169203331
├─ Start: 2024-02-13T03:15:42.100Z
├─ Duration: 2900ms
├─ Status: success
├─ Attributes:
│  ├─ db.system: postgresql
│  ├─ db.statement: SELECT * FROM orders WHERE user_id = $1
│  ├─ db.rows_returned: 45
│  └─ user_id: 12345
└─ Events:
   ├─ 2024-02-13T03:15:42.105Z - "Connection acquired"
   └─ 2024-02-13T03:15:43.000Z - "Results fetched"

Building the Trace Tree

A trace is a tree of spans. The root span is the entry point (usually the API request), and child spans represent operations triggered by the parent.

Here’s how it works:

  1. User hits /api/user/12345/orders
  2. API Gateway creates a root span: “http_request”
  3. API Gateway calls Order Service, passing the Trace ID in the HTTP header
  4. Order Service creates a child span: “list_orders”, with the parent span ID pointing to the root
  5. Order Service queries the database, creating another child: “database_query”
  6. Meanwhile, the API Gateway calls the Product Service in parallel, creating another child of the root: “enrich_products”
  7. Product Service calls Inventory Service 50 times, creating 50 child spans

The result is a tree showing the complete call graph:

http_request (root, 8000ms)
├─ list_orders (3000ms)
│  └─ database_query (2900ms)
├─ enrich_products (4800ms)
│  ├─ fetch_product (100ms) × 50
│  ├─ check_stock (100ms) × 50
│  └─ internal_processing (800ms)
└─ response_assembly (200ms)

The trace immediately shows you:

  • The product service is the bottleneck (4800ms out of 8000ms)
  • It’s making 50 separate calls to inventory; at 100ms each that would be 5000ms fully serial, so the 3000ms observed implies only limited parallelism
  • There’s 800ms of unexplained internal processing

This level of detail is impossible to get from metrics alone.

Trace Context Propagation: The Magic of Linking

For a trace to work across service boundaries, the trace ID must flow between services. When Service A calls Service B, it includes the trace ID in the request metadata.

W3C Trace Context Standard

The W3C Trace Context standard is the modern way to propagate trace context across HTTP, gRPC, messaging, and other protocols. It uses HTTP headers:

GET /api/orders/12345 HTTP/1.1
Host: inventory-service.example.com
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor_data=some_value

Breaking down the traceparent header:

  • 00: Version (currently 0)
  • 4bf92f3577b34da6a3ce929d0e0e4736: Trace ID (128-bit, hex)
  • 00f067aa0ba902b7: Parent Span ID (64-bit, hex) — the span in the calling service
  • 01: Trace flags (01 = sampled, 00 = not sampled)
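
As a quick illustration (a sketch, not a spec-complete parser), splitting the header recovers those four fields:

def parse_traceparent(header: str) -> dict:
    # traceparent = version "-" trace-id "-" parent-id "-" trace-flags
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,                  # "00"
        "trace_id": trace_id,                # 128-bit hex, shared by every span in the trace
        "parent_span_id": parent_span_id,    # 64-bit hex, the caller's span
        "sampled": flags == "01",
    }

parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
# -> {'version': '00', 'trace_id': '4bf92f...', 'parent_span_id': '00f067aa0ba902b7', 'sampled': True}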

When the inventory service receives this request, it:

  1. Extracts the trace ID and parent span ID
  2. Creates a new span with the same trace ID but its own span ID
  3. Sets the parent span ID to the value it received
  4. If it calls another service, it includes its own span ID as the parent in the next request

For example:

Order Service →(traceparent: trace=ABC, parent=SPAN1)→ Product Service

Product Service creates span SPAN2 with parent=SPAN1
Product Service →(traceparent: trace=ABC, parent=SPAN2)→ Inventory Service

Inventory Service creates span SPAN3 with parent=SPAN2

The result is a linked chain of spans forming the complete request tree.
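
With OpenTelemetry (introduced below), the receiving side of that handshake looks roughly like this sketch. The handler and attribute names are illustrative; extract, inject, and start_as_current_span are the relevant OpenTelemetry Python APIs:

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def handle_check_stock(incoming_headers, product_id):
    # Rebuild the caller's context from the incoming traceparent header
    parent_context = extract(incoming_headers)

    # Same trace ID, new span ID, parent = the caller's span
    with tracer.start_as_current_span("check_stock", context=parent_context) as span:
        span.set_attribute("product.id", product_id)

        # If we call a downstream service, this span becomes the next parent
        outgoing_headers = {}
        inject(outgoing_headers)  # writes a traceparent naming the current span as parent
        # ... attach outgoing_headers to the downstream request ...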

Context Propagation in Different Protocols

HTTP Headers:

traceparent: 00-trace_id-parent_span_id-flags

gRPC Metadata:

metadata['traceparent'] = '00-trace_id-parent_span_id-flags'

Message Queue (Kafka, RabbitMQ):

{
  "message": "process_order",
  "payload": {...},
  "traceparent": "00-trace_id-parent_span_id-flags"
}
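
In practice the context usually rides in the message headers rather than the payload. A sketch of the producing side using OpenTelemetry’s propagation API (introduced later in this section); the topic name and the producer client’s produce signature are assumptions:

from opentelemetry.propagate import inject

def publish_order_event(producer, payload_bytes):
    carrier = {}
    inject(carrier)  # adds 'traceparent' (and 'tracestate' if set) for the current span

    # Kafka headers are (key, bytes) pairs; adapt to your client's API
    headers = [(key, value.encode("utf-8")) for key, value in carrier.items()]
    producer.produce("orders-topic", value=payload_bytes, headers=headers)

On the consuming side, decode the headers back into a dict of strings and pass them to extract, just as in the HTTP sketch above.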

Database Connections:

-- In PostgreSQL, you can add context as a comment or variable
SET application_name = 'trace_id:4bf92f3577b34da6a3ce929d0e0e4736';
SELECT * FROM orders;

The key insight: trace context must travel with the request through every service and system boundary, so all operations are linked.

OpenTelemetry: The Open Standard

Historically, tracing was vendor-specific. If you used Datadog, your code used Datadog’s SDK. If you switched to New Relic, you rewrote instrumentation. OpenTelemetry fixed this by providing a vendor-neutral, open-source standard.

Manual Instrumentation

You explicitly create spans around operations:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

import requests  # for the outbound HTTP example; 'db' and 'enrich_order_products' are assumed app helpers

# Setup: Configure exporter and tracer provider
otlp_exporter = OTLPSpanExporter(endpoint="localhost:4317")
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))

# Get a tracer
tracer = trace.get_tracer(__name__)

# Create spans
def list_orders(user_id):
    # Span kind belongs in the API call, not as a free-form attribute
    with tracer.start_as_current_span("list_orders", kind=trace.SpanKind.INTERNAL) as span:
        span.set_attribute("user_id", user_id)

        # Database query
        with tracer.start_as_current_span("database.query") as db_span:
            db_span.set_attribute("db.system", "postgresql")
            db_span.set_attribute("db.statement", "SELECT * FROM orders WHERE user_id = ?")
            db_span.set_attribute("user_id", user_id)

            try:
                orders = db.query("SELECT * FROM orders WHERE user_id = ?", user_id)
                db_span.set_attribute("db.rows_returned", len(orders))
            except Exception as e:
                db_span.record_exception(e)
                db_span.set_attribute("error.type", e.__class__.__name__)
                raise

        # Enrich orders with product data
        with tracer.start_as_current_span("enrich_products") as enrich_span:
            enrich_span.set_attribute("order_count", len(orders))
            products = enrich_order_products(orders)

        span.set_attribute("total_orders", len(orders))
        return orders

def call_external_service(service_name, request_data):
    with tracer.start_as_current_span("http.client", kind=trace.SpanKind.CLIENT) as span:
        span.set_attribute("http.method", "POST")
        span.set_attribute("http.url", f"https://service/{service_name}")

        try:
            response = requests.post(f"https://service/{service_name}", json=request_data)
            span.set_attribute("http.status_code", response.status_code)
            return response.json()
        except requests.Timeout as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, "request timed out"))
            raise

Auto-Instrumentation

OpenTelemetry also provides auto-instrumentation, which uses Python/Java/Go instrumentation libraries to automatically create spans without touching your code:

# Python
pip install opentelemetry-distro opentelemetry-exporter-otlp

# Install instrumentation packages for the libraries already in your environment
opentelemetry-bootstrap -a install

# Add to your environment
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Run your application with auto-instrumentation
opentelemetry-instrument python my_app.py

Auto-instrumentation is powerful because:

  • Minimal code changes
  • Consistent instrumentation across your application
  • Framework libraries (Flask, Django, requests, psycopg2) are automatically instrumented

The downside: less control, and you don’t get custom business context unless you add manual spans on top.

Sampling: The Cost Reality

Here’s the uncomfortable truth: you cannot afford to trace every request in production.

If your system handles 10,000 requests per second, tracing every one means:

  • At least 10,000 spans per second (one per request at minimum; real requests usually produce several spans each)
  • At 1KB per span average, that’s at least 10MB/second, roughly 860GB/day
  • Storage costs become prohibitive

You need sampling strategies.

Head-Based Sampling

Decide whether to sample the trace at the start, before the entire request completes:

import random

def should_sample(request):
    # Head-based: use only what's known when the request arrives
    # Honor an upstream decision (the sampled flag on the incoming traceparent)
    if request.parent_sampled:
        return True

    # Always sample business-critical routes
    if request.path.startswith("/checkout"):
        return True

    # Otherwise, sample 1% of requests
    return random.random() < 0.01

Pros: Simple, no buffering required, low latency. Cons: the decision is made before you know how the request turns out, so slow requests and errors are sampled at the same low rate as everything else; you’ll miss most of the interesting traces.

Tail-Based Sampling

Decide whether to sample after the trace completes. This lets you make smarter decisions:

def should_sample(trace):
    # Sample all errors
    if trace.any_span_has_error():
        return True

    # Sample slow traces
    if trace.duration_ms > 500:
        return True

    # Sample traces calling specific services (high-value)
    if any(span.service == "payment_service" for span in trace.spans):
        return True

    # Otherwise, sample 0.1% for baseline visibility
    if random.random() < 0.001:
        return True

    return False

Pros: Captures interesting traces that head-based sampling would miss. Cons: Requires buffering all traces before deciding, adds latency, higher cost in the collector.
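
To make the “requires buffering” cost concrete, here is a heavily simplified sketch of what a tail-sampling collector has to do. The span fields and export_trace callback are assumptions, and real collectors also apply decision timeouts because the root span isn’t always the last to finish:

from collections import defaultdict

class TraceBuffer:
    def __init__(self, should_sample, export_trace):
        self.should_sample = should_sample    # the tail-based decision (errors, slow traces, ...)
        self.export_trace = export_trace      # assumed callback that ships spans to storage
        self.pending = defaultdict(list)      # trace_id -> spans buffered so far

    def on_span_finished(self, span):
        # Spans are assumed to expose trace_id and parent_span_id
        self.pending[span.trace_id].append(span)

        # Simplification: treat the root span finishing as "trace complete"
        if span.parent_span_id is None:
            spans = self.pending.pop(span.trace_id)
            if self.should_sample(spans):
                self.export_trace(spans)
            # Unsampled traces are dropped here, but the buffering was still paid for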

Adaptive Sampling

Adjust sampling rates based on traffic and cost:

import random
import time

class AdaptiveSampler:
    def __init__(self):
        self.sampling_rate = 0.01              # Start at 1%
        self.spans_per_second_limit = 100000   # Span budget
        self.sampled_at = []                   # Timestamps of recently sampled traces

    def estimate_current_span_rate(self):
        # Rough spans/second over the last 10s; assumes ~50 spans per sampled trace (illustrative)
        now = time.time()
        self.sampled_at = [t for t in self.sampled_at if now - t < 10]
        return len(self.sampled_at) / 10 * 50

    def should_sample(self, trace):
        # Approaching the span budget? Back off. Otherwise, slowly recover.
        if self.estimate_current_span_rate() > self.spans_per_second_limit * 0.8:
            self.sampling_rate = max(0.0001, self.sampling_rate * 0.9)
        else:
            self.sampling_rate = min(1.0, self.sampling_rate * 1.1)

        sampled = random.random() < self.sampling_rate
        if sampled:
            self.sampled_at.append(time.time())
        return sampled

Pro Tip: Most companies use tail-based sampling in production because it captures the traces that matter (errors, slow requests) while still controlling costs.

Trace Backends: Where Traces Live

Different platforms specialize in trace storage and analysis:

  • Jaeger: Open-source, self-hosted. Trade-off: requires operational expertise.
  • Zipkin: Simpler open-source tracing. Trade-off: less feature-rich than Jaeger.
  • Tempo (Grafana): Integrated with Grafana, cost-effective. Trade-off: newer, smaller ecosystem.
  • Datadog APM: Fully managed, great UI. Trade-off: expensive, vendor lock-in.
  • New Relic: Integrated monitoring + tracing. Trade-off: high cost.
  • AWS X-Ray: AWS native, simple setup. Trade-off: limited if multi-cloud.

For most companies, the stack is:

  • Development: OpenTelemetry SDK in application code
  • Collection: OpenTelemetry Collector (processes, samples, batches)
  • Storage: Jaeger/Tempo/Datadog
  • Visualization: Grafana/Jaeger UI/Datadog UI
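
One detail worth adding to the earlier setup sketch: give the SDK a service.name resource attribute so the backend can tell which service emitted each span. The service name below is illustrative:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Every span from this process will carry service.name = "order-service"
resource = Resource.create({"service.name": "order-service"})
trace.set_tracer_provider(TracerProvider(resource=resource))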

Diagnosing with Traces: A Real Example

Let’s walk through a trace-based diagnosis:

Problem: Checkout is slow for some users.

Step 1 - Metrics Alert: Error rate spike detected, p99 latency 5 seconds (normally 200ms).

Step 2 - Find Traces: Query traces for checkout service with duration_ms > 1000 in the last 5 minutes. You get 500 slow traces.

Step 3 - Examine a Trace: Pick one trace and view the waterfall:

checkout (5000ms)
├─ stripe_charge (4500ms) ← THE SLOW PART
│  ├─ http.request to stripe.com (4490ms)
│  └─ http.response (10ms)
├─ create_order (300ms)
└─ send_confirmation_email (200ms)

Step 4 - Investigate: The Stripe API call is taking 4.5 seconds! That’s unusual. Check Stripe’s status page. They’re experiencing latency. Not your problem.

Step 5 - Add Resilience: Add a timeout to Stripe calls. If Stripe takes over 2 seconds, fail fast instead of waiting 4.5 seconds.
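
A minimal sketch of that guard using the requests library; the endpoint, exception class, and 2-second cutoff are illustrative choices, not a prescribed implementation:

import requests

class PaymentGatewayTimeout(Exception):
    """Raised when the payment provider doesn't respond in time."""

def charge_card(payload):
    try:
        response = requests.post(
            "https://api.stripe.com/v1/charges",  # illustrative endpoint
            data=payload,
            timeout=2,  # seconds: fail fast instead of waiting 4.5s
        )
        response.raise_for_status()
        return response.json()
    except requests.Timeout:
        # Surface a clear, fast failure the checkout flow can retry or queue
        raise PaymentGatewayTimeout("stripe charge timed out after 2s")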

Step 6 - Verify: Push the change, watch new traces. Now checkout completes in under 500ms even if Stripe is slow.

This entire diagnosis, with traces, takes under 5 minutes. Without traces, you’d be restarting services at 3 AM.

Semantic Conventions: Speaking the Same Language

OpenTelemetry defines semantic conventions so all applications use consistent attribute names:

HTTP Span:
  http.method = "POST"
  http.url = "https://api.example.com/checkout"
  http.status_code = 200
  http.request.body.size = 512
  http.response.body.size = 1024
  network.peer.address = "203.0.113.195"

Database Span:
  db.system = "postgresql"
  db.user = "app_user"
  db.connection_string = "postgresql://host/db"  (never include password!)
  db.statement = "SELECT * FROM orders WHERE id = ?"
  db.rows_affected = 1

Messaging Span:
  messaging.system = "kafka"
  messaging.destination = "orders-topic"
  messaging.operation = "publish"

With conventions, tools can understand spans from any application. A Grafana dashboard built for Kafka spans works for any system using Kafka.

Trade-offs and When NOT to Trace

Storage Costs: Traces are expensive. A single request might generate 50 spans. At scale, this is TB/day. Sampling helps but you’ll still pay more than metrics.

Performance Overhead: Creating spans and exporting them adds latency. Typically 1-3% overhead with good instrumentation. Still worth it, but not free.

Instrumentation Effort: Manual instrumentation is tedious. Auto-instrumentation helps but might miss business logic spans.

Vendor Lock-in: If you use a vendor’s proprietary tracing, switching is expensive. OpenTelemetry mitigates this but you still have some lock-in to the backend.

When NOT to Trace:

  • Simple monoliths with low latency variability (metrics + logs might suffice)
  • Real-time systems where 1-3% overhead is unacceptable
  • Teams without the expertise to operate a tracing system

Key Takeaways

  • Distributed tracing shows the complete request journey across all services, revealing bottlenecks that metrics can’t see.
  • Spans are the building blocks: operation name, duration, attributes, parent span ID.
  • Trace context propagation via W3C Trace Context is essential for linking spans across service boundaries.
  • OpenTelemetry is the vendor-neutral standard; use it to decouple from specific backends.
  • Sampling is mandatory at scale. Tail-based sampling captures errors and slow requests while controlling costs.
  • Semantic conventions ensure consistency across teams and make tooling smarter.

Practice Scenarios

Scenario 1: Your recommendation service is slow. The metrics show it’s taking 800ms p99, but each individual service it calls (user service, product service, cache) is fast. Design a tracing strategy to diagnose this. What spans would you create? What attributes would you track? How would you identify the bottleneck?

Scenario 2: You have a microservices system processing messages from Kafka. A message goes through 5 services before completion. Design trace context propagation through the Kafka messages so the entire flow is traced as a single trace. How would you extract and propagate the trace ID in the message metadata?

Scenario 3: Your company processes 100,000 requests per second. Tracing every request costs $100,000/month. Tracing 10% costs $10,000/month. Design a sampling strategy that captures:

  • All payment transactions (business-critical)
  • All errors and timeouts
  • Slow requests (p99 latency)
  • Representative sample of normal requests (for baseline understanding)

What trade-offs does your strategy make?


Next: While metrics, KPIs, and traces help you understand system behavior in real-time, you also need searchable, queryable logs for post-incident analysis. In the next section, we’ll explore centralized logging architectures and how to structure logs so they’re actually useful when debugging production issues hours or days after they occur.