Centralized Logging
When Logs Scatter Everywhere (And Why That’s Dangerous)
Picture this: It’s Tuesday afternoon, and your payment processing team reports that some transactions are failing. Your system architecture looks straightforward: a single payment endpoint. But underneath, there are 20 microservices running across 50 Kubernetes pods, each generating logs. When a user’s payment fails, the relevant logs are scattered across multiple containers. Some pods have already restarted (their logs are gone). Others respond slowly (querying their logs takes 30 seconds). You spend 90 minutes piecing together what happened.
This is why we need centralized logging — a single authoritative place where all logs from all services are collected, indexed, and searchable. In the distributed world where containers are ephemeral and services are plentiful, you can’t SSH into instances to tail logs. You need observability.
The Logging Pipeline: From Source to Insight
Think of centralized logging as an assembly line with five stations:
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Collection  │────→│  Transport   │────→│  Processing  │
│   (Agents)   │     │  (Reliable)  │     │(Parse/Enrich)│
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                                                 ↓
                                          ┌──────────────┐
                                          │   Storage    │
                                          │(Indexed/Fast)│
                                          └──────┬───────┘
                                                 │
                                                 ↓
                                          ┌──────────────┐
                                          │Visualization │
                                          │ (Dashboard)  │
                                          └──────────────┘
Collection: Agents run on each machine or as sidecars in containers, watching application logs and infrastructure logs (stdout, stderr, syslog files). Popular collectors: Fluentd, Filebeat, Logstash, Vector.
Transport: Logs travel over the network to your central logging infrastructure. This needs to be reliable — if a collector crashes, logs shouldn’t be lost. Use batching and retries.
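As an illustration of the reliability requirement, here is a minimal Python sketch of a shipper’s transport loop with batching and retries; send_batch stands in for the real network call, and the batch size and backoff limits are arbitrary choices for the sketch:
# Transport sketch: buffer records, send them in batches, retry with exponential
# backoff so a transient failure does not lose logs. Sizes/limits are illustrative.
import time

BATCH_SIZE = 500
MAX_RETRIES = 5

def send_batch(batch):
    # Stand-in for the real HTTP/gRPC call to the central ingest endpoint.
    # A real sender raises on failure so the caller can retry.
    print(f"shipped {len(batch)} records")

def ship(buffer):
    """Drain the buffer in batches; on repeated failure, hand the rest back instead of dropping it."""
    while buffer:
        batch, buffer = buffer[:BATCH_SIZE], buffer[BATCH_SIZE:]
        for attempt in range(MAX_RETRIES):
            try:
                send_batch(batch)
                break
            except Exception:
                time.sleep(min(2 ** attempt, 30))  # back off: 1s, 2s, 4s, ... capped at 30s
        else:
            return batch + buffer  # every retry failed: keep these records for the next run
    return []

leftover = ship([{"level": "INFO", "message": f"event {i}"} for i in range(1200)])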
Processing: Raw logs are often messy — multiline stack traces, mixed JSON and plaintext, missing context. The pipeline parses them (extract timestamp, severity, message), enriches them (add pod name, namespace, Kubernetes metadata, version info), filters them (drop noisy debug logs), and transforms them (mask sensitive data).
Storage: Logs are stored in an indexed, queryable format. Elasticsearch is the popular choice (full-text search, powerful queries, great for analytics). Grafana Loki is an alternative (uses labels instead of full-text indexing — much cheaper for high-volume logging).
Visualization: Dashboards and UIs let you search, filter, and understand logs. Kibana (for Elasticsearch), Grafana, Datadog, and Cloud providers all offer interfaces.
ELK, EFK, Loki, and Cloud Solutions
The ELK Stack (Elasticsearch, Logstash, Kibana) has dominated centralized logging for a decade. It’s powerful: full-text search, complex queries, real-time analytics. But it’s also expensive and complex to operate. Many teams run into performance issues at scale.
The EFK Stack (Elasticsearch, Fluentd, Kibana) is similar, but swaps Logstash (heavier, Java-based) for Fluentd (lighter, Ruby-based). Fluentd is more extensible and generally easier on resources.
Grafana Loki is the newcomer (2018+). Instead of indexing every word in every log, Loki indexes only labels (job name, pod, namespace, environment). The logs themselves are stored compressed and unindexed. This means Loki costs 1/10th as much as Elasticsearch for the same volume — but queries are less flexible. You can’t search arbitrary text; you search by labels then grep within results.
Cloud-native solutions (AWS CloudWatch, Google Cloud Logging, Azure Monitor) handle collection, storage, and visualization for you. You don’t manage infrastructure, but you’re locked into one vendor and pricing can be opaque.
Pro Tip: For most teams starting out, Grafana Loki or a managed solution is the right choice. ELK is powerful, but the operational burden often isn’t worth it unless you have dedicated infrastructure expertise.
A Library Analogy
Imagine a national library system with thousands of branches. Each branch (microservice) keeps its own card catalog of books (logs). When you need to find a specific book, you could visit every branch and ask the librarian — but that’s impossibly slow and doesn’t work if the branch is closed (pod restarted, logs deleted).
Instead, there’s a central catalog system (centralized logging) that indexes every book in every branch. Now you can search from home: “Find all books about payment failures from the last hour.” The central system tells you which branches have them.
How We Collect Logs in Practice
There are three main patterns for log collection in containerized environments:
1. Sidecar Pattern
Each pod gets a log-collecting sidecar (e.g., Fluentd) that runs alongside your application. The sidecar reads the application’s log file and ships it to your central store.
Pros: Isolated per application, can configure per-app collection rules. Cons: Adds overhead (one extra container per pod), more complex deployments.
2. DaemonSet Pattern
One collector (e.g., Fluentd) runs on every Kubernetes node. It reads logs from all containers on that node by mounting the host’s filesystem.
Pros: Resource-efficient (shared collector), simple cluster setup. Cons: Less per-application control, harder to apply app-specific collection rules.
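To make this concrete, here is a small Python sketch of the first thing a node-level collector does: enumerate the container log files on the host and recover pod metadata from their names. It assumes the common kubelet layout where /var/log/containers holds files named <pod>_<namespace>_<container>-<id>.log; treat that path and naming scheme as assumptions about your cluster.
# DaemonSet collector sketch: list container log files on the node and derive
# pod/namespace/container from the file name (layout assumed, see above).
import glob
import json
import os

def discover_container_logs(log_dir="/var/log/containers"):
    for path in glob.glob(os.path.join(log_dir, "*.log")):
        name = os.path.basename(path)[:-len(".log")]
        try:
            pod, namespace, rest = name.split("_", 2)
            container = rest.rsplit("-", 1)[0]
        except ValueError:
            continue  # unexpected name; a real collector would still ship the file
        yield {"path": path, "pod": pod, "namespace": namespace, "container": container}

for source in discover_container_logs():
    print(json.dumps(source))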
3. Direct Shipping (Library Pattern)
Applications send logs directly to the central store (e.g., pushing to Loki via a client library instead of writing to stdout for a collector to pick up).
Pros: Simple, no extra infrastructure. Cons: Places logging burden on application, harder to handle transient failures.
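Here is a sketch of direct shipping from application code, assuming a Loki-style HTTP push endpoint (commonly exposed at /loki/api/v1/push, accepting label sets plus nanosecond-timestamped lines); the URL and labels are placeholders, and a real client library would add batching and retries:
# Direct-shipping sketch: push one structured log line straight to the central
# store over HTTP. Endpoint format assumes Loki's push API; URL/labels are placeholders.
import json
import time
import urllib.request

LOKI_URL = "http://loki.internal:3100/loki/api/v1/push"  # assumed endpoint

def push_log(message, level="INFO", labels=None):
    stream_labels = {"job": "payment-service", "level": level, **(labels or {})}
    payload = {"streams": [{
        "stream": stream_labels,
        "values": [[str(time.time_ns()), json.dumps({"level": level, "message": message})]],
    }]}
    req = urllib.request.Request(LOKI_URL, data=json.dumps(payload).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5).close()

push_log("Failed to process payment", level="ERROR", labels={"namespace": "prod"})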
Most teams use DaemonSet for infrastructure logs (kubelet, system logs) and either Sidecar or Direct Shipping for application logs.
Structured Logging: Design for Searchability
Raw logs look like this:
2024-02-13 14:32:05 ERROR: Failed to process payment
Structured logs look like this:
{
"timestamp": "2024-02-13T14:32:05Z",
"level": "ERROR",
"message": "Failed to process payment",
"user_id": "user_456",
"transaction_id": "txn_789",
"trace_id": "trace_abc123",
"error_code": "PAYMENT_GATEWAY_TIMEOUT",
"gateway": "stripe",
"retry_count": 2
}
Structured logging gives you searchability: find all payment errors for user_456 in the last hour, or all errors from the Stripe gateway. Here’s how to implement it:
Establish a JSON schema. Document what fields every log should include. Example:
- timestamp (ISO 8601)
- level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- message (human-readable summary)
- service (name of the service)
- trace_id (connects logs across services)
- request_id (connects logs within a single request)
- user_id (for accountability)
- Custom fields per service
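As one way to emit that schema from application code, here is a minimal sketch using Python’s standard logging module; the service name is an assumed constant, and per-event fields are passed via extra=:
# Minimal structured-logging setup: every record becomes one JSON object
# following the schema above. SERVICE_NAME is an assumed constant.
import json
import logging
from datetime import datetime, timezone

SERVICE_NAME = "payment-service"
EXTRA_FIELDS = ("trace_id", "request_id", "user_id", "transaction_id", "error_code")

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": SERVICE_NAME,
            **{k: v for k, v in record.__dict__.items() if k in EXTRA_FIELDS},
        }
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger(SERVICE_NAME)
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.error("Failed to process payment",
             extra={"user_id": "user_456", "transaction_id": "txn_789",
                    "error_code": "PAYMENT_GATEWAY_TIMEOUT"})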
Use correlation IDs. Every request that enters your system gets a unique trace_id. As the request flows through services, each log includes that trace_id. Later, you search for trace_id=xyz and see the entire request journey.
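Correlation IDs are easiest to attach when you do not have to pass them through every function signature. A sketch using a contextvar plus a logging filter, assuming the trace_id arrives on an incoming request (the X-Trace-Id header name is an assumption):
# Keep the current request's trace_id in a contextvar and inject it into every
# log record via a handler filter, so all logs for one request share the ID.
import contextvars
import logging
import uuid

trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

def handle_request(headers):
    # Reuse a propagated trace_id if present; otherwise mint one at the edge of the system.
    trace_id_var.set(headers.get("X-Trace-Id") or uuid.uuid4().hex)
    logging.getLogger(__name__).info("processing payment")

handle_request({"X-Trace-Id": "trace_abc123"})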
Get log levels right. DEBUG for detailed diagnostic info (unneeded in production). INFO for significant events (startup, shutdown, important milestones). WARNING for recoverable issues (retry #1, slow query). ERROR for problems requiring action. CRITICAL for system-threatening issues.
Processing and Transformation
Raw logs are only valuable after processing. This happens in the pipeline:
Parsing: Extract structure from unstructured text. Grok patterns (regex) parse Apache logs, Java stack traces, syslog output, etc. For structured logs, just validate JSON.
Enrichment: Add contextual metadata. Your Fluentd collector knows the pod name and namespace; add them to every log. Add the Git commit SHA of the running version. Add the Kubernetes cluster name. These enrich logs without the application needing to know about it.
Filtering: Drop logs you don’t need. In production, maybe drop all DEBUG logs (save space and cost). Drop health check logs (they’re noisy and uninteresting). Drop requests to /healthz endpoint.
Transformation: Redact sensitive data. Before storing, scan for credit card patterns, API keys, passwords, and replace them with [REDACTED]. This prevents accidental exposure if someone gains access to your logs.
Example filter rules (Fluentd-style, shown here as pseudocode):
- Match: level="DEBUG" AND environment="production"
  Action: Drop the record
- Match: field path contains "password" OR contains "api_key"
  Action: Replace the value with [REDACTED]
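If you run part of this processing in your own code rather than in the collector, the same rules might look like the sketch below; the regexes are deliberately simplistic placeholders, not a complete secret or PII detector:
# Processing-step sketch: drop noisy records and redact obvious secrets before storage.
import re

SENSITIVE = [
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[REDACTED]"),  # rough credit-card-like numbers
    (re.compile(r"((?:password|api_key|secret)[\w\s]*[:=]\s*)\S+", re.IGNORECASE), r"\1[REDACTED]"),
]

def process(record, environment="production"):
    """Return the record ready for storage, or None if it should be dropped."""
    if environment == "production" and record.get("level") == "DEBUG":
        return None  # drop: too noisy and costly to index in production
    message = record.get("message", "")
    for pattern, replacement in SENSITIVE:
        message = pattern.sub(replacement, message)
    return {**record, "message": message}

print(process({"level": "ERROR", "message": "Failed login, password attempt: secret123"}))
print(process({"level": "DEBUG", "message": "cache hit"}))  # -> None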
Storage: Hot, Warm, Cold
Logs grow fast. A single service generating 100 MB/second will produce 8.6 TB per day. You can’t keep everything hot (real-time searchable). Use a tiered approach:
- Hot tier: Last 7 days, full-text indexed, immediately searchable. Lives in Elasticsearch/Loki.
- Warm tier: 7-30 days, still queryable but slower, reduced index granularity.
- Cold tier: Beyond 30 days, compressed and archived in S3/GCS, queryable only for compliance/forensics (slow and expensive).
Use index lifecycle policies to automatically move logs between tiers. Save 70% of storage costs this way.
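Here is a sketch of such a lifecycle policy for the tiers above, written as a Python dict in Elasticsearch ILM style; phase and action names follow ILM conventions, but exact fields vary by version, so treat it as an outline rather than a drop-in policy:
# Hot/warm/cold/delete lifecycle outline (Elasticsearch ILM style; fields vary by version).
logs_lifecycle_policy = {
    "policy": {
        "phases": {
            "hot": {  # most recent data: fully indexed, fast to search
                "actions": {"rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}}
            },
            "warm": {  # 7-30 days: still queryable, cheaper to hold
                "min_age": "7d",
                "actions": {"shrink": {"number_of_shards": 1},
                            "forcemerge": {"max_num_segments": 1}},
            },
            "cold": {  # beyond 30 days: archived to object storage, slow to query
                "min_age": "30d",
                "actions": {"searchable_snapshot": {"snapshot_repository": "s3-log-archive"}},
            },
            "delete": {  # retention limit (90d here is an example value)
                "min_age": "90d",
                "actions": {"delete": {}},
            },
        }
    }
}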
Searching Logs: Finding the Needle
Once logs are centralized, you search them. Some common patterns:
# All errors in the payment service in the last hour
level="ERROR" AND service="payment-service" AND @timestamp > now-1h
# All requests that took more than 5 seconds
duration_ms > 5000 AND @timestamp > now-1h
# All errors related to a specific user
user_id="user_456" AND level="ERROR"
# All requests from a specific transaction
trace_id="trace_abc123"
Advanced teams extract metrics from logs. “From logs, extract the p99 latency of the payment API” — this is cheaper than maintaining a separate metrics system for every possible dimension.
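As a toy illustration of extracting a metric from logs, the sketch below parses duration fields out of structured log lines and computes a nearest-rank p99; the field names follow the schema from earlier, and duration_ms is assumed to be logged per request:
# Toy "metrics from logs": nearest-rank p99 latency for one service, computed
# from structured (JSON) log lines that carry a duration_ms field.
import json
import math

def p99_latency(log_lines, service="payment-service"):
    durations = sorted(
        rec["duration_ms"]
        for rec in map(json.loads, log_lines)
        if rec.get("service") == service and "duration_ms" in rec
    )
    if not durations:
        return None
    rank = math.ceil(0.99 * len(durations))  # nearest-rank method
    return durations[rank - 1]

lines = [json.dumps({"service": "payment-service", "duration_ms": d}) for d in range(1, 1001)]
print(p99_latency(lines))  # -> 990 for this synthetic data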
Security and Compliance
Logs are sensitive. They contain user IDs, transaction details, error messages that might reveal system internals, and sometimes accidentally captured passwords.
- Access control: Not everyone should read logs. Restrict log access to on-call engineers and compliance auditors.
- PII redaction: Automatically scrub logs for personally identifiable information before storage.
- Audit logs: Maintain separate audit logs for compliance (GDPR, HIPAA, SOX). Document who accessed which logs and when.
- Retention: Delete logs after the retention period (30 days is common; GDPR requires eventual deletion unless you have a legitimate reason to keep them).
Trade-offs at Scale
ELK/Elasticsearch vs. Loki: Elasticsearch gives you powerful, flexible queries. Loki saves 10x on costs. Choose based on whether you need to search arbitrary text or if label-based search is enough.
Self-hosted vs. managed: Self-hosted (ELK on your Kubernetes cluster) gives control and avoids vendor lock-in. Managed (Datadog, CloudWatch, Cloud Logging) eliminates operational burden.
Log volume management: Logging everything is expensive and creates a security liability. But logging too little means you miss context when things break. Find the balance: log all INFO/ERROR/WARNING, sample DEBUG.
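One way to strike that balance in application code is a logging filter that passes everything at INFO and above but lets only a random sample of DEBUG records through; a sketch, with the 1% rate as an arbitrary choice:
# Keep all INFO/WARNING/ERROR records; let only a random sample of DEBUG through.
import logging
import random

class DebugSampler(logging.Filter):
    def __init__(self, sample_rate=0.01):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True  # always keep INFO and above
        return random.random() < self.sample_rate

handler = logging.StreamHandler()
handler.addFilter(DebugSampler(sample_rate=0.01))
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.DEBUG)  # DEBUG is emitted, but the sampler thins it out

for i in range(1000):
    root.debug("cache lookup %d", i)  # roughly 10 of these survive
root.info("startup complete")  # always kept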
Storage costs: At 100 MB/s, you’re paying thousands per month for storage. Aggressive filtering, compression, and tiering are essential.
Did You Know? Some teams have accidentally logged cryptographic keys or database credentials. A single forgotten print(api_key) in error handling can expose your infrastructure to the world if logs are readable. Automated redaction patterns are a safety net.
Key Takeaways
- Centralized logging is non-negotiable in distributed systems: logs from ephemeral containers must be collected to a durable, searchable store before the container disappears.
- Structured logging (JSON) is vastly better than plaintext: invest in it early; retrofitting later is painful.
- Correlation IDs (trace_id, request_id) are your debugging superpower: they let you follow a single request through 10 services in seconds.
- Choose your tool based on query needs: full-text search (ELK), label-based (Loki), managed convenience (cloud providers).
- Process logs aggressively: filter, enrich, redact, and tier by age; this saves money and protects privacy.
- Don’t log PII unless you must, and redact it if you do; logs are a compliance risk if not handled carefully.
Practice Scenarios
Scenario 1: The 2 AM Page
Your on-call engineer gets paged at 2 AM. An alert says “payment errors spiking.” She logs into your centralized logging system and searches:
service="payment-service" AND level="ERROR" AND @timestamp > now-1h
She sees 500 errors in the last 15 minutes. Narrowing down:
service="payment-service" AND error_code="DATABASE_CONNECTION_TIMEOUT"
Aha — the database connection pool is exhausted. She pages the database team. Correlation IDs in the logs show that users started timing out at exactly 1:47 AM — when a deployment rolled out a change that increased database connection usage.
Scenario 2: Compliance Audit
Your compliance team needs to audit who accessed customer payment data in January. You query:
user_id="user_456" AND @timestamp between 2024-01-01 and 2024-02-01
You see logs from your API service that processed requests for this user. You also have audit logs showing that the on-call engineer at 3 PM on Jan 15 explicitly searched for this user’s records (unusual, triggers investigation). Centralized logging with proper audit trails lets you answer these questions.
Scenario 3: The PII Leak Prevention
A developer accidentally pushed code with logger.error(f"Failed login for user: {username}, password attempt: {password}"). The automated redaction rule catches this before storage:
Input: "Failed login for user: [email protected], password attempt: secret123"
Output: "Failed login for user: [email protected], password attempt: [REDACTED]"
Without the redaction pipeline, this would have been a security incident waiting to happen.
Next, we’ll explore what to do when logs show problems: alerting and incident response. Logs are only valuable if someone is paying attention.