Managing Distributed Data
The Distributed Data Challenge
When you build a monolith, life is simple: one database, ACID transactions, immediate consistency. Your application can join tables across domains, enforce foreign key constraints, and trust that a multi-row transaction either succeeds completely or fails completely. Consistency is free. Queries that span domains are trivial. You pay a price in scalability and agility, but data management is straightforward.
Now imagine splitting that system into microservices. Each service owns its data. There are no cross-service JOINs. There are no distributed transactions that automatically roll back changes across multiple databases. There is no single source of truth — data lives in multiple copies across multiple stores. A query that once hit a single database now requires coordinating answers from several independent services. Changes that once happened atomically now ripple asynchronously across the system.
This is the deepest structural challenge in microservices architecture. Every pattern we’ve covered so far — service decomposition, API design, resilience, deployment — ultimately comes down to this: how do we manage data when we’ve given up the comfort of a centralized store?
This chapter is where the theory becomes critical. We’ll explore the patterns that let us maintain consistency without sacrificing independence, query across service boundaries without creating coupling, and build systems that are both scalable and comprehensible.
Database Per Service: The Foundation and Its Problems
The database per service principle is non-negotiable in microservices. Each service manages its own data store. This separation of concerns provides crucial benefits:
- Independent scalability: Tune database resources per service based on actual load patterns.
- Technology diversity: Use PostgreSQL for the order service, MongoDB for user profiles, Elasticsearch for search.
- Fault isolation: One service’s database failure doesn’t cascade.
- Development autonomy: Teams can evolve their schemas without coordinating across the organization.
But this freedom comes with a cost. When you scatter data across service boundaries, you create a constellation of new problems:
No Cross-Service Joins: In a monolith, fetching order details with customer information is a single SQL query. In microservices, you need to fetch from the order service, then fetch from the customer service, then combine them in memory.
No Distributed Transactions: An atomic operation that spans services no longer exists. You can’t guarantee that an order update and an inventory deduction happen together or both fail together. Either one might succeed while the other fails, leaving your system in an inconsistent state.
Data Duplication: To avoid expensive cross-service calls, you often duplicate data. The order service keeps a copy of customer information. The inventory service keeps a copy of product details. Now you have multiple sources of truth, and they drift over time.
Staleness and Eventual Consistency: Data in one service becomes eventually consistent with its source. The product recommendation engine might be seconds or minutes behind the product service. You can’t assume the data you’re reading is current.
The patterns we’ll explore next are the tools for living with these constraints. They don’t eliminate the problems — they make them manageable.
The Core Patterns: Your Arsenal
Command Query Responsibility Segregation (CQRS)
CQRS separates the model that handles writes from the model that handles reads. They’re different data structures, potentially in different storage systems, optimized for their specific responsibilities.
How it works:
- Commands come in: “Update the order status to shipped.”
- The write model (a normalized schema optimized for business logic) processes the command.
- An event is published: “OrderShipped” with the order ID and timestamp.
- The read model (denormalized, optimized for queries) subscribes to that event and updates itself asynchronously.
- When a query arrives, it hits the read model directly.
The benefit is dramatic: your write model stays clean and business-focused. Your read model can be shaped exactly to the queries you need to answer. They scale independently.
The cost is complexity. You now have two data models to maintain, and the read model always lags slightly behind reality. Your system must be comfortable with temporary inconsistency.
Event Sourcing
Instead of storing the current state of an entity, you store every change that ever happened to it. Your database becomes an append-only log of events.
An account’s balance isn’t a single number in a row. It’s the sequence of “MoneyDeposited”, “MoneyWithdrawn”, “InterestAdded” events. To find the current balance, you replay all events for that account and aggregate them.
Why this matters:
- Complete audit trail: Every change is immutable history.
- Temporal queries: You can ask “what was the state at 3 PM yesterday?” by replaying events up to that point.
- Debugging and forensics: You can replay the exact sequence that led to a bug.
- Event-driven architecture: Other services naturally subscribe to the event stream.
The challenges:
- Querying becomes complex. You can’t just ask "SELECT * FROM orders WHERE status = 'shipped'". You need to replay events and filter the results.
- Schema evolution gets complicated. Events represent decisions made in the past. If your business logic changes, old events might need to be interpreted differently.
- Snapshots are essential for performance. After replaying 10,000 events, you need a checkpoint to avoid replaying from the beginning each time.
API Composition
Fetch data from multiple services and combine the results in your application layer. The product detail page queries the product service for metadata, the review service for ratings, the inventory service for stock levels, and stitches them together.
Advantages:
- Simple and straightforward. No new infrastructure.
- Familiar to developers.
Disadvantages:
- Multiple network round-trips create latency.
- If any service is slow or down, the entire result is degraded.
- You’re creating runtime coupling between services through the composition logic.
Data Mesh and Domain-Oriented Ownership
Don’t treat “the data layer” as a monolithic problem. Instead, think of each domain owning and serving its data with self-serve infrastructure. The finance team owns financial data, provides APIs, publishes documentation. The marketing team owns customer segments. Each team is responsible for data quality, freshness, and availability in their domain.
This is organizational structure applied to data. It works when teams have clear, non-overlapping domains and the maturity to manage data infrastructure.
Change Data Capture (CDC) for Analytics
You want your data warehouse or analytics system to stay in sync with operational systems without creating coupling. CDC tools like Debezium listen to the operational database’s transaction log and stream changes to your data warehouse in real time.
The order service doesn’t know or care about the analytics pipeline. The database replication just happens. Data flows from operational systems to analytics without explicit integration points.
A Real-World Analogy
Think of how large companies operate. Each department — HR, Finance, Sales, Operations — maintains its own records.
When the CEO needs a company-wide report on employee productivity and revenue, nobody has the complete picture. You can’t use a SQL JOIN across departmental databases. So you have three options:
- API Composition: Call each department and ask for their data. Finance gives you revenue, HR gives you headcount. You combine them in a spreadsheet. This works but requires synchronous calls to every department, and if one is slow, your report is delayed.
- CQRS-like approach: Create a read-optimized executive dashboard that each department feeds into. Finance publishes revenue updates daily. HR publishes headcount weekly. The dashboard aggregates these feeds asynchronously. It’s a few hours behind reality, but always fast.
- Event Sourcing: Every department publishes a ledger of their changes to a shared bulletin board. Finance publishes every transaction. HR publishes every hire, termination, promotion. The CEO can read the bulletin board and reconstruct the state at any point in time. It’s the most complete view, but requires reading and understanding many events.
Each approach trades off complexity, latency, and consistency. No single approach is correct in all contexts.
CQRS in Depth: Separation of Concerns
Let’s implement CQRS for an order service. The write model stays clean and domain-focused:
# Write model: optimized for business logic
class Order:
    def __init__(self, id, customer_id):
        self.id = id
        self.customer_id = customer_id
        self.items = []
        self.status = "pending"

    def add_item(self, product_id, quantity):
        if quantity < 1:
            raise ValueError("Quantity must be positive")
        self.items.append({"product_id": product_id, "quantity": quantity})
        self.publish_event("ItemAdded", {"product_id": product_id, "qty": quantity})

    def confirm(self):
        if self.status != "pending":
            raise ValueError("Order already confirmed")
        self.status = "confirmed"
        self.publish_event("OrderConfirmed", {"order_id": self.id})

    def publish_event(self, event_type, payload):
        # Stand-in for the real event bus client (Kafka, RabbitMQ, ...)
        pass
The read model is denormalized, flat, and designed for specific queries:
# Read model: optimized for queries
order_read_model = {
    "order_id": "ord-123",
    "customer_id": "cust-456",
    "customer_name": "Alice Johnson",  # Denormalized from customer service
    "status": "confirmed",
    "total_amount": 249.99,
    "item_count": 3,
    "created_at": "2024-02-10T10:30:00Z",
    "estimated_delivery": "2024-02-15",  # Pre-computed for fast queries
}
When an event arrives, the read model updates asynchronously:
Customer Service publishes: CustomerUpdated(customer_id: "cust-456", name: "Alice J. Johnson")
↓
Order Service subscribes to CustomerUpdated events
↓
Order Service denormalization logic updates all orders with customer_id = "cust-456"
↓
Read model now reflects the updated customer name
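As a rough sketch of that denormalization step, the read side registers a handler for CustomerUpdated and rewrites every affected order document. Here an in-memory dict stands in for the real document store, and the handler name is illustrative:

# Simplified projection: apply CustomerUpdated to the denormalized read model
read_model = {
    "ord-123": {"customer_id": "cust-456", "customer_name": "Alice Johnson", "status": "confirmed"},
    "ord-777": {"customer_id": "cust-456", "customer_name": "Alice Johnson", "status": "pending"},
}

def on_customer_updated(event):
    # Update every order owned by the customer named in the event.
    for order in read_model.values():
        if order["customer_id"] == event["customer_id"]:
            order["customer_name"] = event["name"]

on_customer_updated({"customer_id": "cust-456", "name": "Alice J. Johnson"})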
Here’s the architecture:
graph TB
Client[Client]
WriteAPI["Write API<br/>(Commands)"]
WriteDB["Write Model<br/>PostgreSQL<br/>(Normalized)"]
EventBus["Event Bus<br/>(Kafka/RabbitMQ)"]
ReadAPI["Read API<br/>(Queries)"]
ReadDB["Read Model<br/>MongoDB<br/>(Denormalized)"]
Client -->|Command| WriteAPI
WriteAPI -->|Write| WriteDB
WriteDB -->|Publish Event| EventBus
EventBus -->|Update| ReadDB
Client -->|Query| ReadAPI
ReadAPI -->|Read| ReadDB
Benefits:
- Write model stays simple and focused on business rules.
- Read model shaped exactly to user needs (fast, denormalized).
- Scaling: add read replicas or entire read systems without touching write logic.
Trade-offs:
- More moving parts to deploy and monitor.
- Eventual consistency: read model is always a few moments behind writes.
- Requires careful handling of consistency windows.
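One common way to soften that consistency window is to version the read model and let a client poll until the projection has caught up with the write it just made. A minimal sketch, where the version field, retry count, and delay are all assumptions:

import time

# Each read-model document records the last event version applied to it.
read_model = {"ord-123": {"status": "pending", "version": 4}}

def get_order(order_id, min_version, retries=5, delay=0.2):
    # Poll until the projection reflects at least min_version, then return it.
    for _ in range(retries):
        doc = read_model.get(order_id)
        if doc and doc["version"] >= min_version:
            return doc
        time.sleep(delay)  # give the asynchronous projection time to catch up
    raise TimeoutError(f"read model for {order_id} is still behind version {min_version}")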
Event Sourcing: Complete History as Your Database
Instead of a table of current account balances, store every transaction:
Event 1: AccountCreated(account_id: "acc-789", owner: "Bob", initial_balance: 1000)
Event 2: MoneyDeposited(account_id: "acc-789", amount: 500, timestamp: 2024-02-01)
Event 3: MoneyWithdrawn(account_id: "acc-789", amount: 200, timestamp: 2024-02-02)
Event 4: InterestAdded(account_id: "acc-789", amount: 10.50, timestamp: 2024-02-10)
To get the current balance, replay these events:
- Start at 0
- Create account with 1000 → 1000
- Deposit 500 → 1500
- Withdraw 200 → 1300
- Add interest 10.50 → 1310.50
The event store (an append-only log) is your source of truth. Everything else is derived.
Code example:
class EventStore:
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def get_state(self, account_id):
        # Rebuild current state by replaying every event for this account.
        balance = 0
        for event in self.events:
            if event.account_id != account_id:
                continue
            if event.type == "AccountCreated":
                balance = event.amount  # opening balance
            elif event.type in ("MoneyDeposited", "InterestAdded"):
                balance += event.amount
            elif event.type == "MoneyWithdrawn":
                balance -= event.amount
        return balance
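Replaying the four example events through this store reproduces the 1310.50 computed above. A usage sketch, with a namedtuple as the event type (an assumption; any object with account_id, type, and amount attributes works):

from collections import namedtuple

Event = namedtuple("Event", ["account_id", "type", "amount"])

store = EventStore()
store.append(Event("acc-789", "AccountCreated", 1000))   # opening balance
store.append(Event("acc-789", "MoneyDeposited", 500))
store.append(Event("acc-789", "MoneyWithdrawn", 200))
store.append(Event("acc-789", "InterestAdded", 10.50))
print(store.get_state("acc-789"))  # 1310.5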
Why it’s powerful:
- Complete audit trail: every change is immutable history.
- Temporal queries: “What was the balance on Feb 1st?” Just replay events up to that date.
- Debugging: Replay events to understand exactly what happened.
- Resilience: If the read model gets corrupted, rebuild it by replaying the event stream.
Why it’s complex:
- Querying requires scanning events and aggregating. Index by account_id and timestamp to scale.
- Schema evolution: An old event might not match today’s expected format. Handle versioning.
- Snapshots are essential: After 100,000 events, replaying from the start is slow. Create snapshots every N events.
graph LR
Command["Command<br/>(WithdrawMoney)"]
EventStore["Event Store<br/>(Append-only log)"]
Snapshot["Snapshot<br/>(State checkpoint)"]
Projection["Projection<br/>(Read model)"]
Query["Query"]
Command -->|Append| EventStore
EventStore -->|Every 1000 events| Snapshot
EventStore -->|Stream events| Projection
Snapshot -->|Speed up replay| EventStore
Projection -->|Answer| Query
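A minimal sketch of the snapshot idea from the diagram, building on the EventStore above: checkpoint an account's balance every N events so replay can start from the latest checkpoint instead of event one. The interval and the snapshots dict are assumptions:

SNAPSHOT_INTERVAL = 1000  # assumption: checkpoint every 1,000 events

class SnapshottingEventStore(EventStore):
    def __init__(self):
        super().__init__()
        self.snapshots = {}  # account_id -> (events_already_replayed, balance)

    def append(self, event):
        super().append(event)
        # Periodically checkpoint the account touched by this event.
        if len(self.events) % SNAPSHOT_INTERVAL == 0:
            self.snapshots[event.account_id] = (
                len(self.events),
                self.get_state(event.account_id),
            )

    def get_state(self, account_id):
        # Start from the latest snapshot and replay only the newer events.
        start, balance = self.snapshots.get(account_id, (0, 0))
        for event in self.events[start:]:
            if event.account_id != account_id:
                continue
            if event.type == "AccountCreated":
                balance = event.amount
            elif event.type in ("MoneyDeposited", "InterestAdded"):
                balance += event.amount
            elif event.type == "MoneyWithdrawn":
                balance -= event.amount
        return balance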
API Composition and Its Costs
The product detail page needs information from three services:
GET /products/{id} → Product Service
GET /products/{id}/reviews → Review Service
GET /inventory/{id} → Inventory Service
A composition service (or API gateway) makes all three calls and stitches the response:
{
    "id": "prod-123",
    "name": "Laptop Stand",
    "price": 49.99,
    "description": "Adjustable aluminum stand",
    "average_rating": 4.7,
    "review_count": 234,
    "in_stock": true,
    "units_available": 15
}
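A sketch of that composition logic, assuming asyncio with httpx, three hypothetical internal service URLs, and a policy of degrading the page rather than failing it when a downstream call errors:

import asyncio
import httpx

# Assumed internal service endpoints
PRODUCT_URL = "http://product-service/products/{id}"
REVIEW_URL = "http://review-service/products/{id}/reviews"
INVENTORY_URL = "http://inventory-service/inventory/{id}"

async def get_product_detail(product_id: str) -> dict:
    async with httpx.AsyncClient(timeout=2.0) as client:
        # Issue all three requests concurrently; latency is set by the slowest call.
        results = await asyncio.gather(
            client.get(PRODUCT_URL.format(id=product_id)),
            client.get(REVIEW_URL.format(id=product_id)),
            client.get(INVENTORY_URL.format(id=product_id)),
            return_exceptions=True,  # a failed call degrades the result instead of failing it
        )
    product, reviews, inventory = (
        resp.json() if not isinstance(resp, Exception) else {} for resp in results
    )
    return {
        **product,
        "average_rating": reviews.get("average_rating"),
        "review_count": reviews.get("review_count"),
        "in_stock": inventory.get("in_stock", False),
        "units_available": inventory.get("units_available", 0),
    }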
Advantages:
- No new infrastructure or data patterns required.
- Works immediately.
Disadvantages:
- Latency: Three sequential network calls (or parallel calls with the slowest determining total time).
- Availability: If the review service is down, the entire product page fails.
- Runtime coupling: The composition logic tightly couples the three services.
Best used for:
- Small number of services (2-3).
- Data that’s frequently accessed together.
- Queries that aren’t on the critical path.
Change Data Capture (CDC) for Decoupled Analytics
You want your data warehouse to stay in sync with operational systems without the order service knowing about it.
graph LR
OrderDB["Order Database<br/>(PostgreSQL)"]
CDC["Debezium<br/>(CDC Tool)"]
Kafka["Kafka<br/>(Change Stream)"]
DataWarehouse["Data Warehouse<br/>(BigQuery/Snowflake)"]
Analytics["Analytics &<br/>Reporting"]
OrderDB -->|Transaction Log| CDC
CDC -->|Changed Records| Kafka
Kafka -->|Stream| DataWarehouse
DataWarehouse -->|Query| Analytics
The order service’s database is replicated in real time to the data warehouse without explicit coupling. No SDK, no API calls, no tight integration. The order service has no idea the analytics pipeline exists.
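On the warehouse side, a small consumer can read the Debezium change topic and upsert rows. A sketch using kafka-python, where the topic name, broker address, and loader function are assumptions:

import json
from kafka import KafkaConsumer

# Assumption: Debezium publishes order-table changes to this topic.
consumer = KafkaConsumer(
    "orders.public.orders",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")) if raw else None,
)

def upsert_into_warehouse(row: dict) -> None:
    # Placeholder for the real warehouse loader (BigQuery, Snowflake, ...)
    print("upsert", row)

for message in consumer:
    change = message.value or {}
    # A Debezium change event carries the post-change row state in an "after" field
    # (nested under "payload" with the default JSON converter).
    payload = change.get("payload", change)
    after = payload.get("after")
    if after:
        upsert_into_warehouse(after)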
The Complexity Budget: Start Simple
Here’s the hard truth: not every service needs CQRS. Not every aggregate needs event sourcing. Not every read pattern justifies a separate data model.
Use this decision framework:
| Pattern | When to Use | When to Avoid |
|---|---|---|
| Database Per Service | Always | Never (it’s mandatory) |
| API Composition | Few services, acceptable latency, non-critical paths | More than 3-4 services, strict latency SLA, high availability requirement |
| CQRS | Multiple read patterns with different performance needs, independent scaling required | Simple CRUD operations, strong consistency required, small team |
| Event Sourcing | Complete audit trail essential, temporal queries needed, event-driven architecture valuable | Simple domain, no compliance audit requirements, immature team |
| CDC | Analytics or reporting sync, high decoupling required | Operational queries, real-time consistency required |
Start with database per service and simple API composition. Add CQRS when you’re struggling with query performance or scaling. Add event sourcing when you need the audit trail or temporal capabilities. Let your operational reality guide the complexity you introduce.
Real-World Example: Order Service with CQRS and CDC
# Command, Event, and OrderItem are domain base classes assumed to be defined elsewhere
from datetime import datetime
from typing import List

# Write command
class PlaceOrder(Command):
    customer_id: str
    items: List[OrderItem]

# Event published
class OrderPlaced(Event):
    order_id: str
    customer_id: str
    total_amount: float
    timestamp: datetime

# Read model updated
order_summary = {
    "order_id": "ord-456",
    "customer_id": "cust-123",
    "customer_name": "Carol Smith",  # Denormalized
    "total": 189.99,
    "status": "pending",
    "created_at": "2024-02-10T14:30:00Z"
}

# CDC streams change
order_raw = {
    "id": "ord-456",
    "customer_id": "cust-123",
    "total": 189.99,
    "status": "pending"
}
# → Flows to data warehouse
# → Available for analytics dashboards within seconds
Data Sharing Anti-Patterns and Best Practices
Anti-pattern: Shared Database
Two services share the same PostgreSQL database and schema. This is tempting because it’s easy at first. Then:
- Services can’t evolve independently.
- Database locks create contention.
- Deployment and scaling decisions require coordination.
- You lose the benefits of microservices entirely.
Don’t do this. Accept the duplication. Use the patterns above to manage it.
Reference Data Management
Product categories, country codes, currency rates — data that multiple services need but doesn’t change often.
Approach 1 (Duplication): Each service keeps a local copy, refreshed nightly. Simple. Works if data is small and change frequency is low.
Approach 2 (Shared Service): A reference data service. Other services query it or cache responses. Creates coupling but ensures consistency.
Approach 3 (Event-Driven): The reference data service publishes changes. Other services subscribe and update local caches. Decoupled and responsive.
Use Approach 3 for critical reference data. Use Approach 1 for truly static data (ISO country codes).
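A sketch of Approach 3, with an in-process publish/subscribe stand-in for the real message broker and a currency-rate cache as the example reference data:

# In-memory stand-ins for a subscriber's local cache and the message bus
currency_rates = {"EUR/USD": 1.08}
subscribers = []

def subscribe(handler):
    subscribers.append(handler)

def publish(event):
    for handler in subscribers:
        handler(event)

# The consuming service keeps its cache warm by reacting to change events.
def on_rate_changed(event):
    currency_rates[event["pair"]] = event["rate"]

subscribe(on_rate_changed)

# The reference data service announces a change; every subscriber updates locally.
publish({"type": "CurrencyRateChanged", "pair": "EUR/USD", "rate": 1.10})
print(currency_rates["EUR/USD"])  # 1.1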
Key Takeaways
- Database per service is non-negotiable but creates new challenges: no cross-service joins, no distributed transactions, data duplication, eventual consistency.
- CQRS separates write and read models — optimize the write model for business logic, the read model for query performance. Adds complexity but enables independent scaling and tailored data shapes.
- Event Sourcing stores history instead of state — every change is an immutable event. Provides complete audit trail and enables temporal queries, but requires careful handling of snapshots and schema evolution.
- API Composition is simple but has limits — works for 2-3 services but creates latency and availability coupling at scale.
- Change Data Capture decouples analytics — stream changes from operational systems to data warehouse without explicit integration.
- Start simple and add complexity deliberately — use a decision framework to justify CQRS or event sourcing. Not every service needs them.
Practice Scenarios
Scenario 1: E-Commerce Platform
You’re building an e-commerce system with a product catalog, inventory, order service, and recommendation engine. Orders affect inventory. Recommendations need real-time product data. Customers see order history and recommendations on the same dashboard. How would you manage data across these services? Consider latency, consistency windows, and scaling needs.
Scenario 2: Banking Application
A bank needs to maintain account balances (must be accurate), transaction history (must be complete), and fraud detection. Fraud detection needs to analyze patterns across accounts in real time. How does event sourcing help here? What are the trade-offs compared with a traditional normalized database?
Scenario 3: Multi-Tenant SaaS
Your SaaS application has billing, user management, and feature services. Billing queries user data (subscription level, feature entitlements). User service queries billing for upgrade status. Reporting needs data from all three. What patterns apply here? Where would you introduce denormalization?
Looking Ahead: Observability and Resilience
We’ve now explored the complete picture of microservices architecture: decomposition, API design, resilience patterns, deployment strategies, and data management. In Chapter 17, we build on these foundations to design systems that are not just functional but also observable and resilient. You’ll learn how to instrument distributed systems so you can see what’s happening, detect failures before they cascade, and recover gracefully when things inevitably go wrong. Data consistency ensures correctness. Observability ensures you know when something is incorrect.