Inter-Service Communication
When Function Calls Become Network Requests
In a monolith, calling another module is straightforward: `orderService.calculateTotal(order)`. It’s fast, reliable, and your compiler ensures type safety. But in a microservices architecture, that same operation becomes a network request — potentially slow, unreliable, and vulnerable to failure modes that don’t exist within a single process.
This fundamental shift defines everything about your system’s reliability, performance, and operational complexity. How services communicate isn’t just a technical detail; it’s a strategic choice that ripples through your entire architecture. Choose poorly, and you’ll spend years fighting cascading failures and debugging distributed transactions. Choose well, and your system remains maintainable and responsive even as complexity grows.
In the previous section, we discussed service boundaries and how to slice a domain into independent services. Now we tackle the harder question: how do these independent services actually work together?
Synchronous vs. Asynchronous Communication
The most fundamental decision is whether communication should be synchronous or asynchronous.
Synchronous communication follows the request-response pattern: service A sends a request to service B and waits for a response before continuing. This is simple to reason about. You get immediate feedback about success or failure. Your code reads like traditional function calls.
Asynchronous communication decouples the request from the response. Service A sends a message (either a command or an event) and continues processing. Service B handles that message whenever it’s ready. The sender doesn’t wait.
These patterns create fundamentally different properties in your system:
| Property | Synchronous | Asynchronous |
|---|---|---|
| Coupling | Temporal coupling — services must be available at the same time | Temporal decoupling — services are independent |
| Consistency | Strong consistency — response confirms completion | Eventual consistency — multiple steps required |
| Complexity | Simple workflows, easy to debug | Complex workflows, harder to trace |
| Failure impact | Cascading failures common | Isolated failures, easier recovery |
| Latency | Sum of all service latencies | Only sender’s latency (if one-way) |
| Reasoning | Natural and familiar | Requires distributed tracing to follow |
Thinking in Analogies
A synchronous call is like a phone call. You dial, wait on the line, and talk until you get your answer. If the person isn’t available, the call fails immediately. You know right away whether they can help you.
An asynchronous command is like sending a letter or email. You write what you need done and drop it off. You don’t stand there waiting for a response — you move on to other things and trust it will be handled. The receiver may report back once the task is done, or not at all if no reply is needed.
An event is like a newspaper or announcement board. You publish what happened — “The inventory system just processed a shipment” — and anyone interested reads it. You’re not sending a specific command to anyone; you’re announcing something occurred. Multiple systems might react to that same event independently.
An orchestrator is like a conductor leading an orchestra. One central authority understands the workflow (“First violins, then oboes, now brass”) and coordinates each section. There’s a clear leader.
Choreography is like jazz improvisation. There’s no conductor. Each musician listens to what others are playing and responds musically. The emergent result works because everyone understands the style and listens intently.
Synchronous Patterns in Practice
REST Over HTTP
The most common synchronous pattern is REST over HTTP. It’s simple, works across languages, and every developer understands it. You make an HTTP request, get back a response.
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


class OrderServiceUnavailableError(Exception):
    pass


class OrderServiceClient:
    def __init__(self, base_url="http://order-service:8080"):
        self.base_url = base_url
        self.session = self._create_resilient_session()

    def _create_resilient_session(self):
        session = requests.Session()
        # Only idempotent methods are retried automatically; retrying a
        # failed POST could create duplicate orders.
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET", "OPTIONS"]
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        return session

    def create_order(self, user_id, items):
        try:
            response = self.session.post(
                f"{self.base_url}/api/orders",
                json={"userId": user_id, "items": items},
                timeout=5
            )
            response.raise_for_status()
            return response.json()
        except requests.exceptions.Timeout:
            raise OrderServiceUnavailableError("Order service timeout")
        except requests.exceptions.ConnectionError:
            raise OrderServiceUnavailableError("Order service unreachable")
```
This pattern is great for simple queries or commands where you need immediate feedback. But notice the defensive programming required: explicit timeouts, retry logic, and exception handling for network failures. This is the reality of distributed systems.
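The retry behavior the `Retry` adapter provides can be sketched independently of any HTTP library. This is an illustrative helper (not part of `requests`) showing exponential backoff with jitter; it should only wrap idempotent operations, since they may execute more than once:

```python
import random
import time


def call_with_backoff(operation, max_attempts=3, base_delay=1.0):
    """Retry a flaky zero-argument callable with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Exponential backoff (1s, 2s, 4s, ...) plus jitter, so many
            # clients don't retry in lockstep and create a thundering herd.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

The jitter matters more than it looks: without it, every client that failed at the same moment retries at the same moment, re-creating the overload that caused the failure.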
gRPC for Performance-Critical Paths
When latency matters — say, you’re aggregating data from five different services for a frontend request — REST’s overhead becomes noticeable. HTTP headers, JSON serialization, and text parsing add up. gRPC uses Protocol Buffers (a binary format) and HTTP/2 multiplexing, which typically yields markedly lower latency and smaller payloads than JSON over HTTP/1.1.
```protobuf
syntax = "proto3";

import "google/protobuf/empty.proto";

service InventoryService {
  rpc ReserveItems(ReservationRequest) returns (ReservationResponse);
  rpc ReleaseItems(ReleaseRequest) returns (google.protobuf.Empty);
}

message ReservationRequest {
  string order_id = 1;
  repeated LineItem items = 2;
}

message ReleaseRequest {
  string reservation_id = 1;
}

message LineItem {
  string product_id = 1;
  int32 quantity = 2;
}

message ReservationResponse {
  bool success = 1;
  string reservation_id = 2;
  string error_reason = 3;
}
```
The tradeoff: gRPC is more rigid (you must define messages in advance), harder to debug (binary format), and requires more tooling. Use it where latency is critical. For typical CRUD operations, REST’s simplicity usually wins.
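To build intuition for why a binary, schema-first encoding is smaller on the wire, here is a rough stdlib illustration using `struct`. This is not the actual Protocol Buffers wire format (protobuf uses tag/varint encoding), and the fixed field layout is invented for the example:

```python
import json
import struct

# A LineItem like the one in the .proto: a product ID plus a quantity.
item = {"product_id": "SKU-12345678", "quantity": 3}

# Text encoding: field names travel inside every single message.
json_bytes = json.dumps(item).encode("utf-8")

# Binary encoding: sender and receiver agree on the layout in advance
# (a 12-byte string followed by an int), so only the values are sent.
binary_bytes = struct.pack("12si", item["product_id"].encode(), item["quantity"])

# The binary form is a fraction of the JSON size for the same data.
print(len(json_bytes), len(binary_bytes))
```

The flip side is visible here too: the binary blob is meaningless without the agreed layout, which is exactly why gRPC requires `.proto` definitions up front and is harder to inspect on the wire.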
Circuit Breakers: Preventing Cascading Failures
When service B is down, you don’t want service A hammering it with requests. A circuit breaker is a pattern that tracks failures and temporarily stops sending requests.
```java
// Using Resilience4j
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50.0f)
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .permittedNumberOfCallsInHalfOpenState(5)
    .slowCallRateThreshold(50.0f)
    .slowCallDurationThreshold(Duration.ofSeconds(2))
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("paymentService", config);

public PaymentResult processPayment(Order order) {
    // While the breaker is open, calls fail fast with CallNotPermittedException
    return circuitBreaker.executeSupplier(() ->
        paymentClient.charge(order.getTotalAmount())
    );
}
```
The circuit breaker tracks failures. When the failure rate exceeds a threshold, it “opens” — immediately rejecting requests for a time. This gives the failing service time to recover without being buried under traffic.
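The core state machine is small enough to sketch in Python. This is an illustrative toy, not a substitute for a library like Resilience4j, which adds sliding windows, half-open probe limits, and metrics:

```python
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    """Minimal closed/open/half-open circuit breaker (illustrative sketch)."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the struggling service.
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, let one trial request through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            # Any success closes the circuit and resets the failure count.
            self.failures = 0
            self.opened_at = None
            return result
```

The key property: once open, the breaker answers callers immediately instead of tying up threads on doomed requests, which is what stops a slow dependency from dragging down everything upstream.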
Asynchronous Patterns in Practice
Asynchronous Commands
A command says “do something.” The caller sends it and moves on. Commands work well for fire-and-forget operations where you don’t need immediate confirmation.
Example: User clicks “Send Email Notification.” Rather than waiting for the email service to connect to SMTP (slow!), the order service sends a command to the notification service via a message queue and immediately returns success to the user.
```python
from typing import Dict, Any


class OrderService:
    def __init__(self, repository, message_queue):
        self.repository = repository
        self.queue = message_queue

    def place_order(self, order: Dict[str, Any]) -> str:
        order_id = self.repository.save(order)
        # Command: tell the notification service to send an email
        self.queue.publish(
            "notifications:queue",
            {
                "type": "SendOrderConfirmation",
                "orderId": order_id,
                "userId": order["userId"],
                "email": order["email"],
                "totalAmount": order["totalAmount"]
            }
        )
        return order_id


class NotificationService:
    def process_command(self, command: Dict[str, Any]):
        if command["type"] == "SendOrderConfirmation":
            self.send_confirmation_email(
                command["email"],
                command["totalAmount"]
            )
```
Event-Driven Architecture
Events are more powerful than commands. An event says “something happened,” and multiple services can react independently. This decouples services beautifully.
```python
import uuid
from datetime import datetime, timezone
from typing import Dict, Any


class OrderService:
    def __init__(self, repository, event_bus):
        self.repository = repository
        self.event_bus = event_bus

    def place_order(self, order: Dict[str, Any]) -> str:
        order_id = self.repository.save(order)
        # Event: publish what happened
        self.event_bus.publish(
            "orders.events",
            {
                "eventType": "OrderPlaced",
                "orderId": order_id,
                "userId": order["userId"],
                "items": order["items"],
                "totalAmount": order["totalAmount"],
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "version": 1
            }
        )
        return order_id


class InventoryService:
    # Listens to OrderPlaced events
    def on_order_placed(self, event):
        self.reserve_items(event["orderId"], event["items"])
        self.event_bus.publish("orders.events", {
            "eventType": "ItemsReserved",
            "orderId": event["orderId"],
            "reservationId": uuid.uuid4().hex
        })


class BillingService:
    # Different service, same event
    def on_order_placed(self, event):
        invoice = self.create_invoice(event["orderId"], event["totalAmount"])
        self.event_bus.publish("orders.events", {
            "eventType": "InvoiceCreated",
            "orderId": event["orderId"],
            "invoiceId": invoice.id
        })
```
Now, if you add a loyalty service later, it can subscribe to the same OrderPlaced event. No existing code changes required.
Orchestration vs. Choreography
When a business process spans multiple services (like an order flow), you need coordination. There are two approaches.
Orchestration: The Conductor Pattern
A central orchestrator service knows the workflow and tells each service what to do.
OrderOrchestrator coordinates:
1. Call OrderService.CreateOrder()
2. Call InventoryService.ReserveItems()
3. Call PaymentService.ChargeCard()
4. Call ShippingService.SchedulePickup()
5. Call NotificationService.SendConfirmation()
If any step fails, OrderOrchestrator handles rollback or retry.
Advantages:
- Clear workflow logic in one place
- Easy to understand the business process
- Centralized error handling and compensation
Disadvantages:
- Orchestrator becomes a bottleneck and single point of failure
- Orchestrator couples to many services
- Changes to the workflow require orchestrator changes
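The conductor pattern, including the rollback on failure, can be sketched as follows. The service interfaces (`reserve_items`, `charge_card`, and so on) are placeholders matching the workflow above, not a real API:

```python
class OrderOrchestrator:
    """Runs the workflow step by step; on failure, undoes the completed
    steps in reverse order (compensation). Illustrative sketch."""

    def __init__(self, inventory, payment, shipping):
        self.inventory = inventory
        self.payment = payment
        self.shipping = shipping

    def process(self, order):
        completed = []  # stack of compensation callbacks for rollback
        steps = [
            (lambda: self.inventory.reserve_items(order),
             lambda: self.inventory.release_items(order)),
            (lambda: self.payment.charge_card(order),
             lambda: self.payment.refund(order)),
            (lambda: self.shipping.schedule_pickup(order),
             lambda: self.shipping.cancel_pickup(order)),
        ]
        for action, compensate in steps:
            try:
                action()
                completed.append(compensate)
            except Exception:
                # Undo everything that succeeded, most recent first.
                for undo in reversed(completed):
                    undo()
                raise
        return "confirmed"
```

Notice that every forward step needs a matching compensating action; designing those compensations (what does "un-charge a card" mean?) is usually the hard part, not the orchestration loop itself.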
Choreography: The Jazz Improvisation Pattern
Each service listens to events and reacts independently. No central coordinator.
1. OrderService publishes "OrderPlaced"
2. InventoryService listens, reserves items, publishes "ItemsReserved"
3. PaymentService listens to "ItemsReserved", charges card, publishes "PaymentProcessed"
4. ShippingService listens to "PaymentProcessed", schedules pickup, publishes "ShipmentScheduled"
5. NotificationService listens to "ShipmentScheduled", sends confirmation
Each service knows its own business rules.
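The chain above can be simulated with a minimal in-memory pub/sub bus. This is a sketch for illustration; a real deployment would use a broker like Kafka or RabbitMQ, and each handler would live in its own service:

```python
from collections import defaultdict


class EventBus:
    """Tiny synchronous pub/sub bus for illustration."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, event):
        for handler in self.handlers[event_type]:
            handler(event)


def wire_checkout(bus, log):
    # Each service reacts to the previous event and publishes the next one;
    # no component knows the whole workflow.
    bus.subscribe("OrderPlaced",
                  lambda e: (log.append("items reserved"),
                             bus.publish("ItemsReserved", e)))
    bus.subscribe("ItemsReserved",
                  lambda e: (log.append("card charged"),
                             bus.publish("PaymentProcessed", e)))
    bus.subscribe("PaymentProcessed",
                  lambda e: (log.append("pickup scheduled"),
                             bus.publish("ShipmentScheduled", e)))
    bus.subscribe("ShipmentScheduled",
                  lambda e: log.append("confirmation sent"))
```

Publishing a single `OrderPlaced` event triggers the whole cascade, which demonstrates both the appeal (adding a subscriber changes nothing else) and the hazard (the end-to-end flow exists nowhere in the code).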
Advantages:
- Services are truly decoupled
- Each service can evolve independently
- No single point of failure (except message broker)
Disadvantages:
- Workflow logic is distributed — harder to see the big picture
- Debugging is complex (distributed tracing essential)
- Circular dependencies possible if not careful
Practical advice: Use orchestration for critical business workflows (orders, payments) where you need clear control and visibility. Use choreography for loosely coupled concerns like notifications and analytics.
The Transactional Outbox Pattern
Here’s a common problem: you update your database and want to publish an event, but the system crashes between the two operations. Now you’ve lost the event.
The solution is the transactional outbox pattern:
```python
import json
import uuid
from datetime import datetime, timezone
from typing import Dict, Any


class OrderService:
    def place_order(self, order: Dict[str, Any]) -> str:
        # This all happens in ONE database transaction
        with self.db.transaction():
            order_id = self.repository.save(order)
            # Same transaction! The event is durably stored with the order
            self.outbox.insert({
                "eventId": uuid.uuid4().hex,
                "eventType": "OrderPlaced",
                "aggregateId": order_id,
                "payload": json.dumps({
                    "orderId": order_id,
                    "userId": order["userId"],
                    "items": order["items"],
                    "totalAmount": order["totalAmount"]
                }),
                "createdAt": datetime.now(timezone.utc),
                "published": False
            })
        return order_id


# Separate process polls the outbox and publishes events
class OutboxPoller:
    def poll(self):
        unpublished = self.db.query(
            "SELECT * FROM outbox WHERE published = FALSE LIMIT 100"
        )
        for event in unpublished:
            try:
                self.event_bus.publish(event.eventType, event.payload)
                self.db.execute(
                    "UPDATE outbox SET published = TRUE WHERE eventId = ?",
                    event.eventId
                )
            except Exception as e:
                # The event will be retried on the next poll
                logger.error(f"Failed to publish event: {e}")
The outbox guarantees every database change produces a corresponding event — eventually. This is the foundation of reliable event-driven systems.
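One consequence worth spelling out: if the poller crashes after publishing but before marking the row, the event is published again on the next poll, so delivery is at-least-once and consumers must deduplicate. A minimal idempotent-consumer sketch (in production, the seen-IDs set would live in the consumer's own database, updated in the same transaction as its side effects):

```python
class IdempotentConsumer:
    """Skips events it has already processed, keyed by eventId."""

    def __init__(self, handler):
        self.handler = handler
        self.processed_ids = set()  # durable storage in a real system

    def handle(self, event):
        if event["eventId"] in self.processed_ids:
            return  # duplicate delivery: safely ignore
        self.handler(event)
        # Record the ID only after the handler succeeds, so a crash
        # mid-handling leads to a retry rather than a lost event.
        self.processed_ids.add(event["eventId"])
```

This is why the outbox rows above carry a unique `eventId`: it gives every downstream consumer a stable key for deduplication.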
Service Mesh: Managing Communication Transparently
As your microservices grow, you’re adding the same concerns to every service: retries, timeouts, circuit breakers, tracing, mTLS encryption. This is tedious and error-prone.
A service mesh (Istio, Linkerd) runs a sidecar proxy alongside each service. The proxy handles communication concerns transparently, without changing application code.
```yaml
# Istio VirtualService configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
  - order-service
  http:
  - match:
    - uri:
        prefix: "/api/orders"
    route:
    - destination:
        host: order-service
        port:
          number: 8080
    timeout: 5s
    retries:
      attempts: 3
      perTryTimeout: 2s
    fault:
      delay:
        percentage:
          value: 10.0
        fixedDelay: 1s
```
The proxy handles retries, timeouts, fault injection (for testing), load balancing, and mutual TLS — all without touching your application code. This is powerful, but adds operational complexity. Avoid it early; adopt it when you have enough services that the overhead becomes worth it.
A Decision Matrix for Communication Patterns
| | REST Sync | gRPC Sync | Commands | Events |
|---|---|---|---|---|
| Best for | Queries, simple commands | Performance-critical reads | One-time actions | Notifications, state changes |
| Latency | Medium | Low | Decoupled from receiver | Decoupled from receiver |
| Coupling | High | High | Medium | Low |
| Consistency | Strong | Strong | Eventual | Eventual |
| Debugging | Easy | Medium | Medium | Hard |
| Failure mode | Cascading failures | Cascading failures | Isolated | Isolated |
| Example | Fetching user profile | Aggregating service data | Send password reset email | Order placed event |
Key Takeaways
- Synchronous communication is simple but couples services: REST and gRPC are familiar and straightforward, but they create temporal coupling. When one service is slow, all callers are slow. Use circuit breakers to prevent cascades.
- Asynchronous communication decouples but adds complexity: Commands and events free you from waiting, but you must embrace eventual consistency, and distributed tracing becomes essential.
- Events are more flexible than commands: Events describe what happened; multiple services react independently. Commands say “do this,” tightly coupling intent to action.
- Choreography scales better than orchestration: As the number of services grows, orchestration becomes a bottleneck. Choreography distributes workflow logic, but demands careful observability.
- The transactional outbox pattern ensures reliability: Publish events durably alongside database changes, eliminating the risk of lost events.
- Choose a service mesh when operational complexity justifies it: Early adoption adds overhead; adopt it when you’re managing 10+ services with consistent communication concerns.
Practice Scenarios
Scenario 1: The Payment Timeout Crisis
Your payment service is experiencing intermittent slowness (5-10 second responses). Suddenly, your entire order system grinds to a halt — even simple operations time out. Users can’t create orders at all.
What went wrong? How would you fix it with the patterns we discussed? What’s the minimal change to prevent this?
Scenario 2: Event-Driven Reliability
You’re building a checkout flow: OrderService → InventoryService → PaymentService → ShippingService. You want to use an event-driven approach so each service is independent.
Design this workflow. What events are published at each stage? What happens if the PaymentService crashes after items are reserved but before payment is charged? How do you recover?
Scenario 3: Choosing Your Communication Strategy
You’re building an e-commerce platform. Your first five services are:
- UserService — manages profiles and authentication
- ProductService — product catalog
- OrderService — order management
- InventoryService — stock levels
- PaymentService — processes charges
For each interaction below, decide: sync (REST), sync (gRPC), async command, or async event? Justify your choice.
- UserService querying ProductService for recommended items to show on homepage
- OrderService reserving inventory during checkout
- OrderService charging the credit card
- InventoryService notifying AdminService when stock is low
- ProductService requesting user reviews from ReviewService
In the next section, we’ll explore how services discover each other — how does OrderService know where InventoryService is running? This is the service discovery problem, and it’s surprisingly complex in dynamic cloud environments where services scale up and down constantly.