System Design Fundamentals

Inter-Service Communication

When Function Calls Become Network Requests

In a monolith, calling another module is straightforward: orderService.calculateTotal(order). It’s fast, reliable, and your compiler ensures type safety. But in a microservices architecture, that same operation becomes a request across the network — potentially slow, unreliable, and vulnerable to failure modes that don’t exist within a single process.

This fundamental shift defines everything about your system’s reliability, performance, and operational complexity. How services communicate isn’t just a technical detail; it’s a strategic choice that ripples through your entire architecture. Choose poorly, and you’ll spend years fighting cascading failures and debugging distributed transactions. Choose well, and your system remains maintainable and responsive even as complexity grows.

In the previous section, we discussed service boundaries and how to slice a domain into independent services. Now we tackle the harder question: how do these independent services actually work together?

Synchronous vs. Asynchronous Communication

The most fundamental decision is whether communication should be synchronous or asynchronous.

Synchronous communication follows the request-response pattern: service A sends a request to service B and waits for a response before continuing. This is simple to reason about. You get immediate feedback about success or failure. Your code reads like traditional function calls.

Asynchronous communication decouples the request from the response. Service A sends a message (either a command or an event) and continues processing. Service B handles that message whenever it’s ready. The sender doesn’t wait.
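A minimal sketch makes the difference in shape concrete. The payment_client and queue objects below are hypothetical stand-ins for an HTTP client and a message-broker client: the synchronous version blocks until the other service answers, while the asynchronous version returns as soon as the message is handed to the broker.

def checkout_sync(order, payment_client):
    # Synchronous: block until the payment service responds (or times out).
    result = payment_client.charge(order["total"], timeout=5)
    return {"orderId": order["id"], "paid": result["success"]}

def checkout_async(order, queue):
    # Asynchronous: hand the message to the broker and return immediately.
    # The outcome arrives later, via a response message or event.
    queue.publish("payments:commands", {
        "type": "ChargeCard",
        "orderId": order["id"],
        "amount": order["total"],
    })
    return {"orderId": order["id"], "status": "payment_pending"}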

These patterns create fundamentally different properties in your system:

| Property       | Synchronous                                                      | Asynchronous                                   |
|----------------|------------------------------------------------------------------|------------------------------------------------|
| Coupling       | Temporal coupling — services must be available at the same time  | Temporal decoupling — services are independent |
| Consistency    | Strong consistency — response confirms completion                | Eventual consistency — multiple steps required |
| Complexity     | Simple workflows, easy to debug                                  | Complex workflows, harder to trace             |
| Failure impact | Cascading failures common                                        | Isolated failures, easier recovery             |
| Latency        | Sum of all service latencies                                     | Only sender’s latency (if one-way)             |
| Reason about   | Natural, familiar                                                | Requires distributed tracing to understand     |

Thinking in Analogies

A synchronous call is like a phone call. You dial, wait on the line, and talk until you get your answer. If the person isn’t available, the call fails immediately. You know right away whether they can help you.

An asynchronous command is like sending a letter or email. You write what you need done and drop it off. You don’t stand there waiting for a response — you move on to other things. You trust it will be handled. The receiver replies when the task is done (or doesn’t reply at all, if no response is needed).

An event is like a newspaper or announcement board. You publish what happened — “The inventory system just processed a shipment” — and anyone interested reads it. You’re not sending a specific command to anyone; you’re announcing something occurred. Multiple systems might react to that same event independently.

An orchestrator is like a conductor leading an orchestra. One central authority understands the workflow (“First violins, then oboes, now brass”) and coordinates each section. There’s a clear leader.

Choreography is like jazz improvisation. There’s no conductor. Each musician listens to what others are playing and responds musically. The emergent result works because everyone understands the style and listens intently.

Synchronous Patterns in Practice

REST Over HTTP

The most common synchronous pattern is REST over HTTP. It’s simple, works across languages, and every developer understands it. You make an HTTP request, get back a response.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class OrderServiceUnavailableError(Exception):
    """Raised when the order service cannot be reached in time."""

class OrderServiceClient:
    def __init__(self, base_url="http://order-service:8080"):
        self.base_url = base_url
        self.session = self._create_resilient_session()

    def _create_resilient_session(self):
        session = requests.Session()
        # Retry transient failures, but only for idempotent methods.
        # A POST that timed out may already have taken effect, so it is
        # deliberately excluded from automatic retries.
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET", "OPTIONS"]
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        return session

    def create_order(self, user_id, items):
        try:
            response = self.session.post(
                f"{self.base_url}/api/orders",
                json={"userId": user_id, "items": items},
                timeout=5  # never wait indefinitely on a network call
            )
            response.raise_for_status()
            return response.json()
        except requests.exceptions.Timeout:
            raise OrderServiceUnavailableError("Order service timeout")
        except requests.exceptions.ConnectionError:
            raise OrderServiceUnavailableError("Order service unreachable")

This pattern is great for simple queries or commands where you need immediate feedback. But notice the defensive programming required: explicit timeouts, retry logic, and exception handling for network failures. This is the reality of distributed systems.
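A caller might use the client roughly as follows; the item payload and the response’s id field are illustrative, not part of any defined API.

client = OrderServiceClient()

try:
    # Hypothetical payload shape, for illustration only
    order = client.create_order("user-123", [{"productId": "sku-42", "quantity": 2}])
    print("Created order", order.get("id"))
except OrderServiceUnavailableError:
    # Fail gracefully instead of letting the whole request crash
    print("Order service is unavailable, please try again shortly")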

gRPC for Performance-Critical Paths

When latency matters — say, you’re aggregating data from five different services for a frontend request — REST’s overhead becomes noticeable. HTTP headers, JSON serialization, and text parsing add up. gRPC uses Protocol Buffers (a compact binary format) and HTTP/2 multiplexing, which typically makes calls significantly faster than equivalent JSON-over-HTTP requests.

syntax = "proto3";

import "google/protobuf/empty.proto";

service InventoryService {
  rpc ReserveItems(ReservationRequest) returns (ReservationResponse);
  rpc ReleaseItems(ReleaseRequest) returns (google.protobuf.Empty);
}

message ReservationRequest {
  string order_id = 1;
  repeated LineItem items = 2;
}

message LineItem {
  string product_id = 1;
  int32 quantity = 2;
}

message ReservationResponse {
  bool success = 1;
  string reservation_id = 2;
  string error_reason = 3;
}

// Illustrative definition so the file compiles; the original snippet
// referenced ReleaseRequest without defining its fields.
message ReleaseRequest {
  string reservation_id = 1;
}

The tradeoff: gRPC is more rigid (you must define messages in advance), harder to debug (binary format), and requires more tooling. Use it where latency is critical. For typical CRUD operations, REST’s simplicity usually wins.
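On the caller’s side, protoc generates stubs that read like local function calls. The sketch below assumes the definitions above live in inventory.proto (so the generated modules are inventory_pb2 and inventory_pb2_grpc); the service address and the item dictionaries are illustrative.

import grpc

import inventory_pb2
import inventory_pb2_grpc

def reserve_items(order_id, items):
    # In production you would create the channel once and reuse it.
    with grpc.insecure_channel("inventory-service:50051") as channel:
        stub = inventory_pb2_grpc.InventoryServiceStub(channel)
        request = inventory_pb2.ReservationRequest(
            order_id=order_id,
            items=[
                inventory_pb2.LineItem(product_id=item["productId"], quantity=item["quantity"])
                for item in items
            ],
        )
        # Deadlines matter here just as timeouts did for REST.
        return stub.ReserveItems(request, timeout=2)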

Circuit Breakers: Preventing Cascading Failures

When service B is down, you don’t want service A hammering it with requests. A circuit breaker is a pattern that tracks failures and temporarily stops sending requests.

// Using Resilience4j
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50.0f)                       // open when more than 50% of calls fail
    .waitDurationInOpenState(Duration.ofSeconds(30))   // stay open for 30s before probing again
    .permittedNumberOfCallsInHalfOpenState(5)          // allow 5 trial calls when half-open
    .slowCallRateThreshold(50.0f)                      // count excessive slow calls as failure
    .slowCallDurationThreshold(Duration.ofSeconds(2))  // "slow" means slower than 2s
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("paymentService", config);

public PaymentResult processPayment(Order order) {
    // Calls are routed through the breaker; when it is open,
    // executeSupplier fails fast instead of calling the payment service.
    return circuitBreaker.executeSupplier(() ->
        paymentClient.charge(order.getTotalAmount())
    );
}

The circuit breaker tracks failures. When the failure rate exceeds a threshold, it “opens” — immediately rejecting requests for a time. This gives the failing service time to recover without being buried under traffic.
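The mechanism is small enough to sketch by hand. The class below is a simplified illustration of the state machine, not the Resilience4j implementation: it counts consecutive failures, opens once a threshold is crossed, and lets a trial call through after a cooldown.

import time

class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "CLOSED"   # CLOSED -> OPEN -> HALF_OPEN -> CLOSED
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"   # let one trial call through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "CLOSED"
            return result

Wrapping a flaky downstream call, for example breaker.call(payment_client.charge, amount), turns a pile-up of slow timeouts into an immediate, cheap failure while the downstream service recovers.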

Asynchronous Patterns in Practice

Asynchronous Commands

A command says “do something.” The caller sends it and moves on. Commands work well for fire-and-forget operations where you don’t need immediate confirmation.

Example: User clicks “Send Email Notification.” Rather than waiting for the email service to connect to SMTP (slow!), the order service sends a command to the notification service via a message queue and immediately returns success to the user.

import json
from typing import Dict, Any

class OrderService:
    def __init__(self, repository, message_queue):
        self.repository = repository
        self.queue = message_queue

    def place_order(self, order: Dict[str, Any]) -> str:
        order_id = self.repository.save(order)

        # Command: tell notification service to send email
        self.queue.publish(
            "notifications:queue",
            {
                "type": "SendOrderConfirmation",
                "orderId": order_id,
                "userId": order["userId"],
                "email": order["email"],
                "totalAmount": order["totalAmount"]
            }
        )

        return order_id

class NotificationService:
    def process_command(self, command: Dict[str, Any]):
        if command["type"] == "SendOrderConfirmation":
            self.send_confirmation_email(
                command["email"],
                command["totalAmount"]
            )

Event-Driven Architecture

Events are more powerful than commands. An event says “something happened,” and multiple services can react independently. This decouples services beautifully.

import uuid
from datetime import datetime
from typing import Any, Dict

class OrderService:
    def __init__(self, repository, event_bus):
        self.repository = repository
        self.event_bus = event_bus

    def place_order(self, order: Dict[str, Any]) -> str:
        order_id = self.repository.save(order)

        # Event: publish what happened
        self.event_bus.publish(
            "orders.events",
            {
                "eventType": "OrderPlaced",
                "orderId": order_id,
                "userId": order["userId"],
                "items": order["items"],
                "totalAmount": order["totalAmount"],
                "timestamp": datetime.utcnow().isoformat(),
                "version": 1
            }
        )

        return order_id

class InventoryService:
    # Listens to OrderPlaced events
    def on_order_placed(self, event):
        self.reserve_items(event["orderId"], event["items"])
        self.event_bus.publish("orders.events", {
            "eventType": "ItemsReserved",
            "orderId": event["orderId"],
            "reservationId": uuid.uuid4().hex
        })

class BillingService:
    # Different service, same event
    def on_order_placed(self, event):
        invoice = self.create_invoice(event["orderId"], event["totalAmount"])
        self.event_bus.publish("orders.events", {
            "eventType": "InvoiceCreated",
            "orderId": event["orderId"],
            "invoiceId": invoice.id
        })

Now, if you add a loyalty service later, it can subscribe to the same OrderPlaced event. No existing code changes required.
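A hypothetical LoyaltyService illustrates the point. The subscribe call shown here is illustrative, but nothing in OrderService changes:

class LoyaltyService:
    def __init__(self, event_bus):
        self.event_bus = event_bus
        # Listen to the same stream the other services already consume.
        self.event_bus.subscribe("orders.events", self.on_event)

    def on_event(self, event):
        if event["eventType"] == "OrderPlaced":
            # e.g. one point per currency unit spent
            self.award_points(event["userId"], int(event["totalAmount"]))

    def award_points(self, user_id, points):
        ...  # persist loyalty points for the user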

Orchestration vs. Choreography

When a business process spans multiple services (like an order flow), you need coordination. There are two approaches.

Orchestration: The Conductor Pattern

A central orchestrator service knows the workflow and tells each service what to do.

OrderOrchestrator coordinates:
1. Call OrderService.CreateOrder()
2. Call InventoryService.ReserveItems()
3. Call PaymentService.ChargeCard()
4. Call ShippingService.SchedulePickup()
5. Call NotificationService.SendConfirmation()

If any step fails, OrderOrchestrator handles rollback or retry.
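A stripped-down orchestrator might look like the sketch below. The service clients and their compensation methods are hypothetical; the point is that the happy path and the rollback logic live in one place (only the payment step’s compensation is shown).

class OrderOrchestrator:
    def __init__(self, orders, inventory, payments, shipping, notifications):
        self.orders = orders
        self.inventory = inventory
        self.payments = payments
        self.shipping = shipping
        self.notifications = notifications

    def place_order(self, order):
        order_id = self.orders.create_order(order)
        reservation_id = self.inventory.reserve_items(order_id, order["items"])
        try:
            self.payments.charge_card(order_id, order["totalAmount"])
        except Exception:
            # Compensation: undo the earlier steps, then surface the failure.
            self.inventory.release_items(reservation_id)
            self.orders.cancel_order(order_id)
            raise
        self.shipping.schedule_pickup(order_id)
        self.notifications.send_confirmation(order_id)
        return order_id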

Advantages:

  • Clear workflow logic in one place
  • Easy to understand the business process
  • Centralized error handling and compensation

Disadvantages:

  • Orchestrator becomes a bottleneck and single point of failure
  • Orchestrator couples to many services
  • Changes to the workflow require orchestrator changes

Choreography: The Jazz Improvisation Pattern

Each service listens to events and reacts independently. No central coordinator.

1. OrderService publishes "OrderPlaced"
2. InventoryService listens, reserves items, publishes "ItemsReserved"
3. PaymentService listens to "ItemsReserved", charges card, publishes "PaymentProcessed"
4. ShippingService listens to "PaymentProcessed", schedules pickup, publishes "ShipmentScheduled"
5. NotificationService listens to "ShipmentScheduled", sends confirmation

Each service knows its own business rules.
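With the event-driven services from earlier, the chain is just a set of subscriptions. The in-process bus below is a deliberately simplified illustration that routes by eventType rather than by topic, and the service instances are assumed to exist; real systems would use a broker such as Kafka or RabbitMQ.

from collections import defaultdict

class InMemoryEventBus:
    """Toy bus for illustration only: routes each event to its subscribers."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event):
        for handler in self.handlers[event["eventType"]]:
            handler(event)

# Wiring: no service calls another directly; the workflow emerges from subscriptions.
bus = InMemoryEventBus()
bus.subscribe("OrderPlaced", inventory_service.on_order_placed)
bus.subscribe("OrderPlaced", billing_service.on_order_placed)
bus.subscribe("ItemsReserved", payment_service.on_items_reserved)
bus.subscribe("PaymentProcessed", shipping_service.on_payment_processed)
bus.subscribe("ShipmentScheduled", notification_service.on_shipment_scheduled)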

Advantages:

  • Services are truly decoupled
  • Each service can evolve independently
  • No single point of failure (except message broker)

Disadvantages:

  • Workflow logic is distributed — harder to see the big picture
  • Debugging is complex (distributed tracing essential)
  • Circular dependencies possible if not careful

Practical advice: Use orchestration for critical business workflows (orders, payments) where you need clear control and visibility. Use choreography for loosely coupled events like notifications and analytics.

The Transactional Outbox Pattern

Here’s a common problem: you update your database and want to publish an event, but the system crashes between the two operations. Now you’ve lost the event.

The solution is the transactional outbox pattern:

class OrderService:
    def __init__(self, db, repository, outbox):
        self.db = db
        self.repository = repository
        self.outbox = outbox

    def place_order(self, order: Dict[str, Any]) -> str:
        # The order save and the outbox insert happen in ONE database transaction
        with self.db.transaction():
            order_id = self.repository.save(order)

            # Same transaction! Event is durably stored
            self.outbox.insert({
                "eventId": uuid.uuid4().hex,
                "eventType": "OrderPlaced",
                "aggregateId": order_id,
                "payload": json.dumps({
                    "orderId": order_id,
                    "userId": order["userId"],
                    "items": order["items"],
                    "totalAmount": order["totalAmount"]
                }),
                "createdAt": datetime.utcnow(),
                "published": False
            })

        return order_id

import logging

logger = logging.getLogger(__name__)

# Separate process polls the outbox and publishes events
class OutboxPoller:
    def __init__(self, db, event_bus):
        self.db = db
        self.event_bus = event_bus

    def poll(self):
        unpublished = self.db.query(
            "SELECT * FROM outbox WHERE published = FALSE LIMIT 100"
        )

        for event in unpublished:
            try:
                self.event_bus.publish(event.eventType, event.payload)
                self.db.execute(
                    "UPDATE outbox SET published = TRUE WHERE eventId = ?",
                    event.eventId
                )
            except Exception as e:
                # Event will be retried on the next poll, so delivery is
                # at-least-once; consumers must tolerate duplicates.
                logger.error(f"Failed to publish event: {e}")

The outbox guarantees every database change produces a corresponding event — eventually. This is the foundation of reliable event-driven systems.

Service Mesh: Managing Communication Transparently

As your microservices grow, you’re adding the same concerns to every service: retries, timeouts, circuit breakers, tracing, mTLS encryption. This is tedious and error-prone.

A service mesh (Istio, Linkerd) runs a sidecar proxy alongside each service. The proxy handles communication concerns transparently, without changing application code.

# Istio VirtualService configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
  - order-service
  http:
  - match:
    - uri:
        prefix: "/api/orders"
    route:
    - destination:
        host: order-service
        port:
          number: 8080
    timeout: 5s
    retries:
      attempts: 3
      perTryTimeout: 2s
    fault:
      delay:
        percentage:
          value: 10.0
        fixedDelay: 1s

The proxy handles retries, timeouts, fault injection (for testing), load balancing, and mutual TLS — all without touching your application code. This is powerful, but adds operational complexity. Avoid it early; adopt it when you have enough services that the overhead becomes worth it.

A Decision Matrix for Communication Patterns

|              | REST Sync                | gRPC Sync                  | Commands                  | Events                       |
|--------------|--------------------------|----------------------------|---------------------------|------------------------------|
| Best for     | Queries, simple commands | Performance-critical reads | One-time actions          | Notifications, state changes |
| Latency      | Medium                   | Low                        | Not dependent on receiver | Not dependent on receiver    |
| Coupling     | High                     | High                       | Medium                    | Low                          |
| Consistency  | Strong                   | Strong                     | Eventual                  | Eventual                     |
| Debugging    | Easy                     | Medium                     | Medium                    | Hard                         |
| Failure mode | Cascading failures       | Cascading failures         | Isolated                  | Isolated                     |
| Example      | Fetching user profile    | Aggregating service data   | Send password reset email | Order placed event           |

Key Takeaways

  • Synchronous communication is simple but couples services: REST and gRPC are familiar and straightforward, but they create temporal coupling. When one service is slow, all callers are slow. Use circuit breakers to prevent cascades.

  • Asynchronous communication decouples but adds complexity: Commands and events free you from waiting, but you must embrace eventual consistency, and distributed tracing becomes essential.

  • Events are more flexible than commands: Events describe what happened; multiple services react independently. Commands say “do this,” tightly coupling intent to action.

  • Choreography scales better than orchestration: As the number of services grows, orchestration becomes a bottleneck. Choreography distributes workflow logic, but demands careful observability.

  • The transactional outbox pattern ensures reliability: Publish events durably alongside database changes, eliminating the risk of lost events.

  • Choose service mesh when operational complexity justifies it: Early adoption adds overhead; adopt it when you’re managing 10+ services with consistent communication concerns.

Practice Scenarios

Scenario 1: The Payment Timeout Crisis

Your payment service is experiencing intermittent slowness (5-10 second responses). Suddenly, your entire order system grinds to a halt — even simple operations time out. Users can’t create orders at all.

What went wrong? How would you fix it with the patterns we discussed? What’s the minimal change to prevent this?

Scenario 2: Event-Driven Reliability

You’re building a checkout flow: OrderService → InventoryService → PaymentService → ShippingService. You want to use an event-driven approach so each service is independent.

Design this workflow. What events are published at each stage? What happens if the PaymentService crashes after items are reserved but before payment is charged? How do you recover?

Scenario 3: Choosing Your Communication Strategy

You’re building an e-commerce platform. Your first five services are:

  1. UserService — manages profiles and authentication
  2. ProductService — product catalog
  3. OrderService — order management
  4. InventoryService — stock levels
  5. PaymentService — processes charges

For each interaction below, decide: sync (REST), sync (gRPC), async command, or async event? Justify your choice.

  • UserService querying ProductService for recommended items to show on homepage
  • OrderService reserving inventory during checkout
  • OrderService charging the credit card
  • InventoryService notifying AdminService when stock is low
  • ProductService requesting user reviews from ReviewService

In the next section, we’ll explore how services discover each other — how does OrderService know where InventoryService is running? This is the service discovery problem, and it’s surprisingly complex in dynamic cloud environments where services scale up and down constantly.