System Design Fundamentals

The Saga Pattern

Introduction

Imagine you’re building an e-commerce platform with microservices. A customer places an order that needs to flow through multiple independent services: Order Service (create the order), Payment Service (charge the customer), Inventory Service (reserve stock), and Shipping Service (schedule delivery). Each service owns its own database. Each step must succeed.

But here’s the problem: what happens when the Inventory Service cannot reserve stock? The order is already created and the customer is already charged, so both actions must be undone. In a monolithic system with a single database, you’d wrap everything in one transaction and roll it back. In microservices, no distributed transaction mechanism is both fast and practical.

Two-Phase Commit (2PC) and Three-Phase Commit (3PC) protocols exist, but they fit poorly here. In 2PC, a coordinator asks every participant to prepare and hold its locks until the coordinator announces the outcome. That guarantees atomicity, but at the cost of availability: if the coordinator fails after participants have prepared, their resources stay locked until it recovers, and a network partition can leave participants blocked. 3PC reduces blocking by adding a phase, which costs extra latency and complexity, and it still does not behave safely under network partitions. For microservices spanning multiple databases, multiple deployments, and potentially multiple cloud providers, 2PC and 3PC are too blocking, too slow, and too fragile.

Enter the Saga Pattern: a way to execute a sequence of distributed transactions where each step is a local transaction within a single service, and if any step fails, compensating transactions undo the previously completed steps. No distributed locks. No coordinator holding everything. Each service commits independently and immediately. Instead of ACID atomicity, we achieve eventual consistency through coordinated compensating transactions.

What Is a Saga?

A Saga is a sequence of local transactions, each confined to a single microservice, coordinated to achieve a distributed business transaction. Formally:

  • Local Transaction: A transaction within a single service (single database, single resource manager).
  • Compensating Transaction: A new local transaction that logically reverses a previous local transaction. It’s not a rollback; it’s a new action with business meaning.
  • Coordination: The mechanism by which we decide which local transaction to execute next and when to trigger compensating transactions.

Key insight: In a Saga, each step commits immediately. There’s no “prepare phase” waiting for all participants. If step 3 fails, steps 1 and 2 have already committed and cannot be rolled back. Instead, we run compensating transactions for step 2 and then step 1 to undo them.
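
The commit-then-compensate behavior described above can be sketched as a small in-process runner. This is a minimal illustration, not a production coordinator: `SagaStep`, `SagaOutcome`, and `runSaga` are hypothetical names introduced here, state lives only in memory, and a failing compensation is not retried.

```typescript
// Hypothetical types for illustration; not taken from any library.
interface SagaStep {
  name: string;
  action: () => Promise<void>;     // local transaction; commits immediately
  compensate: () => Promise<void>; // new transaction that semantically undoes it
}

interface SagaOutcome {
  ok: boolean;
  compensated: string[]; // names of compensated steps, in the order they ran
}

// Runs the steps in order. If one fails, the steps that already committed
// are compensated in reverse order.
async function runSaga(steps: SagaStep[]): Promise<SagaOutcome> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      completed.push(step);
    } catch {
      const compensated: string[] = [];
      for (const done of [...completed].reverse()) {
        await done.compensate();
        compensated.push(done.name);
      }
      return { ok: false, compensated };
    }
  }
  return { ok: true, compensated: [] };
}
```

The failed step itself is not compensated, because its local transaction never committed.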

Choreography vs. Orchestration

There are two primary ways to coordinate a Saga:

Choreography-Based Saga: Services communicate via domain events. When one service completes a step, it publishes an event. Other services listen to these events and trigger their own steps. It’s decentralized—no explicit coordinator.

  • Service A completes → publishes “OrderCreated” event
  • Service B hears the event → executes its step, publishes “PaymentProcessed” event
  • Service C hears that event → executes its step, publishes “InventoryReserved” event
  • And so on…

If Service C fails, it publishes an “InventoryReservationFailed” event. Service B hears this, runs its compensating transaction, and publishes a “PaymentRefunded” event. Service A hears that event and runs its own compensating transaction.

Orchestration-Based Saga: A central Saga Orchestrator (a state machine or workflow engine) drives the flow. It sends commands to services, receives their responses, and decides the next step. It also maintains the saga state and owns all compensation logic.

  • Orchestrator sends “CreateOrder” command to Service A
  • Service A responds with result
  • Orchestrator decides next step based on result, sends “ProcessPayment” command to Service B
  • And so on…

If Service B fails, the orchestrator explicitly triggers compensating transactions in the correct order.

The Concept of Semantic Undo

An important distinction: compensating transactions don’t roll back database changes; they apply a new transaction that logically reverses the effect.

For example:

  • Forward transaction: Debit $100 from account
  • Compensating transaction: Credit $100 back to account

Both are real transactions that hit the database. The second doesn’t undo the first at the database level; it semantically cancels the business effect. This is crucial because:

  1. Compensating transactions can fail too (though we handle this with retry logic and monitoring).
  2. Compensating transactions are visible in audit logs—you see the debit and credit, not a mysterious rollback.
  3. Compensating transactions can have different semantics than simple reversal (e.g., you might credit back a different account in case of fraud).
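
The debit and credit example above can be sketched with an append-only ledger. This is an illustrative sketch only: the `Ledger` class, its method names, and the use of integer cents are assumptions made for the example.

```typescript
// Hypothetical append-only ledger; names are illustrative.
type EntryKind = 'DEBIT' | 'CREDIT';

interface LedgerEntry {
  kind: EntryKind;
  amountCents: number; // integer cents, to avoid floating-point drift
  reason: string;
}

class Ledger {
  private readonly entries: LedgerEntry[] = [];

  // Forward transaction.
  debit(amountCents: number, reason: string): void {
    this.entries.push({ kind: 'DEBIT', amountCents, reason });
  }

  // Compensating transaction: a new entry, not a deletion of the debit.
  credit(amountCents: number, reason: string): void {
    this.entries.push({ kind: 'CREDIT', amountCents, reason });
  }

  // Net change to the account across all entries.
  netChangeCents(): number {
    return this.entries.reduce(
      (sum, e) => sum + (e.kind === 'CREDIT' ? e.amountCents : -e.amountCents),
      0
    );
  }

  history(): readonly LedgerEntry[] {
    return this.entries;
  }
}

const ledger = new Ledger();
ledger.debit(10_000, 'order-42 payment');        // debit $100
ledger.credit(10_000, 'order-42 compensation');  // credit $100 back
```

After both calls the net change is zero, yet the history still holds two entries, which is the audit visibility that a database rollback would not give.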

A Practical Analogy

You’re planning a multi-destination vacation: flights, hotels, and activities. You book each independently with separate payment confirmations and cancellation policies.

Choreography approach: You maintain a shared notes document. When you book a flight, you add a note. Your travel buddy sees this and books a hotel, adds a note. Another friend sees the hotel note and books activities, adds a note. If the activities are fully booked (step 3 fails), everyone sees the failed note and cancels their bookings in reverse order. No single person is coordinating; everyone reacts to the shared bulletin board.

Orchestration approach: You hire a travel agent. You tell the agent your requirements. The agent books the flight, informs you, waits for your confirmation, then books the hotel, then books activities. If any step fails, the agent automatically cancels previous bookings in the correct order. Single point of control, clear understanding of where things are in the process.

Semantic undo: Canceling a flight isn’t a database rollback of the original booking. It’s a new cancellation transaction, and both the booking and the cancellation appear in your records. The undo can also differ from an exact reversal: instead of a refund to the original credit card, you might receive a credit toward future travel.


Choreography-Based Sagas

In choreography, services are loosely coupled. Each service publishes domain events, and other services subscribe and react.

Architecture

┌─────────────────┐          ┌──────────────────┐
│  Order Service  │          │ Payment Service  │
├─────────────────┤          ├──────────────────┤
│ Event Bus:      │─────────>│ Subscribers:     │
│  OrderCreated   │          │  OrderCreated    │
└─────────────────┘          └──────────────────┘
         ^                            │
         │                            v
         │                    ProcessPayment
         │                            │
         │                    ┌───────┴────────┐
         │                    v                v
         │             PaymentProcessed  PaymentFailed
         │                    │
         │                    v
         │          ┌─────────────────────┐
         │          │ Inventory Service   │
         │          └─────────────────────┘

         └────────────────────────────────┘
              (Compensating Events)
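
The choreography examples in this section call `eventBus.subscribe(EventClass, handler)` and `eventBus.publish(event)` without defining the bus. A minimal in-memory version with that shape might look like the sketch below. It is an assumption made for illustration: a real deployment would use a message broker, and this version offers no durability, retries, or cross-process delivery.

```typescript
// Hypothetical in-memory event bus keyed by event class.
type EventClass<T> = new (...args: any[]) => T;
type Handler<T> = (event: T) => Promise<void> | void;

class EventBus {
  private readonly handlers = new Map<Function, Handler<any>[]>();

  // Registers a handler for every event constructed from the given class.
  subscribe<T>(type: EventClass<T>, handler: Handler<T>): void {
    const list = this.handlers.get(type) ?? [];
    list.push(handler);
    this.handlers.set(type, list);
  }

  // Delivers the event to each handler registered for its class, in
  // subscription order, awaiting one handler before starting the next.
  async publish<T extends object>(event: T): Promise<void> {
    const list = this.handlers.get(event.constructor) ?? [];
    for (const handler of list) {
      await handler(event);
    }
  }
}

// Example event class, simplified to carry only the order ID.
class OrderCreatedEvent {
  constructor(public readonly orderId: string) {}
}
```

Because `publish` awaits each handler, a chain of events runs to completion inside the first `publish` call. A broker-backed bus would instead return as soon as the message is accepted.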

Example: Order Processing with Choreography

Step 1: Order Service receives order request

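class OrderService {
  // `db` and `eventBus` are module-level dependencies assumed by these examples
  async createOrder(orderId: string, items: OrderItem[]): Promise<void> {
    // Create order in database (a local transaction that commits immediately)
    await db.orders.insert({
      id: orderId,
      status: 'PENDING',
      items: items,
      createdAt: new Date()
    });

    // Publish event. The insert and the publish are separate writes, so a
    // crash between them would lose the event; the Transactional Outbox
    // pattern addresses this.
    await eventBus.publish(new OrderCreatedEvent(orderId, items));
  }
}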

Step 2: Payment Service subscribes to OrderCreated

class PaymentService {
  constructor(eventBus: EventBus) {
    eventBus.subscribe(OrderCreatedEvent, this.onOrderCreated.bind(this));
  }

  async onOrderCreated(event: OrderCreatedEvent): Promise<void> {
    const amount = calculateTotal(event.items);
    await db.payments.insert({
      orderId: event.orderId,
      amount: amount,
      status: 'PROCESSING'
    });

    // Only a failed charge counts as a payment failure. The bookkeeping
    // below stays outside this try block so that a database error after a
    // successful charge is never reported as PaymentFailed.
    try {
      await paymentGateway.charge(event.orderId, amount);
    } catch (error) {
      await db.payments.update(event.orderId, { status: 'FAILED' });
      await eventBus.publish(new PaymentFailedEvent(event.orderId));
      return;
    }

    await db.payments.update(event.orderId, { status: 'COMPLETED' });
    await eventBus.publish(new PaymentProcessedEvent(event.orderId));
  }
}

Step 3: Inventory Service subscribes to PaymentProcessed

class InventoryService {
  constructor(eventBus: EventBus) {
    eventBus.subscribe(PaymentProcessedEvent, this.onPaymentProcessed.bind(this));
    eventBus.subscribe(PaymentFailedEvent, this.onPaymentFailed.bind(this));
  }

  async onPaymentProcessed(event: PaymentProcessedEvent): Promise<void> {
    // Fetch order details to know what to reserve
    const order = await orderService.getOrder(event.orderId);
    // Track successful reservations so only those are released on failure
    const reservedItems: OrderItem[] = [];

    try {
      for (const item of order.items) {
        const reserved = await db.inventory.reserve(item.productId, item.quantity);
        if (!reserved) {
          throw new Error(`Cannot reserve ${item.quantity} units of ${item.productId}`);
        }
        reservedItems.push(item);
      }
      await eventBus.publish(new InventoryReservedEvent(event.orderId));
    } catch (error) {
      // Compensate locally: release only what was actually reserved.
      // Releasing every item would free stock this order never held.
      for (const item of reservedItems) {
        await db.inventory.release(item.productId, item.quantity);
      }
      // Trigger compensation in Payment Service
      await eventBus.publish(new InventoryReservationFailedEvent(event.orderId));
    }
  }

  async onPaymentFailed(event: PaymentFailedEvent): Promise<void> {
    // Nothing to compensate here; payment was never charged
  }
}

Step 4: Payment Service handles compensation

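// Additional handler on the PaymentService from Step 2, shown separately for clarity
class PaymentService {
  constructor(eventBus: EventBus) {
    eventBus.subscribe(InventoryReservationFailedEvent, this.onInventoryFailed.bind(this));
  }

  async onInventoryFailed(event: InventoryReservationFailedEvent): Promise<void> {
    const payment = await db.payments.get(event.orderId);

    // Events can be delivered more than once; never refund twice
    if (payment.status !== 'REFUNDED') {
      // Refund the payment
      await paymentGateway.refund(event.orderId, payment.amount);
      await db.payments.update(event.orderId, { status: 'REFUNDED' });
    }

    // Publishing again on a redelivery is safe: cancelling an order that is
    // already cancelled has no further effect
    await eventBus.publish(new PaymentRefundedEvent(event.orderId));
  }
}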

Step 5: Order Service handles compensation

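// Additional handlers on the OrderService from Step 1
class OrderService {
  constructor(eventBus: EventBus) {
    // The order must be cancelled in both failure paths: the payment was
    // refunded (inventory failed) or the payment never succeeded.
    eventBus.subscribe(PaymentRefundedEvent, this.onPaymentRefunded.bind(this));
    eventBus.subscribe(PaymentFailedEvent, this.onPaymentFailed.bind(this));
  }

  async onPaymentRefunded(event: PaymentRefundedEvent): Promise<void> {
    await this.cancelOrder(event.orderId);
  }

  async onPaymentFailed(event: PaymentFailedEvent): Promise<void> {
    await this.cancelOrder(event.orderId);
  }

  private async cancelOrder(orderId: string): Promise<void> {
    // Cancel the order
    await db.orders.update(orderId, { status: 'CANCELLED' });
    await eventBus.publish(new OrderCancelledEvent(orderId));
  }
}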

Choreography Pros and Cons

Aspect              | Choreography
--------------------+------------------------------------------------------------
Coupling            | Loosely coupled; services publish events, others subscribe
Scalability         | Easily scales; add new services by subscribing to events
Debugging           | Difficult; the flow is implicit, spread across many services
Cyclic Dependencies | Prone to cyclic event dependencies (e.g., A publishes event → B reacts → publishes event → A reacts)
Observability       | Hard to track saga progress; need distributed tracing tools
Testing             | Complex; requires mocking event bus and handling async flows
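
The cyclic-dependency risk listed above can be checked before deployment if each handler declares which event it consumes and which events it may publish. The sketch below assumes such declarations exist; `HandlerDecl` and `findEventCycle` are hypothetical names, and the check is a plain depth-first search over the resulting event graph.

```typescript
// Hypothetical declaration of one event handler.
interface HandlerDecl {
  service: string;
  on: string;      // event name the handler consumes
  emits: string[]; // event names the handler may publish
}

// Returns one cycle as a list of event names (first equals last),
// or null when the event graph is acyclic.
function findEventCycle(handlers: HandlerDecl[]): string[] | null {
  // Edge X -> Y means "handling X may publish Y".
  const graph = new Map<string, string[]>();
  for (const h of handlers) {
    graph.set(h.on, [...(graph.get(h.on) ?? []), ...h.emits]);
  }

  const visiting = new Set<string>(); // nodes on the current DFS path
  const done = new Set<string>();     // nodes fully explored, known cycle-free
  const path: string[] = [];

  const dfs = (node: string): string[] | null => {
    if (visiting.has(node)) {
      return [...path.slice(path.indexOf(node)), node];
    }
    if (done.has(node)) return null;
    visiting.add(node);
    path.push(node);
    for (const next of graph.get(node) ?? []) {
      const cycle = dfs(next);
      if (cycle) return cycle;
    }
    path.pop();
    visiting.delete(node);
    done.add(node);
    return null;
  };

  for (const node of Array.from(graph.keys())) {
    const cycle = dfs(node);
    if (cycle) return cycle;
  }
  return null;
}
```

A check like this could run in continuous integration against the declared handlers. It only sees declared edges, so events published without a declaration would go unnoticed.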

Orchestration-Based Sagas

In orchestration, a central Saga Orchestrator drives the flow. It’s a state machine that knows the happy path and failure paths.

Architecture

┌────────────────────────────────┐
│   Saga Orchestrator            │
│  (State Machine)               │
├────────────────────────────────┤
│ States: PENDING, PAYMENT_DONE, │
│ INVENTORY_DONE, SHIPPING_DONE, │
│ COMPENSATING, FAILED           │
└────────┬───────────────────────┘
         │ Command
         v
    ┌────────────────┬──────────────┬──────────────┐
    v                v              v              v
Order Service   Payment Service  Inventory Srv  Shipping Srv

Example: Using Temporal.io

Temporal.io is a workflow orchestration engine that makes saga implementation natural:

// --- workflows.ts: define the saga workflow ---
import * as wf from '@temporalio/workflow';

// Forward steps and compensations are all activities: workflow code must stay
// deterministic and cannot call other services directly. The types OrderItem,
// OrderRecord, PaymentResult, ShippingResult, and OrderSagaResult, and the
// pure helper calculateTotal, are assumed to be defined elsewhere.
const {
  createOrder, processPayment, reserveInventory, scheduleShipping,
  cancelOrder, refundPayment, releaseInventory
} = wf.proxyActivities<{
  createOrder(orderId: string, items: OrderItem[]): Promise<OrderRecord>;
  processPayment(orderId: string, amount: number): Promise<PaymentResult>;
  reserveInventory(orderId: string, items: OrderItem[]): Promise<void>;
  scheduleShipping(orderId: string): Promise<ShippingResult>;
  cancelOrder(orderId: string): Promise<void>;
  refundPayment(orderId: string, amount: number): Promise<void>;
  releaseInventory(orderId: string, items: OrderItem[]): Promise<void>;
}>({
  startToCloseTimeout: '1 minute',
  retry: { maximumAttempts: 3 }
});

export async function orderSaga(orderId: string, items: OrderItem[]): Promise<OrderSagaResult> {
  const compensations: (() => Promise<void>)[] = [];

  try {
    // Step 1: Create order
    const order = await createOrder(orderId, items);
    compensations.push(() => cancelOrder(orderId));

    // Step 2: Process payment
    const amount = calculateTotal(items);
    const paymentResult = await processPayment(orderId, amount);
    compensations.push(() => refundPayment(orderId, amount));

    // Step 3: Reserve inventory
    await reserveInventory(orderId, items);
    compensations.push(() => releaseInventory(orderId, items));

    // Step 4: Schedule shipping (last step, so it needs no compensation)
    const shippingResult = await scheduleShipping(orderId);

    return { success: true, order, paymentResult, shippingResult };
  } catch (error) {
    // Compensate in reverse order
    for (const compensation of compensations.reverse()) {
      try {
        await compensation();
      } catch (compError) {
        // Log and continue so the remaining compensations still run;
        // a failed compensation needs alerting and manual follow-up
        wf.log.error('Compensation failed', { orderId, error: String(compError) });
      }
    }

    // Throw an ApplicationFailure so the workflow execution fails; throwing
    // a plain Error would fail the workflow task and cause it to be retried
    const reason = error instanceof Error ? error.message : String(error);
    throw wf.ApplicationFailure.nonRetryable(`Order saga failed for ${orderId}: ${reason}`);
  }
}

// --- activities.ts: compensating activities, next to the forward steps ---
export async function cancelOrder(orderId: string): Promise<void> {
  // Call Order Service to cancel
}

export async function refundPayment(orderId: string, amount: number): Promise<void> {
  // Call Payment Service to refund
}

export async function releaseInventory(orderId: string, items: OrderItem[]): Promise<void> {
  // Call Inventory Service to release
}

Example: Manual State Machine Orchestrator

If you’re not using a workflow engine, you can implement an orchestrator as a state machine:

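class OrderSagaOrchestrator {
  // In-memory state is lost if the process crashes; persist it in production
  private sagaState: Map<string, SagaState> = new Map();

  // Service clients are injected. The client type names are assumptions
  // made for this example.
  constructor(
    private readonly orderService: OrderServiceClient,
    private readonly paymentService: PaymentServiceClient,
    private readonly inventoryService: InventoryServiceClient,
    private readonly shippingService: ShippingServiceClient
  ) {}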

  async execute(orderId: string, items: OrderItem[]): Promise<void> {
    this.sagaState.set(orderId, {
      id: orderId,
      status: 'PENDING',
      items: items,
      createdAt: new Date(),
      completedSteps: []
    });

    try {
      await this.step1_CreateOrder(orderId);
      await this.step2_ProcessPayment(orderId);
      await this.step3_ReserveInventory(orderId);
      await this.step4_ScheduleShipping(orderId);

      this.sagaState.get(orderId)!.status = 'COMPLETED';
    } catch (error) {
      await this.compensate(orderId);
      throw error;
    }
  }

  private async step1_CreateOrder(orderId: string): Promise<void> {
    const state = this.sagaState.get(orderId)!;
    try {
      await this.orderService.createOrder(orderId, state.items);
      state.completedSteps.push('CREATE_ORDER');
      state.status = 'ORDER_CREATED';
    } catch (error) {
      throw new StepFailedError('CREATE_ORDER', error);
    }
  }

  private async step2_ProcessPayment(orderId: string): Promise<void> {
    const state = this.sagaState.get(orderId)!;
    const amount = calculateTotal(state.items);
    try {
      await this.paymentService.processPayment(orderId, amount);
      state.completedSteps.push('PROCESS_PAYMENT');
      state.status = 'PAYMENT_PROCESSED';
    } catch (error) {
      throw new StepFailedError('PROCESS_PAYMENT', error);
    }
  }

  private async step3_ReserveInventory(orderId: string): Promise<void> {
    const state = this.sagaState.get(orderId)!;
    try {
      await this.inventoryService.reserve(orderId, state.items);
      state.completedSteps.push('RESERVE_INVENTORY');
      state.status = 'INVENTORY_RESERVED';
    } catch (error) {
      throw new StepFailedError('RESERVE_INVENTORY', error);
    }
  }

  private async step4_ScheduleShipping(orderId: string): Promise<void> {
    const state = this.sagaState.get(orderId)!;
    try {
      await this.shippingService.scheduleShipping(orderId);
      state.completedSteps.push('SCHEDULE_SHIPPING');
      state.status = 'SHIPPING_SCHEDULED';
    } catch (error) {
      throw new StepFailedError('SCHEDULE_SHIPPING', error);
    }
  }

  private async compensate(orderId: string): Promise<void> {
    const state = this.sagaState.get(orderId)!;
    state.status = 'COMPENSATING';
    let allCompensated = true;

    // Compensate in reverse order. Copy first: reverse() mutates the array
    // in place and would corrupt the recorded step history.
    const reversedSteps = [...state.completedSteps].reverse();

    for (const step of reversedSteps) {
      try {
        switch (step) {
          case 'SCHEDULE_SHIPPING':
            await this.shippingService.cancelShipping(orderId);
            break;
          case 'RESERVE_INVENTORY':
            await this.inventoryService.release(orderId);
            break;
          case 'PROCESS_PAYMENT':
            await this.paymentService.refund(orderId);
            break;
          case 'CREATE_ORDER':
            await this.orderService.cancel(orderId);
            break;
        }
      } catch (error) {
        // Log and continue so the remaining steps are still compensated
        allCompensated = false;
        console.error(`Compensation failed for step ${step}:`, error);
      }
    }

    // Never report success if a compensation failed; FAILED marks a saga
    // that needs a retry or manual intervention
    state.status = allCompensated ? 'COMPENSATED' : 'FAILED';
  }
}

Orchestration Pros and Cons

Aspect                  | Orchestration
------------------------+-------------------------------------------------------------------
Coupling                | Tightly coupled to orchestrator; services are command-driven
Flow Clarity            | Clear; the entire flow is visible in one place (orchestrator logic)
Debugging               | Easier; you can inspect orchestrator state and understand the flow
Cyclic Dependencies     | Avoided; orchestrator controls the flow unidirectionally
Observability           | Good; orchestrator state is a single source of truth
Single Point of Failure | Orchestrator becomes critical; must be highly available
Testing                 | Simpler; mock services and test orchestrator logic

Critical Challenges in Sagas

Idempotency

Each saga step must be idempotent. If a step is retried due to a transient failure or network timeout, it must produce the same result as the first attempt.

Problem: A payment service receives “ProcessPayment” command. It charges the customer. But the response is lost in the network. The orchestrator retries. The payment service charges again.

Solution: Each service must check if the step was already completed before executing it again.

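async function processPayment(orderId: string, amount: number): Promise<PaymentResult> {
  // Check if already processed; return the stored result, not the whole record
  const existingPayment = await db.payments.findByOrderId(orderId);
  if (existingPayment) {
    return existingPayment.result;
  }

  // Process payment. The order ID doubles as an idempotency key, so a retry
  // after a crash between the charge and the insert below does not charge
  // twice. This assumes the gateway deduplicates charges by that key.
  const result = await paymentGateway.charge(orderId, amount);

  // Store result. A unique constraint on orderId (assumed) makes a concurrent
  // duplicate insert fail instead of recording two payments.
  await db.payments.insert({ orderId, amount, result });

  return result;
}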
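
The same check-before-execute idea can be written as a reusable wrapper. The sketch below is an in-memory illustration under stated assumptions: `makeIdempotent` is a hypothetical helper, the cache lives in process memory, and a real service would keep the key and result in its database so that deduplication survives restarts.

```typescript
// Wraps an async operation so that calls sharing a key run it at most once.
// Concurrent duplicates share the in-flight promise; a failed attempt is
// forgotten so that a later retry can run the operation again.
function makeIdempotent<A extends unknown[], R>(
  keyOf: (...args: A) => string,
  operation: (...args: A) => Promise<R>
): (...args: A) => Promise<R> {
  const results = new Map<string, Promise<R>>();

  return (...args: A): Promise<R> => {
    const key = keyOf(...args);
    const existing = results.get(key);
    if (existing) {
      return existing; // already done or in flight: reuse the result
    }
    const attempt = operation(...args).catch((err: unknown) => {
      results.delete(key); // allow a retry after a failure
      throw err;
    });
    results.set(key, attempt);
    return attempt;
  };
}
```

Storing the promise rather than the resolved value is what covers the race between two simultaneous retries: the second caller finds the entry before the first call has finished.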

The ACD Guarantee (Not ACID)

Sagas don’t provide full ACID guarantees. The property they give up is isolation, which leaves ACD, and even atomicity holds only in a relaxed form:

  • Atomicity: Relaxed. Either all steps complete or all are compensated, but intermediate states are visible.
  • Consistency: Eventually achieved through compensating transactions.
  • Isolation: Not guaranteed. Concurrent sagas might see partial results.
  • Durability: Yes, each step is durable once committed.

Dirty reads and lost updates: If Saga A reads inventory count (100 units available) and Saga B reads the same count, both might think they can reserve, leading to overselling.

Mitigation: Semantic locking. The Inventory Service increments a “reserved” counter when a saga reserves inventory, decrementing only when compensated or when the saga completes. Clients read available = total - reserved to see a realistic count.

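async function reserveInventory(orderId: string, items: OrderItem[]): Promise<void> {
  for (const item of items) {
    // Reserve only while total - reserved >= quantity. `incrementIfAvailable`
    // is an assumed helper that checks and updates atomically (for example,
    // one conditional UPDATE); an unconditional increment could still oversell.
    const ok = await db.inventory.incrementIfAvailable(item.productId, { reserved: item.quantity });
    if (!ok) {
      // Items reserved earlier in this loop stay reserved until the saga's
      // compensation releases them
      throw new Error(`Insufficient stock for ${item.productId}`);
    }

    // Record the reservation for potential compensation
    await db.reservations.insert({
      orderId,
      productId: item.productId,
      quantity: item.quantity,
      status: 'RESERVED'
    });
  }
}

async function getRealTimeAvailability(productId: string): Promise<number> {
  // Read from the same record that reserveInventory updates
  const product = await db.inventory.get(productId);
  return product.total - product.reserved;
}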
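
The counter arithmetic behind semantic locking can be followed in a small in-memory model. This is a sketch under simplifying assumptions: `StockRecord` is a hypothetical class, and in single-threaded in-memory code the check and the increment are trivially atomic, whereas a database needs a conditional update or a row lock to get the same effect.

```typescript
// Hypothetical in-memory stock record for one product.
class StockRecord {
  private reserved = 0;

  constructor(private total: number) {}

  // What other sagas should see: total minus units held by in-flight sagas.
  get available(): number {
    return this.total - this.reserved;
  }

  // Semantic lock: succeeds only if enough unreserved stock remains.
  reserve(quantity: number): boolean {
    if (quantity > this.available) {
      return false;
    }
    this.reserved += quantity;
    return true;
  }

  // Compensation: the saga failed, so the units return to the pool.
  release(quantity: number): void {
    this.reserved = Math.max(0, this.reserved - quantity);
  }

  // Saga completed: the units leave the warehouse for good.
  commit(quantity: number): void {
    this.reserved -= quantity;
    this.total -= quantity;
  }
}

const stock = new StockRecord(100);
const sagaA = stock.reserve(60); // succeeds, 40 units remain available
const sagaB = stock.reserve(60); // rejected, only 40 units are unreserved
```

Saga B is rejected up front instead of discovering the shortage after both sagas have committed, which is the overselling case described above.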

Observability and Tracing

Sagas span multiple services. Debugging a failed saga requires correlating logs and traces across the entire system.

Solution: Use distributed tracing with a consistent correlation ID.

import { randomUUID } from 'node:crypto';

// Generate the ID once, where the saga starts
const correlationId = randomUUID();

// Pass through all service calls
await paymentService.processPayment(orderId, amount, { correlationId });
await inventoryService.reserve(orderId, items, { correlationId });

// Each service logs with correlation ID
logger.info('Processing payment', { correlationId, orderId, amount });
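
Within a single Node.js service, threading the ID through every function signature can be avoided with `AsyncLocalStorage` from `node:async_hooks`. The sketch below is one possible arrangement, not a prescribed one: `withCorrelation` and `logInfo` are hypothetical helpers, and the ID still has to be sent explicitly (for example, in a request header or event field) when a call crosses a service boundary.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Holds the correlation ID for the current asynchronous call chain.
const correlationStore = new AsyncLocalStorage<string>();

// Runs fn with a correlation ID visible to everything it awaits.
function withCorrelation<T>(fn: () => Promise<T>, id: string = randomUUID()): Promise<T> {
  return correlationStore.run(id, fn);
}

// Logs one JSON line that always carries the current correlation ID,
// or 'none' when called outside withCorrelation. Returns the line.
function logInfo(message: string, fields: Record<string, unknown> = {}): string {
  const line = JSON.stringify({
    correlationId: correlationStore.getStore() ?? 'none',
    message,
    ...fields,
  });
  console.log(line);
  return line;
}
```

A request handler would wrap its work in `withCorrelation`, passing the ID received from the caller when there is one, so that every log line written during that request can be joined across services.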

Partial Failure Visibility

If a saga fails partway through, or is simply still in progress, the user can observe an inconsistent state. For example, the customer’s payment has been charged, but the order has not shipped.

Solution: Implement a saga status API that users (or their clients) can poll to understand the current state. Show clear messaging: “Your order is being processed. We’ve charged your payment, but we’re still reserving inventory…”
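
The response body of such a status endpoint could be produced by a pure mapping from saga state to a user-facing view. The sketch below reuses the status names from the manual orchestrator example; the `SagaStatusView` shape, the `describeSaga` function, and the message wording are assumptions made for illustration.

```typescript
type SagaStatus =
  | 'PENDING'
  | 'ORDER_CREATED'
  | 'PAYMENT_PROCESSED'
  | 'INVENTORY_RESERVED'
  | 'COMPLETED'
  | 'COMPENSATING'
  | 'COMPENSATED';

interface SagaStatusView {
  status: SagaStatus;
  message: string;
  final: boolean; // true when the client can stop polling
}

// Maps internal saga state to what a polling client should display.
function describeSaga(status: SagaStatus): SagaStatusView {
  switch (status) {
    case 'PENDING':
    case 'ORDER_CREATED':
      return { status, message: 'Your order is being processed.', final: false };
    case 'PAYMENT_PROCESSED':
      return {
        status,
        message: 'We have charged your payment and are reserving inventory.',
        final: false,
      };
    case 'INVENTORY_RESERVED':
      return { status, message: 'Your items are reserved. Scheduling shipping.', final: false };
    case 'COMPLETED':
      return { status, message: 'Your order is confirmed.', final: true };
    case 'COMPENSATING':
      return {
        status,
        message: 'We could not complete your order and are reversing any charges.',
        final: false,
      };
    case 'COMPENSATED':
      return {
        status,
        message: 'Your order was cancelled and any charge has been reversed.',
        final: true,
      };
  }
}
```

The `final` flag tells the client when polling can stop, and keeping the mapping free of I/O makes it straightforward to unit test.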


Choreography vs. Orchestration: Decision Matrix

Criterion         | Choreography                     | Orchestration
------------------+----------------------------------+---------------------------------
Saga Complexity   | Simple (2–3 steps)               | Complex (4+ steps)
Step Dependencies | Loose, independent               | Strict ordering required
Team Structure    | Multiple independent teams       | Centralized team or shared domain
Performance       | Lower latency (parallel-friendly) | Higher latency (sequential steps)
Failure Debugging | Hard                             | Easier
Scalability       | Scales naturally                 | Orchestrator becomes bottleneck
Technology Fit    | Event-driven architecture        | Workflow orchestration systems

When Sagas Are Overkill

Not every distributed operation needs a saga:

  • Simple read-only queries: No transactions needed.
  • Monolithic systems: Use traditional ACID transactions.
  • Synchronous RPC calls with tight coupling: If you’re already tightly coupled, 2PC might be acceptable for small, fast operations.
  • Operations with strict consistency requirements: Sagas offer eventual consistency. If you need strong consistency, consider redesigning your domain.

Key Takeaways

  • Sagas coordinate local transactions across microservices using compensating transactions, replacing slow and fragile 2PC protocols.
  • Two coordination styles exist: Choreography (event-driven, decentralized) and Orchestration (command-driven, centralized). Choose based on saga complexity and team structure.
  • Compensating transactions are new actions, not database rollbacks, giving you semantic flexibility and audit visibility.
  • Idempotency is mandatory for all saga steps; transient failures will cause retries, and your services must handle them gracefully.
  • Sagas offer ACD guarantees, not ACID: Eventually consistent, with visibility into partial states. Mitigate with semantic locking and polling APIs.
  • Observability is critical: Use distributed tracing with correlation IDs to debug failures across multiple services.

Practice Scenarios

Scenario 1: Hotel Booking Saga. You’re designing a hotel booking system where a customer reserves a room, pays a deposit, and requests amenities (late checkout, early check-in). The Room Service, Payment Service, and Amenities Service are independent. Design an orchestration-based saga. What happens if the Amenities Service fails? Should the entire saga be compensated, or is a partial success acceptable?

Scenario 2: Detecting Saga Cycles. In a choreography-based saga, you have OrderService (publishes OrderCreated), PaymentService (publishes PaymentProcessed), and InventoryService (publishes InventoryReserved). Later, a developer adds a new requirement: when InventoryReserved is published, NotificationService sends an email and, to complete an audit trail, publishes a NotificationSent event. OrderService listens to NotificationSent and… publishes OrderCreated again (to increment a counter). How would you detect and prevent this cycle?

Scenario 3: Idempotency with External Services. A Shipping Service sends shipment details to a third-party logistics provider via REST API. The provider’s API is idempotent (the same request always returns the same shipment ID), but the connection is unreliable. How would you ensure your saga step is idempotent even if you can’t rely on receiving the external service’s response?


Next Steps

Understanding sagas prepares us for compensating transactions in the next section, where we dive deeper into designing reversible operations and handling compensating transaction failures. We’ll explore patterns like the Saga Transaction Log, which persists saga state to enable recovery after crashes, and the Transactional Outbox pattern, which ensures events are published reliably even if the publishing service crashes.