Compensating Transactions
Introduction
Imagine you’re processing an order in our e-commerce system. The saga orchestrator coordinates five services: the order service commits the order, the payment service charges the card, the inventory service decrements stock, the notification service sends a confirmation email, and the fulfillment service initiates shipping. Everything succeeds until the fulfillment service fails. By then, the order has already been created, the card has been charged, inventory has been deducted, and the confirmation email has been sent.
Now comes the hard question: what does “undo” actually mean?
You might think: delete the order row. But that’s not real-world undo. The customer received the confirmation email. Other services have already processed the event. Downstream systems may have created their own data structures based on this order. Simply deleting a row is like pretending the transaction never happened—which is technically false and violates audit requirements.
Compensating transactions are the answer. Instead of trying to erase history, you move forward with new transactions that semantically reverse the effects of the previous ones. You don’t delete the order—you create a cancellation. You don’t remove the deducted inventory—you add it back. You don’t unsend the confirmation email—you send a cancellation notice. This is fundamentally different from a database rollback, and mastering this pattern is essential for building resilient distributed systems.
Understanding Compensating Transactions
What They Are (and Aren’t)
A compensating transaction is a forward-moving business action that neutralizes the logical effect of a previously committed transaction, even though that transaction remains permanently recorded in the system.
This is crucial: compensating transactions do not erase. They offset. Think of it as double-entry bookkeeping in accounting. When you need to reverse a financial entry, you don’t delete the original entry—you create a new, opposite entry with the same magnitude. Both entries remain in the ledger, but their net effect is zero.
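The bookkeeping analogy can be sketched as a small append-only ledger. This is a minimal illustration, not part of the order system described here: the `Ledger` class, its method names, and the account id are all hypothetical, and amounts are kept in integer cents to avoid floating-point drift.

```typescript
// Hypothetical append-only ledger: entries are never updated or deleted.
type LedgerEntry = {
  id: number;
  account: string;
  amountCents: number; // positive = credit, negative = debit
  reverses?: number;   // id of the entry this one offsets, if any
};

class Ledger {
  private entries: LedgerEntry[] = [];

  post(account: string, amountCents: number, reverses?: number): LedgerEntry {
    const entry: LedgerEntry = { id: this.entries.length + 1, account, amountCents, reverses };
    this.entries.push(entry);
    return entry;
  }

  // Compensation: post an opposite entry instead of removing the original.
  reverse(entryId: number): LedgerEntry {
    const original = this.entries.find((e) => e.id === entryId);
    if (!original) throw new Error(`No ledger entry with id ${entryId}`);
    return this.post(original.account, -original.amountCents, original.id);
  }

  balance(account: string): number {
    return this.entries
      .filter((e) => e.account === account)
      .reduce((sum, e) => sum + e.amountCents, 0);
  }

  history(): readonly LedgerEntry[] {
    return this.entries;
  }
}

const ledger = new Ledger();
const charge = ledger.post("customer-42", -2500); // debit 25.00
ledger.reverse(charge.id);                        // offsetting credit of 25.00
```

After the reversal the account balance nets to zero, yet `history()` still holds both entries, which is the property an audit trail relies on.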
Consider three operational models:
- Database Rollback (traditional ACID): “I’ll undo this change at the database level.” Example: Your transaction rolls back, the INSERT statement is reverted, the row never existed. Possible only within a single database and a single transaction.
- Compensating Transaction (distributed systems): “I’ll acknowledge this change happened, record it, and then take a new action that reverses its effect.” Example: The order exists forever, but we execute a cancellation transaction that offsets it.
- Tentative State (reservation pattern): “I won’t fully commit until I know the full transaction can succeed.” Example: We reserve inventory without deducting it, then confirm or release the reservation. We’ll explore this alternative in the trade-offs section.
Why Simple Rollback Fails
In distributed systems, you cannot simply roll back because:
- Independent Commits: Each service already persisted its change to its own database. There’s no global rollback mechanism.
- Side Effects Executed: Emails sent, webhooks fired, third-party API calls completed. These are irreversible without compensations.
- Event Published: The order-created event was published to the message broker. Downstream systems processed it. You can’t unpublish.
- Time Elapsed: Other transactions may have depended on the committed state. Rolling back violates causality.
- Audit Trail: Regulatory requirements demand an immutable record of what actually happened.
Compensating transactions acknowledge this reality: what’s done is done; we can only move forward by doing the opposite.
Compensability Spectrum
Not all operations are equally compensable. Understanding where your operations fall on this spectrum is critical for saga design:
| Operation Type | Compensability | Example | Compensation Strategy |
|---|---|---|---|
| Financial transfer | Fully compensable | Debit account A, credit account B | Debit account B, credit account A (reverse transfer) |
| Inventory decrement | Fully compensable | Decrease stock by 5 units | Increase stock by 5 units |
| Database insert | Fully compensable | Create order record | Create cancellation record (logical delete) |
| Email sent | Partially compensable | Send order confirmation | Send cancellation notice, note original email sent |
| API call to vendor | Partially compensable | Request quote from supplier | Cancel quote request, note it was requested |
| Physical shipment | Non-compensable | Ship goods via carrier | Initiate return/refund process, pay return shipping |
| Customer notification via SMS | Partially compensable | Send SMS alert | Send follow-up SMS (can’t unsend original) |
| Third-party review posted | Non-compensable | Customer posts review on external site | Manual intervention, escalation |
The compensability of an operation depends on whether you can semantically reverse its effect and whether it leaves a business-appropriate audit trail.
Compensation Ordering: The Stack Principle
Compensations must execute in reverse order of the original transactions. This is non-negotiable.
Why? Consider a payment saga: (1) deduct from the customer account, (2) deposit to the merchant account, (3) log the transaction. Suppose a later step fails and we compensate in forward order: we would refund the customer while the merchant deposit is still in place, so for a time the same funds appear in both accounts.
Reversing the order avoids this: (1) offset the log entry (safe), (2) reverse the merchant deposit, (3) refund the customer. Each compensation runs against the state its forward step left behind, and by the time step 3 completes, the system is back in a valid state.
In saga implementations, this is typically managed with a compensation stack: as each transaction succeeds, we push its compensation function onto a stack. If failure occurs, we pop the stack and execute compensations in LIFO order.
// Compensation stack pattern
class SagaOrchestrator {
  private compensationStack: Array<() => Promise<void>> = [];

  async executeStep(
    step: () => Promise<void>,
    compensation: () => Promise<void>
  ) {
    await step();
    // Register the compensation only after the step has succeeded
    this.compensationStack.push(compensation);
  }

  async compensate() {
    // Execute compensations in reverse (LIFO) order.
    // pop() returns undefined on an empty stack, which ends the loop.
    let compensation = this.compensationStack.pop();
    while (compensation) {
      await compensation();
      compensation = this.compensationStack.pop();
    }
  }
}
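To see the LIFO unwinding in action, the following self-contained variant runs three steps, fails on the third, and records what happens. The `Step` type, `runSaga`, `makeStep`, and `demo` are illustrative names, and the steps only write to a log rather than calling real services.

```typescript
type Step = { name: string; run: () => Promise<void>; undo: () => Promise<void> };

// Runs steps in order; on the first failure, unwinds completed steps in LIFO order.
async function runSaga(steps: Step[], log: string[]): Promise<boolean> {
  const compensationStack: Array<() => Promise<void>> = [];
  try {
    for (const step of steps) {
      await step.run();
      log.push(`done:${step.name}`);
      compensationStack.push(step.undo);
    }
    return true;
  } catch {
    let undo = compensationStack.pop();
    while (undo) {
      await undo();
      undo = compensationStack.pop();
    }
    return false;
  }
}

// Builds a fake step that optionally fails and logs its compensation.
function makeStep(name: string, log: string[], fail = false): Step {
  return {
    name,
    run: async () => {
      if (fail) throw new Error(`${name} failed`);
    },
    undo: async () => {
      log.push(`undo:${name}`);
    },
  };
}

async function demo(): Promise<string[]> {
  const log: string[] = [];
  await runSaga(
    [
      makeStep("createOrder", log),
      makeStep("chargePayment", log),
      makeStep("decrementInventory", log, true), // injected failure
    ],
    log
  );
  return log;
}
```

The resulting log is `done:createOrder`, `done:chargePayment`, `undo:chargePayment`, `undo:createOrder`: the failed step is never compensated because its compensation was never pushed, and the completed steps are undone newest first.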
A Pen vs. Pencil Analogy
Here’s an intuitive way to think about the difference:
Pencil (Database Rollback): You write something in pencil, then erase it completely. The eraser removes all evidence of what you wrote. By the time you’re done, it’s as if the pencil never touched the paper.
Pen (Compensating Transactions): You write something in pen. You can’t erase it. But if you want to correct it, you cross it out and write a correction note next to it. Both the original and the correction remain visible. An observer reading the whole page understands the intent: the original was written, then corrected.
Distributed systems are pen-based. Every write is permanent. You can only move forward by writing new things.
Technical Design Principles
Principle 1: Define Compensation at Design Time
Every forward-moving action must have a planned compensation defined before it executes. This is not optional.
interface SagaStep {
name: string;
action: (context: SagaContext) => Promise<void>;
compensation: (context: SagaContext) => Promise<void>;
timeout?: number;
retryPolicy?: RetryPolicy;
}
const orderSaga: SagaStep[] = [
{
name: "createOrder",
action: async (ctx) => {
ctx.orderId = await orderService.create(ctx.orderData);
},
compensation: async (ctx) => {
// Create cancellation record, don't delete
await orderService.cancel(ctx.orderId, { reason: "saga_compensation" });
}
},
{
name: "chargePayment",
action: async (ctx) => {
ctx.paymentId = await paymentService.charge(ctx.customerId, ctx.amount);
},
compensation: async (ctx) => {
// Refund the charge
await paymentService.refund(ctx.paymentId);
}
},
{
name: "decrementInventory",
action: async (ctx) => {
await inventoryService.decrement(ctx.sku, ctx.quantity);
},
compensation: async (ctx) => {
// Increment back
await inventoryService.increment(ctx.sku, ctx.quantity);
}
}
];
The discipline of defining compensation upfront forces you to think about failure modes early.
Principle 2: Idempotent Compensations
Compensations must be idempotent: executing the same compensation twice must produce the same result as executing it once, with no additional side effects.
Why? Because compensations themselves can fail. A compensation might be executed, then the process crashes before it’s marked complete. When the system recovers, it needs to re-execute that compensation. If compensation isn’t idempotent, you create corruption.
// Non-idempotent compensation (WRONG)
async function refundPayment(paymentId: string) {
const payment = await db.payments.findById(paymentId);
await stripeAPI.refund(payment.stripeChargeId); // What if called twice?
await db.payments.update(paymentId, { status: "refunded" });
}
// Idempotent compensation (CORRECT)
async function refundPayment(paymentId: string) {
const payment = await db.payments.findById(paymentId);
// Check idempotency key to prevent duplicate refunds
const idempotencyKey = `refund-${paymentId}`;
const existingRefund = await db.refunds.findByIdempotencyKey(idempotencyKey);
if (existingRefund) {
return; // Already refunded, idempotent return
}
// Execute refund with idempotency key (Stripe supports this)
const refund = await stripeAPI.refund(
payment.stripeChargeId,
{ idempotencyKey }
);
// Record refund
await db.refunds.insert({
paymentId,
refundId: refund.id,
idempotencyKey
});
}
Idempotency keys, database uniqueness constraints, and conditional logic are your tools here.
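The uniqueness-constraint approach can be sketched for the inventory compensation, which has no external idempotency key to lean on. Here an in-memory `Set` stands in for a table with a unique constraint on (saga id, step name); `CompensationLog`, `restoreInventory`, and the sample ids are hypothetical.

```typescript
// Hypothetical in-memory stand-in for a table with a UNIQUE constraint
// on (sagaId, stepName).
class CompensationLog {
  private applied = new Set<string>();

  // Returns true only for the first caller with this key, mimicking
  // an INSERT that fails on a uniqueness violation.
  tryRecord(sagaId: string, stepName: string): boolean {
    const key = `${sagaId}:${stepName}`;
    if (this.applied.has(key)) return false;
    this.applied.add(key);
    return true;
  }
}

const stock = new Map<string, number>([["sku-1", 5]]);
const compensationLog = new CompensationLog();

function restoreInventory(sagaId: string, sku: string, quantity: number): void {
  // A second call for the same saga and step is a no-op.
  if (!compensationLog.tryRecord(sagaId, "restoreInventory")) return;
  stock.set(sku, (stock.get(sku) ?? 0) + quantity);
}

restoreInventory("saga-1", "sku-1", 3);
restoreInventory("saga-1", "sku-1", 3); // retry after a crash: no effect
```

Stock ends at 8 rather than 11. In a real database the log insert and the stock update would need to commit in the same transaction; otherwise a crash between them could record the compensation without applying it.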
Principle 3: The Pivot Transaction
In longer sagas, there often exists a pivot transaction—a point of no return after which compensation becomes impossible or impractical.
Example: In an order saga, the pivot might be “physical goods dispatched to carrier.” Once the carrier has the package, you can’t simply “cancel” anymore. Compensation switches from cancellation to return-and-refund.
Identifying your pivot is strategic:
const orderSagaWithPivot: SagaStep[] = [
// Pre-pivot: Can cancel easily
{ name: "createOrder", action: ..., compensation: ... },
{ name: "chargePayment", action: ..., compensation: ... },
{ name: "decrementInventory", action: ..., compensation: ... },
// PIVOT: After this, compensation strategy changes
{
name: "dispatchToCarrier",
action: async (ctx) => {
ctx.shipmentId = await fulfillmentService.dispatch(ctx.orderId);
ctx.isPivot = true; // Mark that we've crossed the pivot
},
compensation: async (ctx) => {
// Compensation changes to a return process, not cancellation
await fulfillmentService.initiateReturn(ctx.shipmentId);
// Send return label to customer
// Update order status to "return_in_progress"
}
}
];
Identifying pivots helps teams understand the financial and operational implications of failure.
Principle 4: State Tracking for Compensations
The saga orchestrator must track which steps have completed so it knows which compensations to execute.
interface SagaState {
sagaId: string;
status: "pending" | "in_progress" | "compensating" | "failed" | "completed";
completedSteps: string[]; // Which steps have succeeded
failedStep: string | null; // Which step failed
compensatedSteps: string[]; // Which compensations have been executed
context: SagaContext;
createdAt: Date;
updatedAt: Date;
}
// When compensation occurs:
async function handleSagaFailure(sagaId: string, failedStepName: string) {
  const state = await db.sagaStates.findById(sagaId);
  // Persist the transition first so a restarted orchestrator knows to resume
  state.status = "compensating";
  state.failedStep = failedStepName;
  await db.sagaStates.update(sagaId, state);
  // Completed steps that have not been compensated yet, in reverse order
  const stepsToCompensate = sagaDefinition
    .filter(step => state.completedSteps.includes(step.name))
    .filter(step => !state.compensatedSteps.includes(step.name))
    .reverse();
  for (const step of stepsToCompensate) {
    try {
      await step.compensation(state.context);
      state.compensatedSteps.push(step.name);
      // Record progress after every step, not only at the end
      await db.sagaStates.update(sagaId, state);
    } catch (error) {
      // Compensation failed: escalate or retry
      await escalateToManualIntervention(sagaId, step.name, error);
      return;
    }
  }
  state.status = "failed";
  await db.sagaStates.update(sagaId, state);
}
Storing saga state in a database ensures you can recover and resume compensation even if the orchestrator crashes.
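Recovery then reduces to a pure calculation over the persisted state: which completed steps still lack a compensation, newest first. The sketch below assumes only the `completedSteps` and `compensatedSteps` fields; `PersistedSaga`, `remainingCompensations`, and the step names are illustrative.

```typescript
type PersistedSaga = {
  completedSteps: string[];
  compensatedSteps: string[];
};

// Returns the compensations still owed, newest step first.
function remainingCompensations(stepOrder: string[], saga: PersistedSaga): string[] {
  const completed = new Set(saga.completedSteps);
  const compensated = new Set(saga.compensatedSteps);
  return stepOrder
    .filter((name) => completed.has(name) && !compensated.has(name))
    .reverse();
}

const stepOrder = ["createOrder", "chargePayment", "decrementInventory", "sendEmail"];

// State as loaded after a crash: one compensation had already finished.
const recovered: PersistedSaga = {
  completedSteps: ["createOrder", "chargePayment", "decrementInventory"],
  compensatedSteps: ["decrementInventory"],
};

const todo = remainingCompensations(stepOrder, recovered);
```

Here `todo` is `["chargePayment", "createOrder"]`. Because a crash can also land between running a compensation and recording it, the step at the head of this list may run twice, which is one more reason compensations need to be idempotent.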
Principle 5: Handling Compensation Failures
What happens when a compensation itself fails? This is the nightmare scenario.
Example: You’re refunding a payment, but the Stripe API returns an error. The compensation has been partially executed. The customer sees their money hasn’t come back. Your system is now in a corrupted state.
Strategies:
- Retry with Backoff: Compensation failures are often transient. Retry with exponential backoff.
async function refundWithRetry(
paymentId: string,
maxRetries: number = 5
) {
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
await refundPayment(paymentId);
return;
} catch (error) {
if (attempt === maxRetries - 1) throw error;
const delay = Math.pow(2, attempt) * 1000; // exponential backoff
await sleep(delay);
}
}
}
- Dead Letter Queue: If retries are exhausted, send the failed compensation to a DLQ for manual review.
async function compensateWithDLQ(step: SagaStep, context: SagaContext) {
  try {
    await step.compensation(context);
  } catch (error) {
    await messageQueue.sendToDeadLetterQueue({
      type: "compensation_failed",
      step: step.name,
      context,
      // Under strict compiler settings `error` is `unknown`, so narrow it first
      error: error instanceof Error ? error.message : String(error),
      timestamp: new Date()
    });
  }
}
- Manual Intervention: Escalate to a human who can decide next steps (e.g., contact payment processor).
- Alerting: Fire an alert immediately so ops teams are aware.
The key: assume compensations will fail. Design for graceful degradation, not perfection.
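The four strategies are usually layered rather than chosen one at a time. A minimal sketch of that layering follows, with the sleep, dead-letter, and alert operations injected so the policy can be exercised without real infrastructure. `FailureHooks`, `compensateWithPolicy`, and `policyDemo` are hypothetical names, and the default attempt count and base delay are arbitrary.

```typescript
type FailureHooks = {
  sleep: (ms: number) => Promise<void>;
  deadLetter: (stepName: string, error: string) => Promise<void>;
  alert: (message: string) => Promise<void>;
};

// Retry with exponential backoff; if every attempt fails, dead-letter and alert.
async function compensateWithPolicy(
  stepName: string,
  compensation: () => Promise<void>,
  hooks: FailureHooks,
  maxAttempts = 3,
  baseDelayMs = 100
): Promise<"compensated" | "dead_lettered"> {
  let lastError = "unknown error";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await compensation();
      return "compensated";
    } catch (error) {
      lastError = error instanceof Error ? error.message : String(error);
      // No point sleeping after the final attempt
      if (attempt < maxAttempts - 1) {
        await hooks.sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  await hooks.deadLetter(stepName, lastError);
  await hooks.alert(`Compensation ${stepName} needs manual intervention: ${lastError}`);
  return "dead_lettered";
}

// Exercise the policy with a compensation that always fails.
async function policyDemo() {
  const delays: number[] = [];
  const deadLettered: string[] = [];
  const alerts: string[] = [];
  const hooks: FailureHooks = {
    sleep: async (ms) => { delays.push(ms); },
    deadLetter: async (step) => { deadLettered.push(step); },
    alert: async (message) => { alerts.push(message); },
  };
  const outcome = await compensateWithPolicy(
    "refundPayment",
    async () => { throw new Error("gateway timeout"); },
    hooks
  );
  return { outcome, delays, deadLettered, alerts };
}
```

With the defaults, an always-failing compensation waits 100ms and then 200ms between its three attempts, lands in the dead letter queue once, and raises one alert. Production code would typically also cap the delay and add jitter.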
Principle 6: The Outbox Pattern for Reliable Compensation Events
When compensation needs to trigger external actions (like sending a cancellation email), you face a dual-write problem: the compensation transaction commits, but the process crashes or the send fails before the email goes out, and the notification is silently lost.
The outbox pattern solves this:
// Step 1: Write compensation AND outbox entry in same transaction
async function compensateOrderWithOutbox(orderId: string) {
await db.transaction(async (trx) => {
// Create cancellation record
await trx.orders.update(orderId, { status: "cancelled" });
// Write outbox entry for the notification service
await trx.outbox.insert({
aggregateId: orderId,
aggregateType: "order",
eventType: "order_cancelled",
payload: { orderId, reason: "saga_compensation" },
createdAt: new Date(),
published: false
});
});
}
// Step 2: Separate process polls outbox and publishes events
async function publishOutboxEvents() {
const unpublished = await db.outbox.findWhere({ published: false });
for (const entry of unpublished) {
try {
await eventBus.publish(entry.eventType, entry.payload);
await db.outbox.update(entry.id, { published: true });
} catch (error) {
// Retry later; entry remains in outbox
console.error(`Failed to publish ${entry.eventType}:`, error);
}
}
}
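Because the state change and the outbox entry commit atomically, the event cannot be lost, and publishing can be retried independently of the compensation. Delivery is at-least-once, however: if publishing succeeds but marking the entry as published fails, the event is sent again, so consumers should deduplicate.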
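A polling publisher like the one above can publish an entry and then crash before marking it published, so the same event may arrive twice. One common way to absorb that is consumer-side deduplication, sketched below. It assumes each event carries a stable id (for example the outbox row id), which the publishing code shown here does not currently include; `OutboxEvent` and `CancellationEmailConsumer` are hypothetical.

```typescript
type OutboxEvent = {
  id: string; // assumed: stable identifier such as the outbox row id
  eventType: string;
  payload: { orderId: string };
};

class CancellationEmailConsumer {
  // Stand-in for a durable processed-events table
  private processed = new Set<string>();
  public sent: string[] = [];

  handle(event: OutboxEvent): void {
    if (this.processed.has(event.id)) return; // duplicate delivery, ignore
    this.sent.push(`cancellation email for ${event.payload.orderId}`);
    this.processed.add(event.id);
  }
}

const consumer = new CancellationEmailConsumer();
const cancelled: OutboxEvent = {
  id: "outbox-17",
  eventType: "order_cancelled",
  payload: { orderId: "o-1" },
};

consumer.handle(cancelled);
consumer.handle(cancelled); // redelivered by the publisher
```

Only one email is recorded despite two deliveries. In a durable implementation the processed-id record and the side effect would have to be committed together, or the side effect itself made idempotent.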
Complete Order Cancellation Example
Let’s design a full order cancellation saga with all principles applied:
interface OrderCancellationSaga {
sagaId: string;
orderId: string;
customerId: string;
amount: number;
paymentId: string;
shipmentId?: string;
}
class OrderCancellationOrchestrator {
async executeCompensation(saga: OrderCancellationSaga) {
const state = await this.loadState(saga.sagaId);
// Determine compensation strategy based on completed steps
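if (!state.completedSteps.includes("chargePayment")) {
// Payment never succeeded, so there is nothing to refund or restock,
// but an order created earlier still has to be cancelled
if (state.completedSteps.includes("createOrder")) {
await orderService.cancel(saga.orderId, { reason: "saga_compensation" });
}
return;
}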
if (state.completedSteps.includes("dispatchToCarrier")) {
// Pivot crossed: use return process, not cancellation
await this.compensateWithReturn(saga);
} else {
// Pre-pivot: use simple cancellation
await this.compensateWithCancellation(saga);
}
}
private async compensateWithCancellation(saga: OrderCancellationSaga) {
// Compensate in reverse order of the forward steps
// Step 1: Increment inventory back (undoes decrementInventory)
const order = await orderService.getOrder(saga.orderId);
for (const item of order.items) {
await inventoryService.increment(item.sku, item.quantity);
}
// Step 2: Refund payment (idempotent; undoes chargePayment)
await this.refundPaymentIdempotent(saga.paymentId);
// Step 3: Record cancellation and publish event via outbox
await db.transaction(async (trx) => {
await trx.orders.update(saga.orderId, { status: "cancelled" });
await trx.outbox.insert({
aggregateId: saga.orderId,
eventType: "order_cancelled",
payload: { orderId: saga.orderId, reason: "customer_cancellation" },
published: false
});
});
}
private async compensateWithReturn(saga: OrderCancellationSaga) {
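// Initiate return instead of cancellation
// shipmentId is optional on the saga type, so fail loudly if it is missing
if (!saga.shipmentId) {
throw new Error(`Saga ${saga.sagaId} crossed the pivot but has no shipmentId`);
}
const returnId = await fulfillmentService.initiateReturn(
saga.shipmentId,
{ reason: "saga_compensation" }
);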
// Store return in saga state for tracking
await db.sagaStates.update(saga.sagaId, {
returnId,
status: "return_in_progress"
});
// Note: refund happens only after return is received
// For now, just mark as "refund_pending"
await db.transaction(async (trx) => {
await trx.orders.update(saga.orderId, {
status: "return_requested",
returnId
});
await trx.outbox.insert({
aggregateId: saga.orderId,
eventType: "return_initiated",
payload: { orderId: saga.orderId, returnId, reason: "saga_compensation" },
published: false
});
});
}
private async refundPaymentIdempotent(paymentId: string) {
const idempotencyKey = `refund-${paymentId}`;
const existingRefund = await db.refunds.findOne({ paymentId });
if (existingRefund) return; // Already refunded
const refund = await paymentProvider.refund(
paymentId,
{ idempotencyKey }
);
await db.refunds.insert({
paymentId,
refundId: refund.id,
idempotencyKey,
amount: refund.amount,
timestamp: new Date()
});
}
}
Trade-offs and Considerations
Design Complexity
Every action needs a defined, tested compensation. For a 10-step saga, you’re really designing 20 flows: happy path + compensation path. This doubles design and testing burden.
Mitigation: Use code generation for simple compensations, establish compensation templates, and invest heavily in automated testing.
Temporal Inconsistency Windows
Between when a forward transaction commits and its compensation completes, the system is in an inconsistent state. A user querying the order status might see “pending” instead of “cancelled” for several seconds or minutes.
Mitigation: Make this window explicit in your SLAs. Design UIs to handle intermediate states gracefully. Document the eventual-consistency guarantees you actually provide.
Non-Compensable Operations
Some operations are fundamentally non-compensable:
- A physical package already shipped to a customer
- A review posted on an external platform you don’t control
- A customer’s personal data already processed by a vendor
For these, you cannot use compensation-based sagas. Instead:
- Use reservations: Instead of shipping immediately, place the order in “reserved” state, confirm only after all checks pass.
- Use manual workflows: Escalate to humans (customer service team) to handle the reversal.
- Use return processes: Accept that some operations can’t be undone, only reversed through a new process.
Performance Overhead
Maintaining saga state in a database and executing potentially multiple compensation steps adds latency. As a rough illustration, a saga that completes in 50ms on the happy path might take 500ms or more when it has to compensate.
Mitigation: Use in-memory state only for short sagas where losing progress on a crash is acceptable (durable state is what makes recovery possible), cache frequently accessed compensation logic, and accept that failure paths are slower (they should be rare anyway).
Testing Burden
You must test not just the happy path, but every possible failure point and its corresponding compensation:
- Failure after step 1, 2, 3, … N
- Each failure should trigger the correct compensation chain
- Compensations themselves should be fault-injected to test DLQ handling
This is non-negotiable. Untested compensation paths are time bombs.
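The first two bullets lend themselves to a table-driven test: inject a failure at every step index and compare the compensations that ran against the expected chain. The sketch below uses a synchronous simulation with step names only; `SimStep`, `runWithInjectedFailure`, and `expectedChain` are hypothetical helpers, and a real suite would drive the actual orchestrator instead.

```typescript
type SimStep = { name: string; compensation: string };

const simSteps: SimStep[] = [
  { name: "createOrder", compensation: "cancelOrder" },
  { name: "chargePayment", compensation: "refundPayment" },
  { name: "decrementInventory", compensation: "incrementInventory" },
];

// Runs the saga with a failure injected at index failAt and returns
// the compensations executed, in execution order.
function runWithInjectedFailure(steps: SimStep[], failAt: number): string[] {
  const stack: string[] = [];
  const executed: string[] = [];
  for (let i = 0; i < steps.length; i++) {
    if (i === failAt) {
      while (stack.length > 0) executed.push(stack.pop() as string);
      return executed;
    }
    stack.push(steps[i].compensation);
  }
  return executed;
}

// Expected chain for a failure at step i: compensations of steps 0..i-1, reversed.
function expectedChain(steps: SimStep[], failAt: number): string[] {
  return steps.slice(0, failAt).map((s) => s.compensation).reverse();
}

// One case per possible failure point
const results = simSteps.map((_, failAt) => ({
  failAt,
  actual: runWithInjectedFailure(simSteps, failAt),
  expected: expectedChain(simSteps, failAt),
}));
```

A failure at the first step yields an empty chain, and a failure at the third yields `refundPayment` followed by `cancelOrder`. The third bullet, faulting the compensations themselves, needs a second dimension of injection on top of this.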
Alternatives: Reservations and Tentative State
Instead of compensating after committing, you can design for tentative operations:
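// Reservation pattern: don't commit until sure
async function reserveAndThenConfirm(order: Order) {
  // Reserve inventory without decrementing
  const reservation = await inventory.reserve(
    order.items,
    { ttl: 60000 } // 60 second timeout; the hold lapses if never confirmed
  );
  // Try payment (tentative). The result is named `authorization` so it
  // does not shadow the `payment` service it is obtained from.
  const authorization = await payment.authorizeOnly(order.customerId, order.amount);
  // All checks passed, now commit everything.
  // Note: db.transaction only makes local database writes atomic; the
  // confirm and capture calls go to other services and need their own retries.
  await db.transaction(async (trx) => {
    await inventory.confirm(reservation.id);
    await payment.capture(authorization.id);
    await orders.create(order);
  });
}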
This trades off some complexity (managing reservations) for less compensation complexity. Choose based on your failure patterns and business requirements.
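What “managing reservations” involves can be sketched with an in-memory inventory that holds stock under a TTL and only decrements on confirm. Everything here is illustrative: `InventoryReservations`, its methods, and the injected clock are assumptions, not the API of the inventory service used above.

```typescript
type Reservation = { id: number; sku: string; quantity: number; expiresAt: number };

class InventoryReservations {
  private nextId = 1;
  private reservations = new Map<number, Reservation>();

  // The clock is injected so expiry can be exercised without waiting.
  constructor(private onHand: Map<string, number>, private now: () => number) {}

  // Units on hand minus units held by unexpired reservations.
  available(sku: string): number {
    let held = 0;
    for (const r of Array.from(this.reservations.values())) {
      if (r.sku === sku && r.expiresAt > this.now()) held += r.quantity;
    }
    return (this.onHand.get(sku) ?? 0) - held;
  }

  reserve(sku: string, quantity: number, ttlMs: number): Reservation {
    if (this.available(sku) < quantity) throw new Error(`Insufficient stock for ${sku}`);
    const r: Reservation = { id: this.nextId++, sku, quantity, expiresAt: this.now() + ttlMs };
    this.reservations.set(r.id, r);
    return r;
  }

  // Only confirmation actually decrements stock.
  confirm(id: number): void {
    const r = this.reservations.get(id);
    if (!r || r.expiresAt <= this.now()) throw new Error(`Reservation ${id} missing or expired`);
    this.onHand.set(r.sku, (this.onHand.get(r.sku) ?? 0) - r.quantity);
    this.reservations.delete(id);
  }

  // Releasing is a no-op if the reservation is already gone.
  release(id: number): void {
    this.reservations.delete(id);
  }
}

let clock = 0;
const reservations = new InventoryReservations(new Map([["sku-1", 5]]), () => clock);
const held = reservations.reserve("sku-1", 3, 60_000);
const whileHeld = reservations.available("sku-1");   // 3 of 5 units are held
clock = 60_001;                                      // TTL elapses without a confirm
const afterExpiry = reservations.available("sku-1"); // hold has lapsed
```

While the reservation is live, 2 units are available; once the TTL passes without a confirm, all 5 are available again with no compensation step required. The cost is that every reader of stock levels has to account for holds and expiry.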
Key Takeaways
- Compensating transactions are forward-moving actions that neutralize the logical effect of previously committed transactions, not database rollbacks. History is immutable; you add new entries to offset previous ones.
- Define compensation at design time. Every forward action must have a planned, tested compensation. This discipline forces you to think about failure modes early.
- Maintain strict ordering: compensations execute in reverse order of their corresponding forward transactions. A compensation stack (LIFO) is your implementation pattern.
- Idempotency is non-negotiable for both forward actions and compensations. Expect failures and retries; design so that re-execution doesn’t corrupt state.
- Identify pivot points in your saga where compensation strategy changes (e.g., after shipment, switch from cancellation to return). Understand the business implications.
- Assume compensation failures occur. Use retries, dead letter queues, alerts, and manual escalation. Compensation failure is a special case; handle it explicitly.
- Use the outbox pattern to reliably publish compensation side effects (emails, events) without losing notifications. Delivery is at-least-once, so make consumers idempotent to absorb duplicates.
Practice Scenarios
Scenario 1: The Partially Shipped Order
You have an order with 5 items. Three items are shipped; two are in fulfillment when cancellation is requested. How do you compensate? Design the compensation logic considering:
- Which items to return (shipped vs. in-warehouse)
- Whether to refund immediately or after return receipt
- What state transitions the order undergoes
- How to handle a customer disputing the refund
Scenario 2: The Third-Party Vendor Saga
Your marketplace accepts orders but fulfills through third-party vendors. The saga is: (1) charge customer, (2) send order to vendor API, (3) vendor confirms receipt. If the vendor API call fails but payment succeeded, what happens?
Design:
- How to detect this scenario
- The compensation flow
- How to sync state with the vendor
- What if the vendor has already started fulfillment
Scenario 3: The Cascading Failure
A saga executes: reserve inventory, charge payment, notify warehouse, dispatch to carrier. The notification service is down, so compensation begins—but compensating the payment fails (Stripe timeout). The compensation itself goes to DLQ. Now a human needs to intervene. Design:
- How the human sees the saga state
- What information they need to manually compensate
- How to prevent this scenario with better error handling
- Testing strategy for this edge case
Connection to the Next Section
Compensating transactions work only if they’re idempotent. When a compensation retries due to transient failure, it must produce the same end result without side effects. In the next section, we’ll explore idempotency in depth: what it means, how to implement it correctly, and why it’s the foundation of resilient distributed systems. Idempotency transforms compensations from dangerous to safe, and sagas from fragile to rock-solid.