System Design Fundamentals

Message Delivery Guarantees

The Payment Processing Problem

Imagine you’re running an e-commerce platform processing payment events from a message queue. A customer completes a purchase for $99.99. A message enters your queue: {"payment_id": "pay_123", "amount": 99.99, "user_id": "user_456"}.

Now, here’s where the nightmare begins.

If your system loses this message, the customer never gets charged. Your revenue disappears. The customer is happy, but your accounting department is not. If your system processes this message twice, the customer gets charged $199.98. They call support. They demand a refund. You process the refund, which generates another message, which might also be processed twice. Welcome to the cascading failure scenario that keeps engineers awake at night.

The delivery guarantee you choose for your message queue determines which of these failure modes your system is exposed to. It’s one of the most consequential architectural decisions in event-driven systems, sitting right at the intersection of data consistency, user experience, and operational complexity.

Understanding the Three Guarantees

Message delivery guarantees exist on a spectrum of reliability versus latency. Let’s be clear about what each one means.

At-Most-Once: Speed Over Safety

At-most-once is the “fire and forget” guarantee. Your producer sends a message and doesn’t wait for any confirmation. The broker receives it (maybe) and processes it (maybe). If anything fails—network hiccup, broker crash, consumer dying mid-process—the message is lost forever.

The trade-off is simplicity and speed. You get minimal latency, maximum throughput, and the smallest memory footprint. Perfect when occasional loss doesn’t matter.

Use at-most-once for: metrics aggregation, analytics events, non-critical logging, monitoring dashboards, clickstream data where a few dropped events won’t change your insights.

Never use at-most-once for: financial transactions, critical business events, inventory updates, user account changes, or anything where loss means lost revenue or trust.

At-Least-Once: The Production Standard

At-least-once is the pragmatic middle ground. Your producer sends a message and waits for the broker to confirm it's persisted. The consumer processes the message and sends an acknowledgment back. Only after that ack does the broker remove the message from the queue.

The guarantee is simple: every message will be delivered to the consumer at least one time. But “at least once” means “possibly multiple times.” If the consumer crashes after processing but before sending the ack, the broker will redeliver the same message when the consumer restarts. If network failures occur between consumer and broker, you get retries and duplicates.

Here’s the critical insight: at-least-once plus idempotent processing equals effectively exactly-once. More on that in a moment.

At-least-once is the most common choice in production systems because it handles the majority of failure scenarios without the complexity of true exactly-once processing.

Exactly-Once: The Holy Grail

Exactly-once means each message is processed by the consumer exactly one time, regardless of network failures, crashes, or retries. No duplicates. No losses.

Here’s the uncomfortable truth: true exactly-once is theoretically impossible to achieve end-to-end over unreliable networks. This is the Two Generals’ Problem (distinct from the Byzantine Generals’ Problem, which concerns faulty or malicious participants rather than lossy links). If you send a message to a consumer and wait for an ack, but the ack itself gets lost, you don’t know whether the consumer processed the message. So you retry. Now the consumer might process it twice. To prevent duplicates, you need the consumer to know whether it already processed this message. But that knowledge needs to be persisted somewhere, and if that persistence fails… you’re back to the same problem.

What we can achieve is effectively exactly-once, which combines at-least-once delivery with idempotent processing. The message might be delivered multiple times, but your system is built so that processing it multiple times has the same effect as processing it once.
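
A quick way to see the distinction: an operation is idempotent if applying it twice leaves the system in the same state as applying it once. The values below are purely illustrative.

# Toy illustration: idempotent vs. non-idempotent updates
order_status = {}
balance = 500.00

# Idempotent: recording an absolute fact; repeating it changes nothing
order_status['pay_123'] = 'charged'
order_status['pay_123'] = 'charged'   # Same final state either way

# Not idempotent: applying a delta; repeating it double-charges
balance -= 99.99
balance -= 99.99                      # Balance is now wrong by 99.99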

Some systems like Kafka Streams offer what they call “exactly-once semantics,” which is effectively exactly-once within the system’s own boundaries, combining idempotent producers, transactional writes, and read-committed consumer isolation.

A Practical Analogy

Think of sending important legal documents to a government office.

At-most-once is like regular postal mail. You drop the envelope in the mailbox and walk away. It might arrive, it might get lost, but the office will never receive two copies. For legal documents, though, occasional loss is unacceptable.

At-least-once is like certified mail with signature confirmation. The post office requires the recipient to sign. If they don’t sign within a few days, the post office re-sends the letter. This works great, except there’s a failure mode: what if the recipient signs, but the confirmation letter back to you gets lost? You don’t know they signed, so you send it again. Now they have two copies. To handle this, the recipient needs an assistant who checks a file cabinet: “Did we already receive a document from sender X with document ID 12345? If yes, ignore it.” That file cabinet is your idempotency storage.

Exactly-once is like hiring an expensive legal courier who hand-delivers the document, gets a signed receipt with the document ID, and checks a government database before delivery: “Has document 12345 been delivered already?” This is expensive (high latency, high complexity) but guarantees exactly one delivery with no duplicates.

How Each Guarantee Works

At-Most-Once in Practice

In Kafka, at-most-once is achieved with acks=0:

import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    acks=0,  # Producer doesn't wait for any confirmation
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

producer.send('payments', {'payment_id': 'pay_123', 'amount': 99.99})
producer.flush()

The producer sends the message and immediately returns. The message might not even reach the broker. No retry logic. No waiting. Pure speed.

On the consumer side, you’d process the message and commit your offset immediately, even though you don’t know if processing succeeded:

import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'payments',
    bootstrap_servers=['localhost:9092'],
    enable_auto_commit=True,  # Dangerous! Offsets can be committed before processing finishes
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in consumer:
    process_payment(message)  # If this crashes, the offset may already be committed

The danger is obvious: if process_payment() crashes, the offset may already have been committed by the background auto-commit, and the message is lost forever.

At-Least-Once: The Standard Approach

At-least-once requires more ceremony. The producer waits for confirmation:

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    acks='all',  # Wait for the leader and all in-sync replicas
    retries=3,   # Retry up to 3 times on transient failures
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

# send() itself is asynchronous; blocking on the future waits for broker confirmation
future = producer.send('payments', {'payment_id': 'pay_123', 'amount': 99.99})
try:
    record_metadata = future.get(timeout=10)
    print(f"Message sent to partition {record_metadata.partition}")
except Exception as e:
    print(f"Failed to send message: {e}")

The consumer processes and manually commits only after successful processing:

import logging

logger = logging.getLogger(__name__)

consumer = KafkaConsumer(
    'payments',
    bootstrap_servers=['localhost:9092'],
    enable_auto_commit=False,  # Manual commit
    auto_offset_reset='earliest',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in consumer:
    try:
        process_payment(message)
        consumer.commit()  # Commit only after successful processing
    except Exception as e:
        # Don't commit; the message is redelivered after a restart or rebalance
        logger.error(f"Failed to process: {e}")

If process_payment() crashes before the commit() call, the consumer restarts, and the message is redelivered. The system now guarantees delivery, but we need to handle duplicates.

Handling Duplicates: Idempotency Keys

The solution is idempotency: your consumer should be able to process the same message multiple times with the same end result as processing it once.

For payments, the consumer first checks whether it has already processed this specific payment:

def process_payment(message):
    payment_id = message.value.get('payment_id')
    amount = message.value.get('amount')

    # Check if we've already processed this payment
    existing = db.query(
        "SELECT id FROM processed_payments WHERE payment_id = %s",
        payment_id
    )

    if existing:
        logger.info(f"Payment {payment_id} already processed, skipping")
        return

    # Process the payment
    charge_customer(amount)

    # Mark as processed (with the payment_id as idempotency key).
    # A UNIQUE constraint on payment_id is what makes this safe when two
    # workers race past the SELECT above before either INSERT lands
    db.execute(
        "INSERT INTO processed_payments (payment_id, amount) VALUES (%s, %s)",
        payment_id, amount
    )

The payment_id is your idempotency key. Even if the message is delivered five times, you only charge the customer once. This is the pattern used in production systems everywhere.
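
One caveat: the SELECT-then-INSERT above has a race window when more than one worker can see the same message (during a rebalance, or with competing consumers on a queue). Below is a minimal sketch of an atomic variant, assuming PostgreSQL via psycopg2, a UNIQUE constraint on processed_payments.payment_id, and the same hypothetical charge_customer() as above.

# A minimal sketch of an atomic idempotency check, assuming PostgreSQL
# (psycopg2) and a UNIQUE constraint on processed_payments.payment_id
import psycopg2

def process_payment_atomic(conn, payment_id, amount):
    with conn.cursor() as cur:
        # Claim the idempotency key first; ON CONFLICT DO NOTHING makes the
        # claim atomic, so only one worker wins for a given payment_id
        cur.execute(
            """
            INSERT INTO processed_payments (payment_id, amount)
            VALUES (%s, %s)
            ON CONFLICT (payment_id) DO NOTHING
            """,
            (payment_id, amount),
        )
        if cur.rowcount == 0:
            conn.rollback()  # Key already claimed: duplicate delivery, skip
            return

        charge_customer(amount)  # Hypothetical business call from above
        conn.commit()            # Charge and key are committed together

If charge_customer() calls an external provider, pass payment_id along as the provider-side idempotency key as well, since a crash between the charge and the commit reopens the duplicate window.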

For AWS SQS FIFO queues, the broker-side deduplication key is called a MessageDeduplicationId; SQS silently drops any duplicate with the same ID that arrives within the five-minute deduplication window:

import boto3

sqs = boto3.client('sqs')

response = sqs.send_message(
    QueueUrl='https://sqs.us-east-1.amazonaws.com/123456789/payments.fifo',
    MessageBody='{"payment_id": "pay_123", "amount": 99.99}',
    MessageGroupId='user_456',         # Ordering scope within the FIFO queue
    MessageDeduplicationId='pay_123',  # Deduplication (idempotency) key
)

Effectively Exactly-Once: The Transactional Outbox Pattern

True exactly-once across a database and a message queue is hard because they can’t transact together. But we can approximate it with the transactional outbox pattern:

  1. Write your business data (e.g., order) to your database
  2. Write a record to an “outbox” table in the same transaction
  3. A separate process (a poller, or a Change Data Capture pipeline) reads the outbox and publishes to the message queue
  4. Only after the message is confirmed published does the outbox record get marked as sent

import json

def create_order(order_data):
    with db.transaction():  # The context manager commits atomically on exit
        # Write order
        order = Order.create(**order_data)

        # Write to outbox in the same transaction
        Outbox.create(
            event_type='order.created',
            aggregate_id=order.id,
            payload=json.dumps({'order_id': order.id, 'amount': order.total}),
            published=False
        )

    return order

# Separate process (outbox relay / CDC worker):
def publish_outbox_events():
    unpublished = Outbox.query.filter(Outbox.published == False)

    for event in unpublished:
        try:
            # Block on the send until the broker confirms; marking the row
            # as published before confirmation could lose the event
            future = kafka_producer.send('order-events', event.payload)
            future.get(timeout=10)
            event.published = True
            db.commit()
        except Exception as e:
            logger.error(f"Failed to publish: {e}")
            # Row stays unpublished; it is retried on the next pass

This keeps your business data and the outbox in sync (atomicity): either both the order and its event are written, or neither is. If the worker crashes after publishing but before marking the row, the event is published again on the next pass, which is exactly why the consumer must be idempotent.
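
For reference, here is a sketch of the table behind the Outbox model above, assuming PostgreSQL; the column names mirror the example and the exact types are illustrative.

# A sketch of the table backing the Outbox model above, assuming PostgreSQL;
# column names mirror the example and the types are illustrative
db.execute("""
    CREATE TABLE IF NOT EXISTS outbox (
        id           BIGSERIAL   PRIMARY KEY,
        event_type   TEXT        NOT NULL,
        aggregate_id TEXT        NOT NULL,
        payload      JSONB       NOT NULL,
        published    BOOLEAN     NOT NULL DEFAULT FALSE,
        created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
    )
""")

An index on published (or a partial index WHERE NOT published) keeps the relay's polling query cheap as the table grows.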

Comparison Table

Aspect                          | At-Most-Once                       | At-Least-Once              | Exactly-Once
--------------------------------|------------------------------------|----------------------------|---------------------------------------
Message loss                    | Possible                           | No                         | No
Duplicates                      | No                                 | Possible                   | No
Latency                         | Minimal (acks=0)                   | Medium (acks=all)          | High (requires coordination)
Throughput                      | Maximum                            | Good                       | Lower
Implementation complexity       | Simple                             | Moderate (idempotency)     | High (transactions/CDC)
When to use                     | Metrics, logs, non-critical events | Most production use cases  | Financial transactions, critical state
Consumer idempotency required?  | No                                 | Yes                        | Yes

A Timeline: What Happens During a Consumer Crash

Let’s trace through at-least-once during a crash:

T1:  Producer sends message: {"payment_id": "pay_123"}
T2:  Broker persists message, sends ack
T3:  Producer receives ack
T4:  Consumer polls message
T5:  Consumer calls process_payment()
T6:  process_payment() completes successfully
T7:  Consumer tries to commit offset...
T8:  CRASH! Consumer dies before commit() returns
T9:  Broker keeps message (offset not updated)
T10: Consumer restarts
T11: Consumer polls again, gets the same message
T12: process_payment() called again with same payment_id
T13: Idempotency check: "pay_123" already in processed_payments table
T14: Safely skip processing (or log as duplicate)
T15: Consumer commits offset
T16: Broker removes message

Without the idempotency key, the payment would be processed twice. With it, the system remains consistent.

Practical Considerations: Where Exactly-Once Breaks Down

The challenge with “true” exactly-once is the coordinator problem. In Kafka Streams, achieving exactly-once requires:

  • Idempotent producer: Same message sent multiple times produces one effect
  • Transactional writes: State changes and offset commits happen atomically
  • Read-committed isolation: Only consume committed messages

This works within Kafka Streams. But your downstream system? If you write to an external database, you’re relying on that database’s transactional capabilities. If you call an external API, you need an idempotency key. You’re back to at-least-once with idempotency.
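
For a concrete feel of the machinery involved, here is a minimal sketch of that pattern using the confluent-kafka Python client (the kafka-python client used earlier does not support transactions). The topic names, the transactional.id, and process_record() are illustrative assumptions, not a drop-in implementation.

from confluent_kafka import Consumer, Producer

# Producer configured for transactions; the transactional.id must be
# stable and unique per producer instance
producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'enable.idempotence': True,           # Broker deduplicates producer retries
    'transactional.id': 'payments-processor-1',
})

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'payments-processor',
    'isolation.level': 'read_committed',  # Skip records from aborted transactions
    'enable.auto.commit': False,
})
consumer.subscribe(['payments'])

producer.init_transactions()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue

    producer.begin_transaction()
    try:
        result = process_record(msg.value())  # Hypothetical processing step
        producer.produce('payments-processed', result)

        # Commit the consumed offsets inside the same transaction, so the
        # output write and the offset advance succeed or fail together
        producer.send_offsets_to_transaction(
            consumer.position(consumer.assignment()),
            consumer.consumer_group_metadata(),
        )
        producer.commit_transaction()
    except Exception:
        producer.abort_transaction()

Note that this atomicity ends at Kafka's door: the produced record and the offsets ride in one transaction, but a write to an external database inside process_record() is not covered by it.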

Pro Tip

The pragmatic approach wins almost every time: use at-least-once with idempotent consumers. It covers nearly all of the failure scenarios that true exactly-once does, at a fraction of the implementation and operational complexity. Reserve true exactly-once infrastructure for critical payment or billing systems where that cost is justified.

Key Takeaways

  • At-most-once (fire and forget) loses messages but never duplicates. Use for non-critical data like metrics or logs.
  • At-least-once (with acks and retries) never loses messages but can deliver duplicates. Most common in production.
  • Idempotent processing (checking if a message was already processed using a unique key) effectively eliminates duplicates, making at-least-once safe for critical operations.
  • Exactly-once is theoretically impossible over unreliable networks (Two Generals’ Problem), but can be effectively achieved through careful architectural patterns like the transactional outbox.
  • Consumer offset management (auto-commit vs manual) directly impacts your loss/duplication exposure. Manual commit after processing is safer.
  • Different guarantees for different use cases: at-most-once for analytics, at-least-once for notifications, effectively exactly-once (outbox + idempotency) for payments.

Practice Scenarios

Scenario 1: E-Commerce Notification Service

You’re building a system that sends email notifications when orders are placed. Your message queue delivers notifications at-least-once. What happens if an email is sent twice to the customer? How would you detect and prevent this? Is deduplication necessary, or is at-most-once acceptable here?

Scenario 2: Inventory Management

An order message reaches your inventory service. The service needs to decrement the stock count in the database. The broker has at-least-once delivery. If the message is delivered twice, stock gets decremented twice, causing inventory to go negative. Design a solution that ensures exactly-once stock deduction using idempotency keys and database constraints.

Scenario 3: The Outbox Pattern

Implement the transactional outbox pattern for a payment processing system. Your producer writes to a payments table and an events outbox table in a single transaction. A CDC worker reads the outbox and publishes to Kafka. What happens if the CDC worker crashes after publishing but before marking the event as sent? Will you get duplicates?

Next Steps: Dead Letter Queues and Error Handling

Now that you understand delivery guarantees, the next challenge is: what happens when a message can’t be processed? Poisoned messages, malformed data, downstream service failures—these require dead letter queues and structured error handling. That’s where we turn next, building on the reliability foundation you’ve established here.