System Design Fundamentals

Message Queue Technologies

Message queues are the backbone of asynchronous communication in distributed systems. They decouple producers from consumers, enabling scalability and resilience. This reference covers the most popular options, their architectures, and when to choose each.

Apache Kafka

Kafka is a distributed event streaming platform that has become the de facto standard for high-volume, low-latency messaging. At its core, Kafka is an append-only commit log.

Architecture & Concepts:

Kafka organizes data into topics, which are divided into partitions. Each partition is an ordered, immutable sequence of messages. Producers write to topics, and consumers read from them via consumer groups. Within a consumer group, each partition is read by exactly one consumer, enabling both parallel processing and guaranteed ordering per partition.
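
As a rough illustration of the paragraph above, the sketch below models key-based partitioning and consumer-group assignment in memory. It is not Kafka client code: the CRC32 hash, the round-robin assignor, and the names partition_for and assign_partitions are assumptions made for illustration (Kafka's default partitioner uses a different hash function, murmur2).

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical topic size


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition. CRC32 is a stand-in for Kafka's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(consumers: list[str], num_partitions: int = NUM_PARTITIONS) -> dict[str, list[int]]:
    """Round-robin assignment: each partition gets exactly one owner in the group."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


# Append keyed events to per-partition logs (one ordered list per partition).
log = defaultdict(list)
events = [("user-1", "login"), ("user-2", "login"), ("user-1", "click"), ("user-1", "logout")]
for key, value in events:
    log[partition_for(key)].append((key, value))

assignment = assign_partitions(["consumer-a", "consumer-b"])
```

Because every event with the same key lands in the same partition, and each partition has a single owner within the group, one consumer sees all of user-1's events in the order they were written.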

Key Features:

  • Append-only commit log (immutable, highly efficient)
  • Partitioned topics for horizontal scaling
  • Consumer groups for parallel processing
  • High throughput (millions of messages per second)
  • Retention by time, by size, or by log compaction (messages are not deleted when they are consumed)
  • Replication for durability
  • Zero-copy transfer from the page cache to the network socket (avoids copying data through the broker application)
  • Ecosystem: Kafka Streams, Kafka Connect, Schema Registry
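
Log compaction, mentioned in the retention bullet above, can be pictured with a small sketch. This is a simplified in-memory model, not broker code: the Record type and the compact function are hypothetical names, and the sketch drops deleted keys immediately, whereas a real broker keeps delete markers (tombstones) for a configurable period first.

```python
from typing import Optional

Record = tuple[str, Optional[str]]  # (key, value); a None value is a tombstone


def compact(log: list[Record]) -> list[Record]:
    """Keep only the newest record per key, in original log order."""
    last_offset = {key: offset for offset, (key, _) in enumerate(log)}
    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if last_offset[key] == offset and value is not None
    ]


log = [
    ("user-1", "email=a@example.com"),
    ("user-2", "email=b@example.com"),
    ("user-1", "email=c@example.com"),
    ("user-2", None),  # tombstone: user-2 was deleted
]
compacted = compact(log)
```

After compaction only the latest value for user-1 survives, which is why a compacted topic can serve as a snapshot of current state per key.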

Delivery Semantics:

  • At-least-once (default): Messages may be reprocessed if a consumer fails
  • Exactly-once: Available within Kafka through the idempotent producer and transactions; side effects in external systems still need idempotent consumer logic
  • At-most-once: Messages may be lost (rarely used)
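
The difference between at-least-once and at-most-once largely comes down to whether the consumer commits its offset before or after processing. The following is a minimal simulation under that assumption; the consume function and its crash_at parameter are hypothetical and do not correspond to any Kafka client API.

```python
def consume(log, committed, commit_first, crash_at=None):
    """Process `log` starting at the committed offset.

    Returns (committed_offset, processed_messages). `crash_at` is the offset
    at which the consumer dies between its two steps (commit and process).
    """
    processed = []
    offset = committed
    while offset < len(log):
        if commit_first:
            committed = offset + 1           # at-most-once: commit, then process
            if offset == crash_at:
                return committed, processed  # died before processing: message lost
            processed.append(log[offset])
        else:
            processed.append(log[offset])    # at-least-once: process, then commit
            if offset == crash_at:
                return committed, processed  # died before commit: will be redone
            committed = offset + 1
        offset += 1
    return committed, processed


log = ["a", "b", "c"]

# At-least-once: crash after processing "b" but before committing it.
offset, first_run = consume(log, 0, commit_first=False, crash_at=1)
_, second_run = consume(log, offset, commit_first=False)
at_least_once = first_run + second_run

# At-most-once: crash after committing "b" but before processing it.
offset, first_run = consume(log, 0, commit_first=True, crash_at=1)
_, second_run = consume(log, offset, commit_first=True)
at_most_once = first_run + second_run
```

In the first case "b" is processed twice after the restart; in the second it is never processed at all.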

When to Use: When you need high throughput, when you want to replay messages, when you need multiple consumers of the same data, when building event-driven architectures or real-time analytics.

Typical Use Cases:

  • Event sourcing (maintaining an immutable event log)
  • Log aggregation (centralized logging across many services)
  • Stream processing (real-time transformations via Kafka Streams)
  • Change Data Capture (CDC) from databases
  • User activity tracking
  • Metrics collection

Considerations: Operational complexity is moderate to high. Requires cluster management, topic configuration, and consumer lag monitoring. Minimum viable setup is more complex than simple queues. Not ideal for low-latency RPC patterns.

Pro Tip: Use Kafka when you care about the history of your data. Use simpler queues when you just need work distribution.

RabbitMQ

RabbitMQ is a traditional message broker implementing the AMQP protocol. It’s older than Kafka but remains popular for its flexibility and rich routing capabilities.

Architecture & Concepts:

RabbitMQ routes messages through exchanges to queues. Producers send messages to exchanges, which route them based on the binding configuration. This separation of producers from the routing logic enables sophisticated messaging patterns.

Exchange Types:

  • Direct: Routes by exact routing key match (good for RPC, task queues)
  • Fanout: Routes to all bound queues (good for publish-subscribe)
  • Topic: Routes by wildcard pattern matching (good for topic-based subscriptions)
  • Headers: Routes by message headers (flexible but slower)
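
To make the topic exchange's wildcard rules concrete, here is a small matcher written as a sketch. It assumes the AMQP 0-9-1 conventions used by RabbitMQ, where routing keys are dot-separated words, "*" matches exactly one word, and "#" matches zero or more words. The function names are hypothetical and this is not how the broker implements matching internally.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if an AMQP-style topic pattern matches a routing key."""
    return _match(pattern.split("."), routing_key.split("."))


def _match(pattern: list[str], words: list[str]) -> bool:
    if not pattern:
        return not words  # pattern exhausted: match only if no words remain
    head, rest = pattern[0], pattern[1:]
    if head == "#":
        # '#' may absorb zero or more words; try every possible split.
        return any(_match(rest, words[i:]) for i in range(len(words) + 1))
    if not words:
        return False
    if head == "*" or head == words[0]:
        return _match(rest, words[1:])
    return False


bindings = {
    "eu-orders": "orders.eu.*",
    "all-orders": "orders.#",
    "created-events": "#.created",
}
routed_to = sorted(q for q, p in bindings.items() if topic_matches(p, "orders.eu.created"))
```

A message published with the routing key orders.eu.created would be copied to all three queues here, since each binding pattern matches it.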

Key Features:

  • Flexible exchange-based routing
  • Message acknowledgments and dead-letter queues
  • Priority queues
  • Per-queue TTL and message TTL
  • Clustering for high availability
  • Management UI
  • Plugins ecosystem (auth, federation, delayed exchanges)
  • Moderate throughput (typically tens to hundreds of thousands of messages per second, depending on durability settings and hardware)

When to Use: When you need complex routing patterns, when you want rich messaging semantics, when you prefer a traditional broker model.

Typical Use Cases:

  • Task queues (distributing work to workers)
  • RPC patterns (request-reply)
  • Event distribution with complex routing
  • Notification systems

Considerations: Lower throughput than Kafka. Brokers are more stateful, making clustering more complex. Not designed for high-volume event streaming or replaying data.

Pro Tip: RabbitMQ excels at task distribution. Use it when your consumers are workers and your producers are requesters.

Amazon SQS

SQS is AWS’s fully managed queue service. You pay per request and manage no infrastructure; AWS handles scaling and durability.

Queue Types:

Standard Queues:

  • At-least-once delivery (messages may be delivered multiple times)
  • Best-effort ordering (order not guaranteed)
  • Nearly unlimited throughput
  • Good for distributing work across many consumers

FIFO Queues:

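  • Exactly-once processing (duplicates sent within the 5-minute deduplication interval are dropped)
  • Strict first-in-first-out ordering
  • Limited to 300 API calls per second by default (up to 3,000 messages per second with batches of 10; high-throughput mode raises the limit further)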
  • Good when order and uniqueness matter
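
The deduplication behavior of FIFO queues can be approximated as a time window keyed by a deduplication ID. The class below is a toy model under that assumption; DedupWindow and its accept method are hypothetical names, and timestamps are passed in explicitly so the example is deterministic.

```python
class DedupWindow:
    """Accept a deduplication ID at most once per time window."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window_seconds = window_seconds
        self.first_seen: dict[str, float] = {}  # dedup ID -> time it was accepted

    def accept(self, dedup_id: str, now: float) -> bool:
        """Return True if the message should be enqueued, False if it is a duplicate."""
        seen_at = self.first_seen.get(dedup_id)
        if seen_at is not None and now - seen_at < self.window_seconds:
            return False  # same ID inside the window: drop silently
        self.first_seen[dedup_id] = now
        return True


window = DedupWindow(window_seconds=300)
results = [
    window.accept("order-42", now=0),    # first send is accepted
    window.accept("order-42", now=100),  # producer retry inside the window
    window.accept("order-42", now=301),  # window has expired
]
```

A retry with the same ID after the window has elapsed is accepted again, so consumers that must never act twice still benefit from idempotent handling.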

Key Features:

  • Fully managed (serverless)
  • Dead-letter queues (hold messages that repeatedly fail processing)
  • Visibility timeout (hides a received message from other consumers until it is deleted or the timeout expires)
  • Long polling (reduces API calls)
  • Integrates with AWS Lambda (triggers automatically)
  • Fine-grained access control (IAM)
  • No message replay (messages deleted after consumption)
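
The interaction between the visibility timeout, deletion, and the dead-letter queue is easier to follow in a small simulation. This is an in-memory sketch, not the SQS API: ToyQueue, its methods, and the max_receives parameter are hypothetical, and time is passed in explicitly instead of being read from a clock.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    visible_at: float = 0.0  # time at which the message may be received again
    receive_count: int = 0


class ToyQueue:
    """In-memory model of a queue with a visibility timeout and a dead-letter queue."""

    def __init__(self, visibility_timeout: float, max_receives: int) -> None:
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.messages: dict[int, Message] = {}
        self.dead_letters: list[str] = []
        self._ids = itertools.count(1)

    def send(self, body: str) -> int:
        msg_id = next(self._ids)
        self.messages[msg_id] = Message(body)
        return msg_id

    def receive(self, now: float):
        """Return (id, body) for one visible message, or None if none is visible."""
        for msg_id, msg in list(self.messages.items()):
            if msg.visible_at > now:
                continue  # still in flight with another consumer
            if msg.receive_count >= self.max_receives:
                # Too many failed attempts: move to the dead-letter queue.
                self.dead_letters.append(msg.body)
                del self.messages[msg_id]
                continue
            msg.receive_count += 1
            msg.visible_at = now + self.visibility_timeout
            return msg_id, msg.body
        return None

    def delete(self, msg_id: int) -> None:
        """Acknowledge successful processing by removing the message."""
        self.messages.pop(msg_id, None)


queue = ToyQueue(visibility_timeout=30, max_receives=2)
queue.send("resize image 42")
first = queue.receive(now=0)    # consumer A takes the message
hidden = queue.receive(now=10)  # consumer B sees nothing: message is in flight
second = queue.receive(now=31)  # A never deleted it, so it is redelivered
third = queue.receive(now=62)   # receive limit reached: moved to the DLQ
```

The redelivery at the third call is the reason at-least-once consumers need to be idempotent: a slow consumer that finishes after the timeout will have processed a message that someone else also received.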

When to Use: AWS-native architectures, when you want zero operational overhead, when you’re comfortable with limited ordering/delivery guarantees, when integrating with Lambda.

Typical Use Cases:

  • Decoupling AWS services (Lambda to Lambda, EC2 to EC2)
  • Serverless architectures with Lambda workers
  • Task scheduling
  • Email/notification queues

Considerations: No message replay (no event history). At-least-once delivery requires idempotent consumers. Standard queues can deliver duplicates and deliver messages out of order. Pricing can grow quickly at large scale.

Pro Tip: Use SQS for fire-and-forget work distribution. Use the SNS+SQS pattern for fan-out to multiple targets.

Amazon SNS

SNS is AWS’s pub/sub service. Producers publish to topics, and SNS delivers to multiple subscribers.

Subscribers can be:

  • SQS queues (SNS+SQS fan-out pattern)
  • Lambda functions (serverless processing)
  • HTTP/HTTPS endpoints (webhooks)
  • Email addresses
  • SMS numbers
  • Mobile push notifications

Key Features:

  • Fully managed
  • Fan-out to multiple subscribers
  • Message filtering (subscribers filter by attributes)
  • FIFO topics (strict ordering within a message group)
  • Message deduplication on FIFO topics (within a 5-minute window)
  • Integrates seamlessly with SQS
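
Message filtering can be sketched as a per-subscription policy checked against each message's attributes. The example below assumes the basic shape of an SNS filter policy, where every attribute named in the policy must match (AND) and any one of its listed values is enough (OR). It covers exact string matching only, and matches_policy is a hypothetical helper, not an SNS API call.

```python
def matches_policy(policy: dict[str, list[str]], attributes: dict[str, str]) -> bool:
    """All policy keys must be present (AND); any listed value may match (OR)."""
    return all(attributes.get(name) in allowed for name, allowed in policy.items())


subscriptions = {
    "eu-billing": {"event_type": ["order_created"], "region": ["eu-west-1", "eu-central-1"]},
    "audit-log": {},  # empty policy: receives every message
    "refunds": {"event_type": ["order_refunded"]},
}

message_attributes = {"event_type": "order_created", "region": "eu-west-1"}
delivered_to = sorted(
    name for name, policy in subscriptions.items()
    if matches_policy(policy, message_attributes)
)
```

Filtering at the topic keeps uninteresting messages out of a subscriber's queue entirely, rather than having each consumer receive and discard them.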

When to Use: When one message needs to reach multiple targets, when you want serverless pub/sub, when combining with SQS for resilience.

Typical Use Cases:

  • Fan-out from one service to many
  • Event notifications
  • Alerts and monitoring
  • Multi-channel notifications (email + SMS + push)

Considerations: No message retention. A subscriber that is unavailable misses the message once delivery retries are exhausted. Best used in combination with SQS for durability. Less flexible than RabbitMQ exchanges.

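The SNS+SQS Pattern: This is a common AWS architecture: SNS publishes to multiple SQS queues, each with its own consumer. This gives you fan-out (similar to several Kafka consumer groups reading one topic) with decoupled processing. Each queue buffers messages for its consumer, and failed messages are retried after the visibility timeout or moved to a dead-letter queue. Unlike Kafka, it does not let you replay messages that were already consumed and deleted.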
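
A minimal in-memory sketch of the fan-out pattern follows. It uses plain Python deques to stand in for SQS queues, and the Topic class is a hypothetical stand-in for an SNS topic; no AWS calls are involved.

```python
from collections import deque


class Topic:
    """Toy pub/sub topic that copies each message into every subscribed queue."""

    def __init__(self) -> None:
        self.queues: list[deque] = []

    def subscribe(self, queue: deque) -> None:
        self.queues.append(queue)

    def publish(self, message: str) -> None:
        for queue in self.queues:
            queue.append(message)


order_events = Topic()
billing_queue, email_queue = deque(), deque()
order_events.subscribe(billing_queue)
order_events.subscribe(email_queue)

order_events.publish("order-1 created")
order_events.publish("order-2 created")

# The billing consumer drains its queue; the email consumer is slow or offline.
billed = [billing_queue.popleft() for _ in range(len(billing_queue))]
```

The email queue still holds both messages after billing has finished, which is the durability benefit of putting a queue between the topic and each consumer.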

Apache Pulsar

Pulsar is a newer messaging platform designed to overcome Kafka’s limitations. It separates compute (brokers) from storage (BookKeeper), enabling horizontal scaling and multi-tenancy.

Key Features:

  • Separation of compute and storage (scales independently)
  • Multi-tenancy built-in (isolate workloads by tenant)
  • Both queue and pub/sub semantics in one system
  • Geo-replication (built-in across regions)
  • Effectively-once semantics (via broker-side deduplication and transactions)
  • Tiered storage (hot/warm/cold data)
  • Schema management
  • Functions (serverless compute in Pulsar)
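
The "queue and pub/sub in one system" bullet rests on Pulsar's subscription model: every subscription on a topic receives every message, while consumers attached to the same shared subscription split the messages between them. The sketch below models only that idea. The class names and the round-robin dispatch are assumptions for illustration, not the Pulsar client API.

```python
from itertools import cycle


class Subscription:
    """A named subscription whose consumers share the messages round-robin."""

    def __init__(self, name: str, consumers: list[str]) -> None:
        self.name = name
        self.received: dict[str, list[str]] = {c: [] for c in consumers}
        self._next_consumer = cycle(consumers)

    def deliver(self, message: str) -> None:
        self.received[next(self._next_consumer)].append(message)


class Topic:
    """Every subscription on the topic gets its own copy of each message."""

    def __init__(self) -> None:
        self.subscriptions: list[Subscription] = []

    def subscribe(self, subscription: Subscription) -> Subscription:
        self.subscriptions.append(subscription)
        return subscription

    def publish(self, message: str) -> None:
        for subscription in self.subscriptions:
            subscription.deliver(message)


topic = Topic()
analytics = topic.subscribe(Subscription("analytics", ["analytics-1"]))  # pub/sub style
workers = topic.subscribe(Subscription("workers", ["worker-1", "worker-2"]))  # queue style

for i in range(1, 5):
    topic.publish(f"m{i}")
```

With one consumer a subscription behaves like a pub/sub subscriber that sees everything; with several consumers it behaves like a work queue.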

When to Use: When you need Kafka-like scalability but want better multi-tenancy, when you want built-in geo-replication, when you need both queue and pub/sub semantics.

Typical Use Cases:

  • Large-scale event streaming in multi-tenant platforms
  • Geo-distributed event systems
  • Hybrid queue/pub-sub workloads

Considerations: Smaller community than Kafka. Operational complexity is moderate to high. Less operational knowledge available in the industry. Growing but not yet as battle-tested as Kafka.

Pro Tip: Pulsar is “Kafka done right” with hindsight. If you’re starting a new system and willing to manage complexity, Pulsar is worth evaluating.

NATS

NATS is a lightweight, high-performance pub/sub system designed for microservices and edge computing.

Key Features:

  • Minimal overhead (very fast, low latency)
  • Simple pub/sub model
  • JetStream for persistence (durability like Kafka)
  • Request-reply pattern (built-in RPC)
  • Subject-based routing (flexible like RabbitMQ)
  • Good for IoT and edge computing
  • Single binary (easy to deploy)
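
Subject-based routing in NATS uses dot-separated tokens with two wildcards: "*" matches exactly one token, and ">" matches one or more trailing tokens and may only appear at the end. The matcher below is a sketch of those rules with a hypothetical function name; it is not taken from a NATS client library.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style subject pattern matches a concrete subject."""
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")
    for i, token in enumerate(pattern_tokens):
        if token == ">":
            # '>' must be the last token and needs at least one token to absorb.
            return i == len(pattern_tokens) - 1 and len(subject_tokens) > i
        if i >= len(subject_tokens):
            return False  # pattern is longer than the subject
        if token != "*" and token != subject_tokens[i]:
            return False
    return len(pattern_tokens) == len(subject_tokens)


subscriptions = ["sensors.*.temperature", "sensors.>", "sensors.*"]
matched = [p for p in subscriptions if subject_matches(p, "sensors.kitchen.temperature")]
```

Note the contrast with RabbitMQ's "#" wildcard, which can also match zero words; here "sensors.>" does not match the bare subject "sensors".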

When to Use: Microservices on Kubernetes, IoT/edge scenarios, when you want simplicity with good performance.

Typical Use Cases:

  • Microservices communication
  • IoT device communication
  • Edge computing (proximity to application)
  • Simple event streaming (with JetStream)

Considerations: Smaller ecosystem than Kafka/RabbitMQ. Less widely used in enterprises. Request-reply adds latency vs pure pub/sub.

Pro Tip: NATS is excellent for microservices on Kubernetes. It’s simple, performant, and has minimal resource overhead.

Google Pub/Sub

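Google Pub/Sub is Google Cloud’s managed pub/sub service. It is similar to SNS but retains unacknowledged messages (7 days by default), so subscribers do not have to be online when a message is published.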

Key Features:

  • Fully managed
  • At-least-once delivery by default (exactly-once delivery is an opt-in setting on pull subscriptions)
  • Global service (publishers and subscribers in any region can use the same topic)
  • Serverless (auto-scaling)
  • Integration with Cloud Dataflow for stream processing
  • Snapshots and seek (save a subscription's acknowledgment state and rewind to it, which allows replay)
  • Message ordering (optional; uses ordering keys and is enabled per subscription)
  • Dead letter topics

When to Use: GCP-native architectures, when you want managed pub/sub with good durability, when building serverless applications on Google Cloud.

Typical Use Cases:

  • GCP service integration
  • Event streaming on Google Cloud
  • Serverless event processing (Cloud Functions)
  • Analytics on Google Cloud

Considerations: Google Cloud only. Smaller ecosystem than AWS SQS/SNS. Less community knowledge than open-source alternatives.

Message Queue Comparison Matrix

Technology | Model | Delivery Guarantee | Ordering | Throughput | Managed Option | Best For
--- | --- | --- | --- | --- | --- | ---
Kafka | Pub/sub (event stream) | At-least-once (exactly-once with transactions) | Per-partition | Millions/sec | Confluent Cloud, Amazon MSK | Event streaming, high volume, replay
RabbitMQ | Queue + Pub/sub | At-least-once | FIFO (per queue) | Hundreds of thousands/sec | CloudAMQP, Amazon MQ | Task queues, complex routing
SQS | Queue | At-least-once (standard) or Exactly-once (FIFO) | Best-effort (standard) or Strict (FIFO) | High (standard), limited (FIFO) | AWS SQS (fully managed) | AWS service decoupling, serverless
SNS | Pub/sub | At-least-once | Not guaranteed (standard) or Strict (FIFO topics) | High | AWS SNS (fully managed) | Fan-out to multiple targets
Pulsar | Queue + Pub/sub | At-least-once (effectively-once with deduplication or transactions) | Per-partition | Millions/sec | StreamNative Cloud | Event streaming, multi-tenancy, geo-replication
NATS | Pub/sub + Request-reply | At-most-once (core) or At-least-once (JetStream) | Per publisher (core), per stream (JetStream) | Very high | Synadia Cloud | Microservices, edge, low latency
Pub/Sub | Pub/sub | At-least-once | Optional (per-subscription) | High | Google Pub/Sub (fully managed) | GCP native, serverless

Decision Framework

Choose Kafka if:

  • You have high message volume (millions/second)
  • You want message replay and immutable event history
  • You need to scale consumers horizontally
  • You’re building event streaming or event sourcing
  • You can handle operational complexity

Choose RabbitMQ if:

  • You need complex routing patterns (topic-based, header-based)
  • You have traditional task queue workloads (workers processing jobs)
  • You prefer a proven, stable broker
  • Your message volume is moderate (very high sustained throughput is not its strength)

Choose SQS if:

  • You’re on AWS and want minimal operational overhead
  • You have intermittent, bursty workloads
  • You want integration with Lambda
  • You can tolerate duplicate and out-of-order delivery (standard queues)
  • You don’t need message replay

Choose SNS if:

  • One message needs to reach multiple distinct targets
  • You want fan-out to different services
  • You’re combining with SQS for durability

Choose Pulsar if:

  • You’re starting a new system that needs Kafka-level scale and you want compute and storage to scale independently
  • You need both queue and pub/sub semantics
  • Multi-tenancy is important
  • You’re comfortable with higher operational complexity than Kafka

Choose NATS if:

  • You’re building microservices on Kubernetes
  • Simplicity and low latency are priorities
  • You want a lightweight message broker
  • You’re in edge/IoT scenarios

Choose Google Pub/Sub if:

  • You’re on Google Cloud
  • You want a fully managed pub/sub service
  • Durability and deduplication are important

Delivery Guarantees Explained

At-most-once: Message may be lost. Producer sends once, no retries. Fast but not durable.

At-least-once: Message will reach the consumer at least once and may be delivered multiple times. Requires idempotent consumer logic (applying the same message several times yields the same result).

Exactly-once: Message is processed exactly once. Most expensive to implement, because it requires coordination between producer, broker, and consumer. Many systems that advertise it actually provide at-least-once delivery combined with deduplication or idempotent writes.

Idempotency is your friend. If your consumer can handle receiving the same message twice and produce the same result, you don’t need exactly-once semantics. This is often easier than building exactly-once.
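
One common way to get idempotency is to record the IDs of messages that have already been applied and skip repeats. The sketch below assumes each message carries a unique "id" field and keeps the seen-set in memory. Both are assumptions for illustration; a production consumer would typically store processed IDs in the same transaction as the state change.

```python
class IdempotentConsumer:
    """Applies each message at most once, keyed by the message's unique ID."""

    def __init__(self) -> None:
        self.processed_ids: set[str] = set()
        self.balances: dict[str, int] = {}

    def handle(self, message: dict) -> bool:
        """Apply the message; return False if it was a duplicate and was skipped."""
        if message["id"] in self.processed_ids:
            return False
        account = message["account"]
        self.balances[account] = self.balances.get(account, 0) + message["amount"]
        self.processed_ids.add(message["id"])
        return True


consumer = IdempotentConsumer()
deposit = {"id": "msg-1", "account": "alice", "amount": 50}

first_delivery = consumer.handle(deposit)
redelivery = consumer.handle(deposit)  # at-least-once delivery sends it again
```

The balance stays at 50 after the redelivery, so an at-least-once queue is sufficient for this consumer.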

Message Ordering Considerations

Global order: All messages processed in sequence. Limited to one consumer. Simplest but least scalable.

Partition/shard order: Messages within a partition maintain order. Different partitions process in parallel. Kafka, Pulsar, and some RabbitMQ setups provide this.

No ordering guarantee: Messages may arrive out of order. SQS standard queues and most pub/sub systems offer this. Fastest.

Choose based on your requirements. Most systems don’t need global ordering and scale better with partition-level ordering.

Key Takeaways

  • Kafka is the standard for event streaming and high-volume messaging. Use it when message history matters and you need to replay data.
  • RabbitMQ remains excellent for task queues and complex routing patterns. Choose it for traditional broker workloads.
  • SQS is the AWS default for decoupling services and serverless architectures. Near-zero operational overhead, but standard queues offer only at-least-once delivery and best-effort ordering.
  • SNS provides fan-out to multiple targets. Often used with SQS for durability and decoupling.
  • Pulsar is Kafka’s spiritual successor with better design. Consider it for new systems if you’re willing to manage complexity.
  • NATS is the lightweight option for microservices and edge computing.
  • Google Pub/Sub is the GCP counterpart to SQS/SNS.

For most new systems, Kafka or managed cloud queues (SQS, Pub/Sub) are safe choices. RabbitMQ is proven for task distribution. NATS is excellent for microservices. Choose based on scale, ordering requirements, and operational comfort.