Message Queue Technologies

Message queues are the backbone of asynchronous communication in distributed systems. They decouple producers from consumers, enabling scalability and resilience. This reference covers the most popular options, their architectures, and when to choose each.

Apache Kafka

Kafka is a distributed event streaming platform that has become the de facto standard for high-volume, low-latency messaging. At its core, Kafka is an append-only commit log.

Architecture & Concepts:

Kafka organizes data into topics, which are divided into partitions. Each partition is an ordered, immutable sequence of messages. Producers write to topics, and consumers read from them via consumer groups. Within a consumer group, each partition is read by exactly one consumer, enabling both parallel processing and guaranteed ordering per partition.
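
The partition and consumer-group model above can be sketched in a few lines. This is an in-memory illustration, not Kafka client code: the partition count, the function names, and the use of CRC32 are assumptions made for the example (the Java client hashes keys with murmur2 by default), and real group assignment is negotiated through the group coordinator rather than computed locally.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count for a single topic


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a message key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(consumers, num_partitions=NUM_PARTITIONS):
    """Round-robin assignment: each partition gets exactly one owner in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Append keyed events to per-partition logs; each list is an ordered, append-only log.
log = defaultdict(list)
events = [("user-1", "login"), ("user-2", "login"), ("user-1", "click"), ("user-1", "logout")]
for key, value in events:
    log[partition_for(key)].append((key, value))

# Two consumers in one group split the four partitions between them.
assignment = assign_partitions(["consumer-a", "consumer-b"])
```

Because every event for a given key lands in the same partition, the consumer that owns that partition sees those events in the order they were written, while other partitions are processed in parallel.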

Key Features:

Append-only commit log (immutable, highly efficient)
Partitioned topics for horizontal scaling
Consumer groups for parallel processing
High throughput (millions of messages per second)
Message retention based on time, size, or compacted logs (not consumption-based)
Replication for durability
Zero-copy transfer from the page cache to the network socket (low CPU overhead when serving consumers)
Ecosystem: Kafka Streams, Kafka Connect, Schema Registry
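
To make the "compacted logs" retention option more concrete, here is a minimal sketch of what compaction keeps: the latest record for each key, in original order. The `compact` function and the sample records are hypothetical. Real Kafka compaction runs in the background on closed log segments and also handles tombstones (null values that delete a key), which this sketch leaves out.

```python
def compact(log):
    """Keep only the most recent record per key, preserving log order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later offsets overwrite earlier ones
    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset
    ]


# Hypothetical changelog of user email addresses, keyed by user ID.
changelog = [
    ("user-1", "a@example.com"),
    ("user-2", "b@example.com"),
    ("user-1", "a2@example.com"),
]
compacted = compact(changelog)
```

A consumer that replays the compacted log still ends up with the current value for every key, which is why compaction suits changelog-style topics.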

Delivery Semantics:

At-least-once (default): Messages may be reprocessed if a consumer fails
Exactly-once: Available within Kafka via idempotent producers and transactions; side effects outside Kafka still need idempotent or transactional handling
At-most-once: Messages may be lost (rarely used)
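
In practice, the difference between at-least-once and at-most-once comes down to whether the consumer commits its offset before or after processing. The simulation below is a sketch under that assumption. The function names and the `crash_at` parameter are invented for illustration, and no real Kafka client is involved.

```python
def run_consumer(log, start_offset, processed, commit_first, crash_at=None):
    """Consume from start_offset and return the last committed offset.

    crash_at simulates a failure between the commit and process steps
    for the message at that offset.
    """
    committed = start_offset
    for offset in range(start_offset, len(log)):
        if commit_first:
            committed = offset + 1         # commit, then process
            if offset == crash_at:
                return committed           # crash before processing: message is lost
            processed.append(log[offset])
        else:
            processed.append(log[offset])  # process, then commit
            if offset == crash_at:
                return committed           # crash before commit: message is redelivered
            committed = offset + 1
    return committed


def run_with_restart(log, commit_first):
    """Run once with a crash at offset 1, then restart from the committed offset."""
    processed = []
    committed = run_consumer(log, 0, processed, commit_first, crash_at=1)
    run_consumer(log, committed, processed, commit_first)
    return processed


messages = ["m0", "m1", "m2"]
at_most_once = run_with_restart(messages, commit_first=True)    # m1 is skipped
at_least_once = run_with_restart(messages, commit_first=False)  # m1 is processed twice
```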

When to Use: When you need high throughput, when you want to replay messages, when you need multiple consumers of the same data, when building event-driven architectures or real-time analytics.

Typical Use Cases:

Event sourcing (maintaining an immutable event log)
Log aggregation (centralized logging across many services)
Stream processing (real-time transformations via Kafka Streams)
Change Data Capture (CDC) from databases
User activity tracking
Metrics collection

Considerations: Operational complexity is moderate to high. Requires cluster management, topic configuration, and consumer lag monitoring. Minimum viable setup is more complex than simple queues. Not ideal for low-latency RPC patterns.

Pro Tip: Use Kafka when you care about the history of your data. Use simpler queues when you just need work distribution.

RabbitMQ

RabbitMQ is a traditional message broker whose core protocol is AMQP 0-9-1. It’s older than Kafka but remains popular for its flexibility and rich routing capabilities.

Architecture & Concepts:

RabbitMQ routes messages through exchanges to queues. Producers send messages to exchanges, which route them based on the binding configuration. This separation of producers from the routing logic enables sophisticated messaging patterns.

Exchange Types:

Direct: Routes by exact routing key match (good for RPC, task queues)
Fanout: Routes to all bound queues (good for publish-subscribe)
Topic: Routes by wildcard pattern matching (good for topic-based subscriptions)
Headers: Routes by message headers (flexible but slower)
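
The routing rules for the first three exchange types can be sketched as plain functions. This is an in-memory illustration rather than RabbitMQ client code; `topic_matches` and `route` are hypothetical names. The wildcard semantics follow AMQP 0-9-1 topic exchanges: `*` matches exactly one dot-separated word and `#` matches zero or more words.

```python
def topic_matches(pattern, routing_key):
    """Return True if a topic binding pattern matches a routing key."""
    p = pattern.split(".")
    k = routing_key.split(".") if routing_key else []

    def match(i, j):
        if i == len(p):
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb zero or more of the remaining words.
            return any(match(i + 1, n) for n in range(j, len(k) + 1))
        if j < len(k) and (p[i] == "*" or p[i] == k[j]):
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


def route(exchange_type, bindings, routing_key):
    """Return the queues that receive a message.

    bindings is a list of (queue_name, binding_key) pairs.
    """
    if exchange_type == "fanout":
        return [queue for queue, _ in bindings]
    if exchange_type == "direct":
        return [queue for queue, key in bindings if key == routing_key]
    if exchange_type == "topic":
        return [queue for queue, key in bindings if topic_matches(key, routing_key)]
    raise ValueError(f"unsupported exchange type in this sketch: {exchange_type}")


bindings = [("eu_orders", "orders.eu.*"), ("all_orders", "orders.#"), ("audit", "#")]
targets = route("topic", bindings, "orders.eu.created")
```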

Key Features:

Flexible exchange-based routing
Message acknowledgments and dead-letter queues
Priority queues
Per-queue TTL and message TTL
Clustering for high availability
Management UI
Plugins ecosystem (auth, federation, delayed exchanges)
Reasonable throughput (tens to hundreds of thousands of messages per second, depending on configuration)

When to Use: When you need complex routing patterns, when you want rich messaging semantics, when you prefer a traditional broker model.

Typical Use Cases:

Task queues (distributing work to workers)
RPC patterns (request-reply)
Event distribution with complex routing
Notification systems

Considerations: Lower throughput than Kafka. Brokers are more stateful, making clustering more complex. Not designed for high-volume event streaming or replaying data.

Pro Tip: RabbitMQ excels at task distribution. Use it when your consumers are workers and your producers are requesters.

Amazon SQS

SQS is AWS’s fully managed queue service. You pay per request, don’t manage infrastructure, and AWS handles scaling and durability.

Queue Types:

Standard Queues:

At-least-once delivery (messages may be delivered multiple times)
Best-effort ordering (order not guaranteed)
Nearly unlimited throughput
Good for distributed work distribution

FIFO Queues:

Exactly-once processing (no duplicates)
Strict first-in-first-out ordering
Limited to 300 API calls per second per action by default (up to 3,000 messages per second with batches of 10; high-throughput mode raises the limit further)
Good when order and uniqueness matter

Key Features:

Fully managed (serverless)
Dead letter queues (messages that can’t be processed)
Message visibility timeout (hides an in-flight message from other consumers until it is deleted or the timeout expires)
Long polling (reduces API calls)
Integrates with AWS Lambda (triggers automatically)
Fine-grained access control (IAM)
No message replay (messages deleted after consumption)
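
The interaction between the visibility timeout and dead letter queues is easier to see in a small simulation. The `SimQueue` class below is a hypothetical in-memory stand-in, not the AWS SDK. It uses a logical clock (`now`) instead of real time and assumes a redrive limit like the `maxReceiveCount` setting on a real queue.

```python
class SimQueue:
    """In-memory queue with a visibility timeout and a receive limit."""

    def __init__(self, visibility_timeout, max_receive_count):
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.messages = {}      # id -> {"body", "visible_at", "receives"}
        self.dead_letters = []  # bodies that exceeded the receive limit
        self._next_id = 0

    def send(self, body, now=0):
        self.messages[self._next_id] = {"body": body, "visible_at": now, "receives": 0}
        self._next_id += 1

    def receive(self, now):
        """Return (id, body) for the next visible message, or None."""
        for msg_id, msg in list(self.messages.items()):
            if msg["visible_at"] > now:
                continue  # still in flight with another consumer
            if msg["receives"] >= self.max_receive_count:
                # Too many failed attempts: move to the dead letter queue.
                self.dead_letters.append(msg["body"])
                del self.messages[msg_id]
                continue
            msg["receives"] += 1
            msg["visible_at"] = now + self.visibility_timeout
            return msg_id, msg["body"]
        return None

    def delete(self, msg_id):
        """Acknowledge successful processing."""
        self.messages.pop(msg_id, None)


queue = SimQueue(visibility_timeout=30, max_receive_count=2)
queue.send("resize-image-42")
first = queue.receive(now=0)    # delivered, then hidden until t=30
hidden = queue.receive(now=10)  # nothing visible: message is in flight
second = queue.receive(now=30)  # never deleted, so it reappears
third = queue.receive(now=60)   # receive limit reached: dead-lettered
```

A consumer that calls `delete` before the timeout expires removes the message for good; a consumer that crashes simply lets the timeout lapse, which is what makes delivery at-least-once.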

When to Use: AWS-native architectures, when you want zero operational overhead, when you’re comfortable with limited ordering/delivery guarantees, when integrating with Lambda.

Typical Use Cases:

Decoupling AWS services (Lambda to Lambda, EC2 to EC2)
Serverless architectures with Lambda workers
Task scheduling
Email/notification queues

Considerations: No message replay (no event history). At-least-once delivery requires idempotent consumers. Standard queues may also deliver messages out of order. Pricing can grow quickly at large scale.

Pro Tip: Use SQS for fire-and-forget work distribution. Use SNS+SQS pattern for fan-out to multiple targets.

Amazon SNS

SNS is AWS’s pub/sub service. Producers publish to topics, and SNS delivers to multiple subscribers.

Subscribers can be:

SQS queues (SNS+SQS fan-out pattern)
Lambda functions (serverless processing)
HTTP/HTTPS endpoints (webhooks)
Email addresses
SMS numbers
Mobile push notifications

Key Features:

Fully managed
Fan-out to multiple subscribers
Message filtering (subscribers filter by attributes)
Message deduplication (FIFO topics only, within a 5-minute window)
FIFO topics (strict ordering and exactly-once)
Integrates seamlessly with SQS

When to Use: When one message needs to reach multiple targets, when you want serverless pub/sub, when combining with SQS for resilience.

Typical Use Cases:

Fan-out from one service to many
Event notifications
Alerts and monitoring
Multi-channel notifications (email + SMS + push)

Considerations: SNS does not retain messages. It retries failed deliveries, but subscribers cannot fetch messages later. Best used in combination with SQS for durability. Less flexible than RabbitMQ exchanges.

The SNS+SQS Pattern: This is a common AWS architecture: SNS publishes to multiple SQS queues, each with its own consumer. This gives you fan-out (like multiple consumer groups on a Kafka topic) with decoupled processing. SQS adds buffering and retries through the visibility timeout and dead-letter queues, though not Kafka-style replay of already-consumed messages.
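
A minimal sketch of the fan-out pattern with attribute filtering follows. The `Topic` class and the plain lists standing in for queues are hypothetical, and the filter logic covers only exact string matching, a small subset of what real SNS filter policies support.

```python
class Topic:
    """In-memory topic that copies each message to every matching subscriber queue."""

    def __init__(self):
        self.subscriptions = []  # (queue, filter_policy) pairs

    def subscribe(self, queue, filter_policy=None):
        self.subscriptions.append((queue, filter_policy or {}))

    def publish(self, body, attributes=None):
        attributes = attributes or {}
        for queue, policy in self.subscriptions:
            # Deliver only if every policy key is present on the message
            # and its value is one of the allowed values.
            if all(attributes.get(key) in allowed for key, allowed in policy.items()):
                queue.append(body)


billing_queue, analytics_queue = [], []
topic = Topic()
topic.subscribe(billing_queue, {"event_type": ["order_placed"]})
topic.subscribe(analytics_queue)  # no filter: receives everything

topic.publish("order 1 placed", {"event_type": "order_placed"})
topic.publish("order 1 shipped", {"event_type": "order_shipped"})
```

Each queue holds its own copy, so a slow or failing billing consumer does not affect analytics.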

Apache Pulsar

Pulsar is a newer messaging platform designed to overcome Kafka’s limitations. It separates compute (brokers) from storage (BookKeeper), enabling horizontal scaling and multi-tenancy.

Key Features:

Separation of compute and storage (scales independently)
Multi-tenancy built-in (isolate workloads by tenant)
Both queue and pub/sub semantics in one system
Geo-replication (built-in across regions)
Effectively-once semantics (via broker-side deduplication and transactions)
Tiered storage (hot/warm/cold data)
Schema management
Functions (serverless compute in Pulsar)
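
The "both queue and pub/sub semantics" point can be illustrated with subscriptions: every subscription on a topic sees every message (pub/sub), while consumers sharing one subscription split the messages between them (queue). The classes below are a hypothetical in-memory sketch of that idea, roughly corresponding to Pulsar's Shared subscription type. Real dispatch also depends on consumer flow control and acknowledgments, which are omitted here.

```python
from itertools import cycle


class Subscription:
    """A named subscription whose consumers share messages round-robin."""

    def __init__(self, name, consumers):
        self.name = name
        self.received = {consumer: [] for consumer in consumers}
        self._next_consumer = cycle(consumers)

    def deliver(self, message):
        self.received[next(self._next_consumer)].append(message)


class SimTopic:
    """Every subscription gets its own copy of each published message."""

    def __init__(self):
        self.subscriptions = []

    def publish(self, message):
        for subscription in self.subscriptions:
            subscription.deliver(message)


audit = Subscription("audit", ["auditor"])             # one consumer: sees everything
workers = Subscription("workers", ["worker-1", "worker-2"])  # two consumers: split the load

topic = SimTopic()
topic.subscriptions.extend([audit, workers])
for message in ["m1", "m2", "m3", "m4"]:
    topic.publish(message)
```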

When to Use: When you need Kafka-like scalability but want better multi-tenancy, when you want built-in geo-replication, when you need both queue and pub/sub semantics.

Typical Use Cases:

Large-scale event streaming in multi-tenant platforms
Geo-distributed event systems
Hybrid queue/pub-sub workloads

Considerations: Smaller community than Kafka. Operational complexity is moderate to high. Less operational knowledge available in the industry. Growing but not yet as battle-tested as Kafka.

Pro Tip: Pulsar is “Kafka done right” with hindsight. If you’re starting a new system and willing to manage complexity, Pulsar is worth evaluating.

NATS

NATS is a lightweight, high-performance pub/sub system designed for microservices and edge computing.

Key Features:

Minimal overhead (very fast, low latency)
Simple pub/sub model
JetStream for persistence (durability like Kafka)
Request-reply pattern (built-in RPC)
Subject-based routing (flexible like RabbitMQ)
Good for IoT and edge computing
Single binary (easy to deploy)
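
Request-reply in NATS is built on plain pub/sub: the requester subscribes to a unique inbox subject and publishes the request with that subject as the reply address. The `Bus` class below is a hypothetical, synchronous, in-memory sketch of that mechanism with exact-match subjects only. A real client works over the network, supports wildcards, and applies a timeout when no responder answers.

```python
import uuid
from collections import defaultdict


class Bus:
    """Tiny synchronous pub/sub bus with an optional reply subject."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, subject, handler):
        self.handlers[subject].append(handler)

    def publish(self, subject, data, reply_to=None):
        for handler in list(self.handlers[subject]):
            handler(data, reply_to)

    def request(self, subject, data):
        """Publish a request and return the first reply, or None if nobody answers."""
        inbox = f"_INBOX.{uuid.uuid4().hex}"  # unique, single-use reply subject
        replies = []
        self.subscribe(inbox, lambda reply, _reply_to: replies.append(reply))
        self.publish(subject, data, reply_to=inbox)
        del self.handlers[inbox]
        return replies[0] if replies else None


bus = Bus()
# Hypothetical responder service: doubles the number it receives.
bus.subscribe("math.double", lambda data, reply_to: bus.publish(reply_to, data * 2))

result = bus.request("math.double", 21)
missing = bus.request("math.unknown", 1)  # no responder subscribed
```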

When to Use: Microservices on Kubernetes, IoT/edge scenarios, when you want simplicity with good performance.

Typical Use Cases:

Microservices communication
IoT device communication
Edge computing (proximity to application)
Simple event streaming (with JetStream)

Considerations: Smaller ecosystem than Kafka/RabbitMQ. Less widely used in enterprises. Request-reply adds latency vs pure pub/sub.

Pro Tip: NATS is excellent for microservices on Kubernetes. It’s simple, performant, and has minimal resource overhead.

Google Pub/Sub

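Pub/Sub is Google Cloud’s managed pub/sub service. It is similar to SNS, but it retains unacknowledged messages (7 days by default), so subscribers do not need to be online when a message is published.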

Key Features:

Fully managed
At-least-once delivery by default (exactly-once delivery is optional for pull subscriptions)
Global by default (publishers and subscribers in any region can use the same topic)
Serverless (auto-scaling)
Integration with Cloud Dataflow for stream processing
Snapshots (save and restore subscriber position)
Message ordering (optional, enabled per subscription and applied per ordering key)
Dead letter topics

When to Use: GCP-native architectures, when you want managed pub/sub with good durability, when building serverless applications on Google Cloud.

Typical Use Cases:

GCP service integration
Event streaming on Google Cloud
Serverless event processing (Cloud Functions)
Analytics on Google Cloud

Considerations: Google Cloud only. Smaller ecosystem than AWS SQS/SNS. Less community knowledge than open-source alternatives.

Message Queue Comparison Matrix

Technology	Model	Delivery Guarantee	Ordering	Throughput	Managed Option	Best For
Kafka	Pub/sub (event stream)	At-least-once	Per-partition	Millions/sec	Confluent Cloud, AWS MSK	Event streaming, high volume, replay
RabbitMQ	Queue + Pub/sub	At-least-once	FIFO (per queue)	Hundreds K/sec	CloudAMQP, Amazon MQ	Task queues, complex routing
SQS	Queue	At-least-once (standard) or Exactly-once (FIFO)	Best-effort (standard) or Strict (FIFO)	High (standard), limited (FIFO)	AWS SQS (fully managed)	AWS service decoupling, serverless
SNS	Pub/sub	At-least-once	Not guaranteed	High	AWS SNS (fully managed)	Fan-out to multiple targets
Pulsar	Queue + Pub/sub	At-least-once (effectively-once with deduplication or transactions)	Per-partition	Millions/sec	StreamNative Cloud	Event streaming, multi-tenancy, geo-replication
NATS	Pub/sub + Request-reply	At-most-once (core) or At-least-once (JetStream)	Not guaranteed	Very high	Synadia Cloud	Microservices, edge, low latency
Pub/Sub	Pub/sub	At-least-once	Optional (per-subscription)	High	Google Pub/Sub (fully managed)	GCP native, serverless

Decision Framework

Choose Kafka if:

You have high message volume (millions/second)
You want message replay and immutable event history
You need to scale consumers horizontally
You’re building event streaming or event sourcing
You can handle operational complexity

Choose RabbitMQ if:

You need complex routing patterns (topic-based, header-based)
You have traditional task queue workloads (workers processing jobs)
You prefer a proven, stable broker
Your message volume is moderate (very high sustained throughput is possible but not its strength)

Choose SQS if:

You’re on AWS and want minimal operational overhead
You have intermittent, bursty workloads
You want integration with Lambda
You’re comfortable with eventual consistency
You don’t need message replay

Choose SNS if:

One message needs to reach multiple distinct targets
You want fan-out to different services
You’re combining with SQS for durability

Choose Pulsar if:

You’re starting a new system that needs Kafka-scale but want better design
You need both queue and pub/sub semantics
Multi-tenancy is important
You’re comfortable with higher operational complexity than Kafka

Choose NATS if:

You’re building microservices on Kubernetes
Simplicity and low latency are priorities
You want a lightweight message broker
You’re in edge/IoT scenarios

Choose Google Pub/Sub if:

You’re on Google Cloud
You want a fully managed pub/sub service
Durability and deduplication are important

Delivery Guarantees Explained

At-most-once: Message may be lost. Producer sends once, no retries. Fast but not durable.

At-least-once: Message will reach the consumer at least once. May be delivered multiple times. Requires idempotent consumer logic (applying the same message several times gives the same result as applying it once).

Exactly-once: Message delivered exactly once. Most expensive to implement (requires coordination between producer and consumer). Some systems claim this but really provide idempotent at-least-once.

Idempotency is your friend. If your consumer can handle receiving the same message twice and produce the same result, you don’t need exactly-once semantics. This is often easier than building exactly-once.
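
One common way to get idempotency is to record the IDs of messages already applied and skip repeats. The sketch below assumes each message carries a unique `id` field; the class and field names are hypothetical. In a real system the seen-ID record and the state change would need to be stored in the same transaction, otherwise a crash between the two steps reintroduces duplicates or losses.

```python
class IdempotentConsumer:
    """Applies each message at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.seen_ids = set()
        self.balance = 0

    def handle(self, message):
        """Apply the message and return True, or return False for a duplicate."""
        if message["id"] in self.seen_ids:
            return False  # already applied: safe to acknowledge and move on
        self.balance += message["amount"]
        self.seen_ids.add(message["id"])
        return True


consumer = IdempotentConsumer()
deliveries = [
    {"id": "tx-1", "amount": 50},
    {"id": "tx-2", "amount": 25},
    {"id": "tx-1", "amount": 50},  # redelivery of the first message
]
results = [consumer.handle(message) for message in deliveries]
```

With this in place, an at-least-once queue produces the same final state as an exactly-once one would.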

Message Ordering Considerations

Global order: All messages processed in sequence. Limited to one consumer. Simplest but least scalable.

Partition/shard order: Messages within a partition maintain order. Different partitions process in parallel. Kafka, Pulsar, and some RabbitMQ setups provide this.

No ordering guarantee: Messages may arrive out of order. SQS standard queues and most pub/sub systems offer this. Fastest.

Choose based on your requirements. Most systems don’t need global ordering and scale better with partition-level ordering.

Key Takeaways

Kafka is the standard for event streaming and high-volume messaging. Use it when message history matters and you need to replay data.
RabbitMQ remains excellent for task queues and complex routing patterns. Choose it for traditional broker workloads.
SQS is the AWS default for decoupling services and serverless architectures. Zero operational overhead but limited guarantees.
SNS provides fan-out to multiple targets. Often used with SQS for durability and decoupling.
Pulsar is Kafka’s spiritual successor with better design. Consider it for new systems if you’re willing to manage complexity.
NATS is the lightweight option for microservices and edge computing.
Google Pub/Sub is the GCP equivalent to SQS/SNS.

For most new systems, Kafka or managed cloud queues (SQS, Pub/Sub) are safe choices. RabbitMQ is proven for task distribution. NATS is excellent for microservices. Choose based on scale, ordering requirements, and operational comfort.