Chat Application
The Problem We’re Solving
Your phone buzzes. A message arrives. Instantly. Across the globe, you see your friend’s typing indicator and within milliseconds receive their reply. Chat applications like WhatsApp and Slack make this feel effortless, but behind the scenes the system is coordinating hundreds of millions of concurrent connections, guaranteeing message ordering, managing online/offline state, and ensuring not a single message is lost.
Designing a real-time chat system is one of the most challenging infrastructure problems at scale. We’re building a system that supports 500M users, 100M concurrent connections, and must deliver messages in under 100ms. Oh, and end-to-end encryption too.
Functional Requirements
Let’s define what we’re building:
- 1:1 messaging between any two users
- Group chats with up to 500 members
- Real-time message delivery — messages appear on recipient’s device immediately if online
- Message history — retrieve conversation history and search past messages
- Presence detection — online/offline status, last seen timestamp
- Delivery status — per-message receipts (sent, delivered, read) with timestamps
- Media sharing — images, files up to 100MB, voice messages
- Push notifications — notify offline users when they receive messages
- Typing indicators — show when someone is typing
Non-Functional Requirements
- Message delivery latency: under 100ms for online recipients
- Scale: 500M users, 100M concurrent connections
- Durability: at-least-once delivery — no message loss
- Message ordering: within a single conversation, messages appear in order
- Availability: 99.99% uptime
- Consistency: eventual consistency acceptable for presence, strong consistency for message state
Scale Estimation
Let’s run the numbers:
Message volume:
- 500M total users, 100M concurrent (20% of the user base online at once)
- Average user sends 40 messages/day
- 100M × 40 = 4 billion messages/day
- 4B ÷ 86,400 seconds = ~46,000 messages/second (steady state)
- Peak load (3x) = ~140,000 messages/second
Concurrent connections:
- 100M users online simultaneously
- Each maintaining a persistent WebSocket connection
- Memory per connection (rough): 1–5KB metadata, so 100M × 3KB = 300GB total memory across connection servers
Storage:
- Message retention: 1 year for all messages
- 4B messages/day × 365 days = 1.46 trillion messages
- Average message size: 500 bytes
- Total: ~730TB of message storage
| Metric | Value | Notes |
|---|---|---|
| Steady-state message QPS | 46K | Peak ~140K/sec (3x multiplier) |
| Concurrent connections | 100M | Each user with persistent WebSocket |
| Annual message volume | 1.46T | At 40 messages/user/day |
| Message storage | 730TB | 1 year retention, 500B average |
| Group conversations avg size | 8–10 members | Impacts fan-out costs |
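If you want to sanity-check these figures, the arithmetic fits in a few lines of JavaScript (the 20% concurrency and 3× peak multiplier are the assumptions stated above):

// Back-of-envelope check of the numbers in the table
const users = 500e6;
const concurrent = users * 0.2;                      // 100M online at once
const msgsPerUserPerDay = 40;

const msgsPerDay = concurrent * msgsPerUserPerDay;   // 4e9 messages/day
const steadyQps = msgsPerDay / 86400;                // ~46,300 messages/sec
const peakQps = steadyQps * 3;                       // ~139,000 messages/sec

const bytesPerMsg = 500;
const yearlyMsgs = msgsPerDay * 365;                 // ~1.46e12 messages
const storageTB = (yearlyMsgs * bytesPerMsg) / 1e12; // ~730 TB

console.log({ steadyQps, peakQps, yearlyMsgs, storageTB });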
High-Level Architecture
graph TB
Client["Mobile/Web Client"]
WSGateway["WebSocket Gateway<br/>(Connection Pool)"]
ChatService["Chat Service<br/>(Message Logic)"]
MsgQueue["Message Queue<br/>(Kafka)"]
MsgStore["Message Storage<br/>(Cassandra)"]
Presence["Presence Store<br/>(Redis)"]
MediaService["Media Service<br/>(S3 + CDN)"]
PushService["Push Notification<br/>Service"]
Client -->|WebSocket| WSGateway
WSGateway -->|Validate &<br/>Persist| ChatService
ChatService -->|Publish| MsgQueue
MsgQueue -->|Consume| ChatService
ChatService -->|Store msg| MsgStore
ChatService -->|Send to recipient| WSGateway
WSGateway -->|Update| Presence
PushService -->|Query| Presence
Client -->|Upload| MediaService
PushService -->|iOS/Android| Client
The architecture hinges on three paths:
- Message ingestion: client sends message to WebSocket gateway
- Message delivery: ChatService routes to recipient’s gateway
- Offline handling: if recipient offline, notify via push service
Deep Dive: WebSocket Connection Management
At 100M concurrent connections, no single server can handle the load. We need to distribute connections across a fleet:
graph LR
A["Client A"]
B["Client B"]
C["Client C"]
A -->|ws://server-1| GW1["Gateway Server 1<br/>(20M conn)"]
B -->|ws://server-2| GW2["Gateway Server 2<br/>(20M conn)"]
C -->|ws://server-3| GW3["Gateway Server 3<br/>(20M conn)"]
GW1 -.->|Find Client B| Registry["Connection<br/>Registry<br/>(Redis)"]
Registry -.->|Found: server-2| GW1
GW1 -->|Route msg| GW2
GW2 -->|Deliver| B
Connection Registry in Redis: When Client A connects to Server 1, we store `user_id -> {server_id: 1, connection_id: 'abc123'}` in Redis. When a message for Client B arrives, we:
- Query Redis: “Where is user B?”
- Discover B is on Server 2
- Route the message via internal RPC or message queue (sketched below)
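A minimal sketch of that lookup-and-route step, assuming the registry key layout above; `rpc.call` is a hypothetical stand-in for whatever internal transport connects gateways:

// Look up the recipient's gateway in Redis and forward the message
async function routeMessage(recipientId, message) {
  const raw = await redis.get(`user:${recipientId}:session`);
  if (!raw) {
    // No active session: fall back to the offline/push path
    return pushQueue.enqueue({ user_id: recipientId, message_id: message.id });
  }
  const { server_id, connection_id } = JSON.parse(raw);
  // Internal RPC (or a per-gateway queue) to the server holding the socket
  await rpc.call(server_id, 'deliver', { connection_id, message });
}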
Sticky Sessions: To minimize re-routing and maintain connection affinity, we use sticky sessions in the load balancer. The risk is concentrated failure: when a gateway server dies, every connection it held drops at once. The registry entries for that server expire (or are removed by health checks), and clients reconnect to a healthy gateway.
Heartbeat and Timeout: Clients send a heartbeat every 30 seconds. If the server doesn’t receive one for 60 seconds, it marks the user offline. This detects network failures and closed connections gracefully.
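A server-side sketch of this heartbeat loop, assuming the Node `ws` library’s built-in ping/pong (intervals match the 30s/60s numbers above):

// Ping every 30s; a socket that misses two intervals is terminated
// and its user can be marked offline in the presence store.
function trackHeartbeats(wss) {
  wss.on('connection', (socket) => {
    socket.isAlive = true;
    socket.on('pong', () => { socket.isAlive = true; });
  });

  setInterval(() => {
    for (const socket of wss.clients) {
      if (!socket.isAlive) {
        socket.terminate();   // ~60s without a pong → treat as offline
        continue;
      }
      socket.isAlive = false; // reset; the next pong proves liveness
      socket.ping();
    }
  }, 30000);
}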
Message Flow: Step-by-Step
- Client A sends message: `POST /api/send` with message content and `recipient_id`
- API Gateway validates: checks authentication, message format, and that the recipient exists
- ChatService persists: writes the message to Cassandra with status = “sent”
- Publish to Kafka: partition by `conversation_id` for ordering
- Kafka topic `messages-{conversation_id}` is consumed by:
  - The recipient’s gateway (if online)
  - The sender’s archive service
- Send to recipient: if recipient online, gateway pushes via WebSocket
- Recipient receives: sends back “delivered” receipt
- Update message status: sender’s client shows “delivered”
- Recipient reads message: client sends “read” receipt
- Sender sees “read”: status updates in real-time
// Pseudocode: Message send flow
async function sendMessage(senderId, recipientId, content) {
  const message = {
    id: uuid(),
    sender_id: senderId,
    recipient_id: recipientId,
    content,
    timestamp: Date.now(),
    status: 'sent'
  };

  // Persist to Cassandra first: the message is durable before we attempt delivery
  await cassandra.insert('messages', message);

  // Publish to Kafka; partitioning by conversation preserves per-conversation ordering
  const conversationId = deriveConversationId(senderId, recipientId);
  await kafka.publish(`messages-${conversationId}`, { type: 'message', payload: message });

  // Check whether the recipient has an active WebSocket session
  const session = await redis.get(`user:${recipientId}:session`);
  if (session) {
    // Recipient is online—route to the gateway holding their connection.
    // Status becomes 'delivered' only when the recipient's device acks
    // (step 6 above), not at send time.
    const { server_id } = JSON.parse(session);
    await routeToGateway(server_id, message);
  } else {
    // Recipient offline—queue a push notification
    await pushQueue.enqueue({ user_id: recipientId, message_id: message.id });
  }
  return message;
}
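The `deriveConversationId` helper isn’t defined above; one plausible implementation makes the ID deterministic by sorting the pair of user IDs, so both directions of a 1:1 chat map to the same conversation:

// Hypothetical helper: A→B and B→A must resolve to the same conversation
function deriveConversationId(userA, userB) {
  const [low, high] = [userA, userB].sort();
  return `${low}:${high}`;
}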
Group Messaging and Fan-out
For a group with 10 members, sending one message requires delivering to 9 recipients. At scale, this is expensive:
Naive approach: for each group message, loop through all members and send—but this creates 9× the load on the delivery system.
Better approach: Kafka fan-out. One Kafka topic per group conversation:
- Topic `group-chat-{group_id}` receives one event: “message_123 sent by Alice”
- Each member’s consumer subscribes to this topic
- Each member’s local process delivers the message to their client
- This parallelizes the fan-out across the member base
For very large groups (say 500 members), even Kafka fan-out can bottleneck. Advanced solution: write amplification on the server side. When Alice sends a message to a 500-person group:
- Store one copy of the message in the database
- Create 500 delivery records: `(user_id, message_id, delivery_status)`
- Each delivery worker processes these asynchronously
This trades storage (500 extra rows) for processing efficiency (no real-time surge).
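A sketch of that write path, assuming a `message_deliveries` table and a batch-insert helper (both names are illustrative):

// One canonical message row, plus one pending delivery record per recipient.
// Delivery workers later consume the pending records asynchronously.
async function fanOutToGroup(message, memberIds) {
  await cassandra.insert('messages', message);
  const records = memberIds
    .filter((id) => id !== message.sender_id)
    .map((id) => ({
      user_id: id,
      message_id: message.id,
      delivery_status: 'pending'
    }));
  await cassandra.batchInsert('message_deliveries', records); // assumed batch API
}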
Presence Detection and Online Status
Users want to know who’s available. The presence system is simple but crucial:
CREATE TABLE presence (
user_id UUID PRIMARY KEY,
status VARCHAR, -- 'online', 'away', 'offline'
last_seen TIMESTAMP,
device VARCHAR, -- 'ios', 'android', 'web'
connection_server_id VARCHAR
);
Heartbeat Mechanism:
- Client sends a heartbeat every 30 seconds (via WebSocket ping/pong)
- Server updates `last_seen` in Redis (fast, no DB write)
- If no heartbeat arrives for 60 seconds, mark the user offline
Subscriptions: When you open a chat, you subscribe to presence updates for that user:
- Server publishes to the Redis pub/sub channel `presence-change-{user_id}`: online
- Your client subscribes and receives the update instantly
Note: Presence is eventually consistent. A user might appear online for 60 seconds after going offline (due to heartbeat timeout), and that’s acceptable for a chat UX.
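A sketch of the publish and subscribe sides, assuming an ioredis-style client (channel naming follows the pattern above):

// Publisher side: update presence in Redis, then notify subscribers
async function setPresence(redis, publisher, userId, status) {
  await redis.hset(`presence:${userId}`, 'status', status, 'last_seen', Date.now());
  await publisher.publish(`presence-change-${userId}`, status);
}

// Subscriber side: called when the user opens a chat with `userId`
async function subscribeToPresence(redis, userId, onChange) {
  const sub = redis.duplicate(); // pub/sub needs a dedicated connection
  await sub.subscribe(`presence-change-${userId}`);
  sub.on('message', (_channel, status) => onChange(status));
  return sub;                    // caller closes it when the chat closes
}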
Message Storage in Cassandra
We use Cassandra (not traditional SQL) because:
- Append-heavy workload: most operations insert new messages
- Time-series nature: queries are “give me messages from this conversation after this timestamp”
- Horizontal scaling: add nodes, not shards
CREATE TABLE messages (
    conversation_id UUID,  -- Partition key
    timestamp BIGINT,      -- Clustering key (reverse-sorted)
    message_id UUID,       -- Clustering tiebreaker: two messages can share a timestamp
    sender_id UUID,
    content TEXT,
    status VARCHAR,        -- 'sent', 'delivered', 'read'
    PRIMARY KEY (conversation_id, timestamp, message_id)
) WITH CLUSTERING ORDER BY (timestamp DESC, message_id DESC);
Partition by conversation_id: All messages for a 1:1 chat or group are co-located on the same Cassandra node (or replica). This makes “fetch last 50 messages” very fast.
Reverse timestamp clustering: Newer messages appear first, so pagination is natural.
Time-based cleanup: Cassandra’s TTL (Time To Live) automatically deletes messages after 1 year via background compaction.
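For example, fetching the latest page of a conversation is a single-partition read (this sketch assumes the DataStax cassandra-driver with a connected `client` instance):

// The partition is already sorted newest-first by the clustering order,
// so "latest 50" needs no ORDER BY and touches a single partition.
async function fetchRecentMessages(conversationId, limit = 50) {
  const query = `
    SELECT message_id, sender_id, content, timestamp, status
    FROM messages
    WHERE conversation_id = ?
    LIMIT ?`;
  const result = await client.execute(query, [conversationId, limit], { prepare: true });
  return result.rows;
}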
Delivery Receipts and Message Status
Messages transition through states:
┌─────────┐
│ sent │ Message stored, timestamp recorded
└────┬────┘
│
▼
┌─────────────┐
│ delivered │ Recipient's device received it
└────┬────────┘
│
▼
┌──────┐
│ read │ Recipient opened the message
└──────┘
Each transition is a small update to Cassandra:
async function markMessageAsRead(conversationId, timestamp, messageId) {
  // Cassandra updates must supply the full primary key
  // (conversation_id, timestamp, message_id)—there is no
  // secondary index on message_id alone.
  await cassandra.update(
    'messages',
    { status: 'read' },
    { conversation_id: conversationId, timestamp, message_id: messageId }
  );
  // Publish to Kafka so the sender's client can update its UI
  await kafka.publish('read-receipts', { message_id: messageId, timestamp: Date.now() });
}
This gives users real-time feedback: they see exactly when their message was delivered and read. The sender’s client subscribes to the `read-receipts` topic to keep its UI synchronized.
End-to-End Encryption (E2EE)
Modern chat systems encrypt messages so that even the server can’t read them. We use the Signal Protocol:
High-level flow:
- Alice and Bob exchange public keys (asymmetric encryption)
- They derive a shared secret (key agreement)
- Each message is encrypted with a derived per-message key (symmetric encryption)
- The per-message key is “ratcheted” forward after every message, providing forward secrecy (sketched below)
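The full Double Ratchet is beyond this chapter’s scope, but the core step — derive a fresh key per message, then encrypt symmetrically — can be illustrated with Node’s crypto module (a simplified sketch, not the Signal Protocol itself):

const crypto = require('crypto');

// Derive a unique key for this message from the shared secret, then
// AES-256-GCM encrypt. The server only ever sees { iv, ciphertext, tag }.
function encryptMessage(sharedSecret, messageIndex, plaintext) {
  const key = crypto.hkdfSync('sha256', sharedSecret, Buffer.alloc(0),
                              `msg-${messageIndex}`, 32);
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv('aes-256-gcm', Buffer.from(key), iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  return { iv, ciphertext, tag: cipher.getAuthTag() };
}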
From the server’s perspective:
- Message arrives as encrypted blob
- We store it as-is (server never decrypts)
- We deliver it to recipient’s device
- Only recipient’s device has the keys to decrypt
Trade-off: With E2EE, the server cannot search messages on the user’s behalf. Users must search locally on their device. This is a privacy win but a feature loss.
Scaling Strategies
WebSocket Gateway Sharding: Shard users across the gateway fleet by hashing user_id. Note that plain modulo (user_id % num_gateways) remaps almost every user whenever the fleet grows or shrinks; a consistent-placement scheme keeps reassignment proportional to the change, so a returning user lands on the same gateway (sticky routing) and the registry churns less.
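One simple consistent-placement scheme is rendezvous (highest-random-weight) hashing — when a gateway joins or leaves, only roughly 1/N of users move. A sketch (gateway names are illustrative):

const crypto = require('crypto');

// Score every (user, gateway) pair; the user connects to the top scorer.
// Deterministic, so every caller computes the same answer.
function pickGateway(userId, gateways) {
  let best = null;
  let bestScore = -1;
  for (const gw of gateways) {
    const hash = crypto.createHash('md5').update(`${userId}:${gw}`).digest();
    const score = hash.readUInt32BE(0);
    if (score > bestScore) { bestScore = score; best = gw; }
  }
  return best;
}

// pickGateway('user-42', ['gw-1', 'gw-2', 'gw-3']) returns the same gateway every call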
Kafka Partitions: Key every message by conversation_id so all messages for a conversation land on the same partition. This guarantees ordering within a conversation while allowing parallel processing across conversations. (A literal partition per conversation would not scale—Kafka clusters support far fewer partitions than a chat system has conversations.)
Cassandra Ring: Add nodes as storage grows. Cassandra rebalances data automatically (each node holds a token range).
Media Upload: Separate media service backed by S3 + CloudFront CDN. When user uploads an image, it goes directly to S3, bypassing the chat service. The chat service only stores metadata: { media_type: 'image', url: 'https://cdn.../media/abc123.jpg' }.
Push Notifications: Integrate with iOS/Android push services (APNs, FCM). When a message arrives for an offline user, the push service sends a notification. When user taps it, the app reconnects and fetches the message from Cassandra.
Key Trade-offs
| Trade-off | Option A | Option B | Our Choice |
|---|---|---|---|
| Connection protocol | WebSocket | Long polling | WebSocket (lower latency, bidirectional) |
| Message ordering | Server-imposed | Client-imposed | Server-imposed (easier, consistent) |
| Presence consistency | Strong (slow) | Eventual (fast) | Eventual (up to ~60s staleness acceptable) |
| Storage per message | One copy | Per-recipient copy | One copy + delivery records (hybrid) |
| E2EE | Full (no search) | None (can search) | Full (privacy priority) |
Did You Know?
Idempotency keys: Network failures can cause duplicate delivery. If a message is sent but the client times out before receiving acknowledgment, it might retry. Solution: store an idempotency key (unique per message attempt) and deduplicate on the server. This is why chat apps feel so reliable—they’re built on idempotent operations.
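A sketch of that dedup check using Redis SET with NX (first write wins) and a TTL so keys don’t accumulate forever (key format and TTL are illustrative):

// SET ... NX succeeds only for the first attempt with this key;
// retries of the same attempt see null and are dropped as duplicates.
async function handleIncomingMessage(idempotencyKey, message) {
  const firstTime = await redis.set(`idem:${idempotencyKey}`, '1', 'EX', 86400, 'NX');
  if (!firstTime) {
    return { duplicate: true };
  }
  return processMessage(message);
}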
Read-receipt storms: In a group chat with 500 members, when one member reads the message, all 499 other members get the “read” notification. Aggregating read receipts and batching notifications prevents read-receipt message storms.
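A minimal batching sketch — collect readers per message in memory and flush on an interval (the 2-second window is an assumed tuning knob):

// message_id -> Set of user_ids who read it since the last flush
const pendingReads = new Map();

function recordRead(messageId, readerId) {
  if (!pendingReads.has(messageId)) pendingReads.set(messageId, new Set());
  pendingReads.get(messageId).add(readerId);
}

// One batched receipt per message instead of one event per reader
setInterval(async () => {
  for (const [messageId, readers] of pendingReads) {
    await kafka.publish('read-receipts', {
      message_id: messageId,
      readers: [...readers],
      count: readers.size
    });
  }
  pendingReads.clear();
}, 2000);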
Practice Exercise
Extend the system for:
- Voice and video calls — users initiate 1:1 calls through the chat app. How would you signal (initiate), establish peer-to-peer connections, and fall back if P2P fails? (Hint: STUN servers, TURN servers, WebRTC signaling over WebSocket.)
- Message threading and reactions — users reply to specific messages and add emoji reactions. How would you model this in your database? What new queries would you need?
- Slack-like @mentions and search — users can search for messages and be mentioned. With E2EE, how do you support full-text search? (Hint: client-side indexing or encrypted search.)
Connection to Chapter 23
You’ve now designed two critical real-world systems: file storage (handling massive data, durability, sync) and chat (handling real-time communication, ordering, reliability). These patterns—chunking, replication, event streaming, eventual consistency, idempotency—are the building blocks of distributed systems.
In Chapter 23, we’ll synthesize everything you’ve learned throughout this book. We’ll walk through how to approach a system design interview systematically, starting from vague requirements to a complete architecture. You’ll learn how to communicate your trade-offs, defend your choices, and iterate when the interviewer raises new constraints. Because ultimately, system design is about making informed decisions under uncertainty—and that’s an art as much as a science.