Chat Application
The Problem We’re Solving
Your phone buzzes. A message arrives. Instantly. Across the globe, you see your friend’s typing indicator and within milliseconds receive their reply. Chat applications like WhatsApp and Slack make this feel effortless, but behind the scenes the system is coordinating hundreds of millions of concurrent connections, guaranteeing message ordering, managing online/offline state, and ensuring not a single message is lost.
Designing a real-time chat system is one of the most challenging infrastructure problems at scale. We’re building a system that supports 500M users, 100M concurrent connections, and must deliver messages in under 100ms. Oh, and end-to-end encryption too.
Functional Requirements
Let’s define what we’re building:
- 1:1 messaging between any two users
- Group chats with up to 500 members
- Real-time message delivery — messages appear on recipient’s device immediately if online
- Message history — retrieve conversation history and search past messages
- Presence detection — online/offline status, last seen timestamp
- Delivery status — per-message receipts (sent, delivered, read) with timestamps
- Media sharing — images, files up to 100MB, voice messages
- Push notifications — notify offline users when they receive messages
- Typing indicators — show when someone is typing
Non-Functional Requirements
- Message delivery latency: under 100ms for online recipients
- Scale: 500M users, 100M concurrent connections
- Durability: at-least-once delivery — no message loss
- Message ordering: within a single conversation, messages appear in order
- Availability: 99.99% uptime
- Consistency: eventual consistency acceptable for presence, strong consistency for message state
Scale Estimation
Let’s run the numbers:
Message volume:
- 500M total users, 100M concurrent (20% of the user base online at once)
- Average user sends 40 messages/day
- 100M × 40 = 4 billion messages/day
- 4B ÷ 86,400 seconds = ~46,000 messages/second (steady state)
- Peak load (3x) = ~140,000 messages/second
Concurrent connections:
- 100M users online simultaneously
- Each maintaining a persistent WebSocket connection
- Memory per connection (rough): 1–5KB metadata, so 100M × 3KB = 300GB total memory across connection servers
Storage:
- Message retention: 1 year for all messages
- 4B messages/day × 365 days = 1.46 trillion messages
- Average message size: 500 bytes
- Total: ~730TB of message storage
| Metric | Value | Notes |
|---|---|---|
| Steady-state message QPS | 46K | Peak ~140K/sec (3x multiplier) |
| Concurrent connections | 100M | Each user with persistent WebSocket |
| Annual message volume | 1.46T | At 40 messages/user/day |
| Message storage | 730TB | 1 year retention, 500B average |
| Group conversations avg size | 8–10 members | Impacts fan-out costs |
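If you want to sanity-check these figures, the arithmetic fits in a few lines of JavaScript (the 20% concurrency and 3× peak multiplier are the assumptions stated above):

// Back-of-envelope check of the numbers in the table
const users = 500e6;
const concurrent = users * 0.2;                      // 100M online at once
const msgsPerUserPerDay = 40;

const msgsPerDay = concurrent * msgsPerUserPerDay;   // 4e9 messages/day
const steadyQps = msgsPerDay / 86400;                // ~46,300 messages/sec
const peakQps = steadyQps * 3;                       // ~139,000 messages/sec

const bytesPerMsg = 500;
const yearlyMsgs = msgsPerDay * 365;                 // ~1.46e12 messages
const storageTB = (yearlyMsgs * bytesPerMsg) / 1e12; // ~730 TB

console.log({ steadyQps, peakQps, yearlyMsgs, storageTB });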
High-Level Architecture
graph TB
Client["Mobile/Web Client"]
WSGateway["WebSocket Gateway<br/>(Connection Pool)"]
ChatService["Chat Service<br/>(Message Logic)"]
MsgQueue["Message Queue<br/>(Kafka)"]
MsgStore["Message Storage<br/>(Cassandra)"]
Presence["Presence Store<br/>(Redis)"]
MediaService["Media Service<br/>(S3 + CDN)"]
PushService["Push Notification<br/>Service"]
Client -->|WebSocket| WSGateway
WSGateway -->|Validate &<br/>Persist| ChatService
ChatService -->|Publish| MsgQueue
MsgQueue -->|Consume| ChatService
ChatService -->|Store msg| MsgStore
ChatService -->|Send to recipient| WSGateway
WSGateway -->|Update| Presence
PushService -->|Query| Presence
Client -->|Upload| MediaService
PushService -->|iOS/Android| Client
The architecture hinges on three paths:
- Message ingestion: client sends message to WebSocket gateway
- Message delivery: ChatService routes to recipient’s gateway
- Offline handling: if recipient offline, notify via push service
Deep Dive: WebSocket Connection Management
At 100M concurrent connections, no single server can handle the load. We need to distribute connections across a fleet:
graph LR
A["Client A"]
B["Client B"]
C["Client C"]
A -->|ws://server-1| GW1["Gateway Server 1<br/>(20M conn)"]
B -->|ws://server-2| GW2["Gateway Server 2<br/>(20M conn)"]
C -->|ws://server-3| GW3["Gateway Server 3<br/>(20M conn)"]
GW1 -.->|Find Client B| Registry["Connection<br/>Registry<br/>(Redis)"]
Registry -.->|Found: server-2| GW1
GW1 -->|Route msg| GW2
GW2 -->|Deliver| B
Connection Registry in Redis: When Client A connects to Server 1, we store `user_id -> {server_id: 1, connection_id: 'abc123'}` in Redis. When a message for Client B arrives, we:
- Query Redis: “Where is user B?”
- Discover B is on Server 2
- Route the message via internal RPC or message queue (sketched below)
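A minimal sketch of that lookup-and-route step, assuming the registry key layout above; `rpc.call` is a hypothetical stand-in for whatever internal transport connects gateways:

// Look up the recipient's gateway in Redis and forward the message
async function routeMessage(recipientId, message) {
  const raw = await redis.get(`user:${recipientId}:session`);
  if (!raw) {
    // No active session: fall back to the offline/push path
    return pushQueue.enqueue({ user_id: recipientId, message_id: message.id });
  }
  const { server_id, connection_id } = JSON.parse(raw);
  // Internal RPC (or a per-gateway queue) to the server holding the socket
  await rpc.call(server_id, 'deliver', { connection_id, message });
}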
Sticky Sessions: To minimize re-routing and maintain connection affinity, we use sticky sessions in the load balancer. The risk is concentrated failure: when a gateway server dies, every connection it held drops at once. The registry entries for that server expire (or are removed by health checks), and clients reconnect to a healthy gateway.
Heartbeat and Timeout: Clients send a heartbeat every 30 seconds. If the server doesn’t receive one for 60 seconds, it marks the user offline. This detects network failures and closed connections gracefully.
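A server-side sketch of this heartbeat loop, assuming the Node `ws` library’s built-in ping/pong (intervals match the 30s/60s numbers above):

// Ping every 30s; a socket that misses two intervals is terminated
// and its user can be marked offline in the presence store.
function trackHeartbeats(wss) {
  wss.on('connection', (socket) => {
    socket.isAlive = true;
    socket.on('pong', () => { socket.isAlive = true; });
  });

  setInterval(() => {
    for (const socket of wss.clients) {
      if (!socket.isAlive) {
        socket.terminate();   // ~60s without a pong → treat as offline
        continue;
      }
      socket.isAlive = false; // reset; the next pong proves liveness
      socket.ping();
    }
  }, 30000);
}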
Message Flow: Step-by-Step
- Client A sends message: `POST /api/send` with message content and `recipient_id`
- API Gateway validates: checks authentication, message format, and that the recipient exists
- ChatService persists: writes the message to Cassandra with status = “sent”
- Publish to Kafka: partition by `conversation_id` for ordering
- Kafka topic `messages-{conversation_id}` is consumed by:
  - The recipient’s gateway (if online)
  - The sender’s archive service
- Send to recipient: if recipient online, gateway pushes via WebSocket
- Recipient receives: sends back “delivered” receipt
- Update message status: sender’s client shows “delivered”
- Recipient reads message: client sends “read” receipt
- Sender sees “read”: status updates in real-time
// Pseudocode: Message send flow
async function sendMessage(senderId, recipientId, content) {
  const message = {
    id: uuid(),
    sender_id: senderId,
    recipient_id: recipientId,
    content,
    timestamp: Date.now(),
    status: 'sent'
  };

  // Persist to Cassandra first: the message is durable before we attempt delivery
  await cassandra.insert('messages', message);

  // Publish to Kafka; partitioning by conversation preserves per-conversation ordering
  const conversationId = deriveConversationId(senderId, recipientId);
  await kafka.publish(`messages-${conversationId}`, { type: 'message', payload: message });

  // Check whether the recipient has an active WebSocket session
  const session = await redis.get(`user:${recipientId}:session`);
  if (session) {
    // Recipient is online—route to the gateway holding their connection.
    // Status becomes 'delivered' only when the recipient's device acks
    // (step 6 above), not at send time.
    const { server_id } = JSON.parse(session);
    await routeToGateway(server_id, message);
  } else {
    // Recipient offline—queue a push notification
    await pushQueue.enqueue({ user_id: recipientId, message_id: message.id });
  }
  return message;
}
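The `deriveConversationId` helper isn’t defined above; one plausible implementation makes the ID deterministic by sorting the pair of user IDs, so both directions of a 1:1 chat map to the same conversation:

// Hypothetical helper: A→B and B→A must resolve to the same conversation
function deriveConversationId(userA, userB) {
  const [low, high] = [userA, userB].sort();
  return `${low}:${high}`;
}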
Group Messaging and Fan-out
For a group with 10 members, sending one message requires delivering to 9 recipients. At scale, this is expensive:
Naive approach: for each group message, loop through all members and send—but this creates 9× the load on the delivery system.
Better approach: Kafka fan-out. One Kafka topic per group conversation:
- Topic `group-chat-{group_id}` receives one event: “message_123 sent by Alice”
- Each member’s consumer subscribes to this topic
- Each member’s local process delivers the message to their client
- This parallelizes the fan-out across the member base
For very large groups (say 500 members), even Kafka fan-out can bottleneck. Advanced solution: write amplification on the server side. When Alice sends a message to a 500-person group:
- Store one copy of the message in the database
- Create 500 delivery records: `(user_id, message_id, delivery_status)`
- Each delivery worker processes these asynchronously
This trades storage (500 extra rows) for processing efficiency (no real-time surge).
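A sketch of that write path, assuming a `message_deliveries` table and a batch-insert helper (both names are illustrative):

// One canonical message row, plus one pending delivery record per recipient.
// Delivery workers later consume the pending records asynchronously.
async function fanOutToGroup(message, memberIds) {
  await cassandra.insert('messages', message);
  const records = memberIds
    .filter((id) => id !== message.sender_id)
    .map((id) => ({
      user_id: id,
      message_id: message.id,
      delivery_status: 'pending'
    }));
  await cassandra.batchInsert('message_deliveries', records); // assumed batch API
}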
Presence Detection and Online Status
Users want to know who’s available. The presence system is simple but crucial:
CREATE TABLE presence (
user_id UUID PRIMARY KEY,
status VARCHAR, -- 'online', 'away', 'offline'
last_seen TIMESTAMP,
device VARCHAR, -- 'ios', 'android', 'web'
connection_server_id VARCHAR
);
Heartbeat Mechanism:
- Client sends a heartbeat every 30 seconds (via WebSocket ping/pong)
- Server updates `last_seen` in Redis (fast, no DB write)
- If no heartbeat arrives for 60 seconds, mark the user offline
Subscriptions: When you open a chat, you subscribe to presence updates for that user:
- Server publishes to the Redis pub/sub channel `presence-change-{user_id}`: online
- Your client subscribes and receives the update instantly
Note: Presence is eventually consistent. A user might appear online for 60 seconds after going offline (due to heartbeat timeout), and that’s acceptable for a chat UX.
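A sketch of the publish and subscribe sides, assuming an ioredis-style client (channel naming follows the pattern above):

// Publisher side: update presence in Redis, then notify subscribers
async function setPresence(redis, publisher, userId, status) {
  await redis.hset(`presence:${userId}`, 'status', status, 'last_seen', Date.now());
  await publisher.publish(`presence-change-${userId}`, status);
}

// Subscriber side: called when the user opens a chat with `userId`
async function subscribeToPresence(redis, userId, onChange) {
  const sub = redis.duplicate(); // pub/sub needs a dedicated connection
  await sub.subscribe(`presence-change-${userId}`);
  sub.on('message', (_channel, status) => onChange(status));
  return sub;                    // caller closes it when the chat closes
}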
Message Storage in Cassandra
We use Cassandra (not traditional SQL) because:
- Append-heavy workload: most operations insert new messages
- Time-series nature: queries are “give me messages from this conversation after this timestamp”
- Horizontal scaling: add nodes, not shards
CREATE TABLE messages (
    conversation_id UUID,  -- Partition key
    timestamp BIGINT,      -- Clustering key (reverse-sorted)
    message_id UUID,       -- Clustering tiebreaker: two messages can share a timestamp
    sender_id UUID,
    content TEXT,
    status VARCHAR,        -- 'sent', 'delivered', 'read'
    PRIMARY KEY (conversation_id, timestamp, message_id)
) WITH CLUSTERING ORDER BY (timestamp DESC, message_id DESC);
Partition by conversation_id: All messages for a 1:1 chat or group are co-located on the same Cassandra node (or replica). This makes “fetch last 50 messages” very fast.
Reverse timestamp clustering: Newer messages appear first, so pagination is natural.
Time-based cleanup: Cassandra’s TTL (Time To Live) automatically deletes messages after 1 year via background compaction.
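For example, fetching the latest page of a conversation is a single-partition read (this sketch assumes the DataStax cassandra-driver with a connected `client` instance):

// The partition is already sorted newest-first by the clustering order,
// so "latest 50" needs no ORDER BY and touches a single partition.
async function fetchRecentMessages(conversationId, limit = 50) {
  const query = `
    SELECT message_id, sender_id, content, timestamp, status
    FROM messages
    WHERE conversation_id = ?
    LIMIT ?`;
  const result = await client.execute(query, [conversationId, limit], { prepare: true });
  return result.rows;
}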
Delivery Receipts and Message Status
Messages transition through states:
┌─────────┐
│ sent │ Message stored, timestamp recorded
└────┬────┘
│
▼
┌─────────────┐
│ delivered │ Recipient's device received it
└────┬────────┘
│
▼
┌──────┐
│ read │ Recipient opened the message
└──────┘
Each transition is a small update to Cassandra:
async function markMessageAsRead(conversationId, timestamp, messageId) {
  // Cassandra updates must supply the full primary key
  // (conversation_id, timestamp, message_id)—there is no
  // secondary index on message_id alone.
  await cassandra.update(
    'messages',
    { status: 'read' },
    { conversation_id: conversationId, timestamp, message_id: messageId }
  );
  // Publish to Kafka so the sender's client can update its UI
  await kafka.publish('read-receipts', { message_id: messageId, timestamp: Date.now() });
}
This gives users real-time feedback: they see exactly when their message was delivered and read. The sender’s client subscribes to the `read-receipts` topic to keep its UI synchronized.
End-to-End Encryption (E2EE)
Modern chat systems encrypt messages so that even the server can’t read them. We use the Signal Protocol:
High-level flow:
- Alice and Bob exchange public keys (asymmetric encryption)
- They derive a shared secret (key agreement)
- Each message is encrypted with a derived per-message key (symmetric encryption)
- The per-message key is “ratcheted” forward after every message, providing forward secrecy (sketched below)
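The full Double Ratchet is beyond this chapter’s scope, but the core step — derive a fresh key per message, then encrypt symmetrically — can be illustrated with Node’s crypto module (a simplified sketch, not the Signal Protocol itself):

const crypto = require('crypto');

// Derive a unique key for this message from the shared secret, then
// AES-256-GCM encrypt. The server only ever sees { iv, ciphertext, tag }.
function encryptMessage(sharedSecret, messageIndex, plaintext) {
  const key = crypto.hkdfSync('sha256', sharedSecret, Buffer.alloc(0),
                              `msg-${messageIndex}`, 32);
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv('aes-256-gcm', Buffer.from(key), iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  return { iv, ciphertext, tag: cipher.getAuthTag() };
}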
From the server’s perspective:
- Message arrives as encrypted blob
- We store it as-is (server never decrypts)
- We deliver it to recipient’s device
- Only recipient’s device has the keys to decrypt
Trade-off: With E2EE, the server cannot search messages on the user’s behalf. Users must search locally on their device. This is a privacy win but a feature loss.
Scaling Strategies
WebSocket Gateway Sharding: Shard users across the gateway fleet by hashing user_id. Note that plain modulo (user_id % num_gateways) remaps almost every user whenever the fleet grows or shrinks; a consistent-placement scheme keeps reassignment proportional to the change, so a returning user lands on the same gateway (sticky routing) and the registry churns less.
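One simple consistent-placement scheme is rendezvous (highest-random-weight) hashing — when a gateway joins or leaves, only roughly 1/N of users move. A sketch (gateway names are illustrative):

const crypto = require('crypto');

// Score every (user, gateway) pair; the user connects to the top scorer.
// Deterministic, so every caller computes the same answer.
function pickGateway(userId, gateways) {
  let best = null;
  let bestScore = -1;
  for (const gw of gateways) {
    const hash = crypto.createHash('md5').update(`${userId}:${gw}`).digest();
    const score = hash.readUInt32BE(0);
    if (score > bestScore) { bestScore = score; best = gw; }
  }
  return best;
}

// pickGateway('user-42', ['gw-1', 'gw-2', 'gw-3']) returns the same gateway every call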
Kafka Partitions: Key every message by conversation_id so all messages for a conversation land on the same partition. This guarantees ordering within a conversation while allowing parallel processing across conversations. (A literal partition per conversation would not scale—Kafka clusters support far fewer partitions than a chat system has conversations.)
Cassandra Ring: Add nodes as storage grows. Cassandra rebalances data automatically (each node holds a token range).
Media Upload: Separate media service backed by S3 + CloudFront CDN. When user uploads an image, it goes directly to S3, bypassing the chat service. The chat service only stores metadata: { media_type: 'image', url: 'https://cdn.../media/abc123.jpg' }.
Push Notifications: Integrate with iOS/Android push services (APNs, FCM). When a message arrives for an offline user, the push service sends a notification. When user taps it, the app reconnects and fetches the message from Cassandra.
Key Trade-offs
| Trade-off | Option A | Option B | Our Choice |
|---|---|---|---|
| Connection protocol | WebSocket | Long polling | WebSocket (lower latency, bidirectional) |
| Message ordering | Server-imposed | Client-imposed | Server-imposed (easier, consistent) |
| Presence consistency | Strong (slow) | Eventual (fast) | Eventual (up to ~60s staleness acceptable) |
| Storage per message | One copy | Per-recipient copy | One copy + delivery records (hybrid) |
| E2EE | Full (no search) | None (can search) | Full (privacy priority) |
Did You Know?
Idempotency keys: Network failures can cause duplicate delivery. If a message is sent but the client times out before receiving acknowledgment, it might retry. Solution: store an idempotency key (unique per message attempt) and deduplicate on the server. This is why chat apps feel so reliable—they’re built on idempotent operations.
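A sketch of that dedup check using Redis SET with NX (first write wins) and a TTL so keys don’t accumulate forever (key format and TTL are illustrative):

// SET ... NX succeeds only for the first attempt with this key;
// retries of the same attempt see null and are dropped as duplicates.
async function handleIncomingMessage(idempotencyKey, message) {
  const firstTime = await redis.set(`idem:${idempotencyKey}`, '1', 'EX', 86400, 'NX');
  if (!firstTime) {
    return { duplicate: true };
  }
  return processMessage(message);
}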
Read-receipt storms: In a group chat with 500 members, when one member reads the message, all 499 other members get the “read” notification. Aggregating read receipts and batching notifications prevents read-receipt message storms.
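A minimal batching sketch — collect readers per message in memory and flush on an interval (the 2-second window is an assumed tuning knob):

// message_id -> Set of user_ids who read it since the last flush
const pendingReads = new Map();

function recordRead(messageId, readerId) {
  if (!pendingReads.has(messageId)) pendingReads.set(messageId, new Set());
  pendingReads.get(messageId).add(readerId);
}

// One batched receipt per message instead of one event per reader
setInterval(async () => {
  for (const [messageId, readers] of pendingReads) {
    await kafka.publish('read-receipts', {
      message_id: messageId,
      readers: [...readers],
      count: readers.size
    });
  }
  pendingReads.clear();
}, 2000);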
Practice Exercise
Extend the system for:
- Voice and video calls — users initiate 1:1 calls through the chat app. How would you signal (initiate), establish peer-to-peer connections, and fall back if P2P fails? (Hint: STUN servers, TURN servers, WebRTC signaling over WebSocket.)
- Message threading and reactions — users reply to specific messages and add emoji reactions. How would you model this in your database? What new queries would you need?
- Slack-like @mentions and search — users can search for messages and be mentioned. With E2EE, how do you support full-text search? (Hint: client-side indexing or encrypted search.)
Connection to Chapter 23
You’ve now designed two critical real-world systems: file storage (handling massive data, durability, sync) and chat (handling real-time communication, ordering, reliability). These patterns—chunking, replication, event streaming, eventual consistency, idempotency—are the building blocks of distributed systems.
In Chapter 23, we’ll synthesize everything you’ve learned throughout this book. We’ll walk through how to approach a system design interview systematically, starting from vague requirements to a complete architecture. You’ll learn how to communicate your trade-offs, defend your choices, and iterate when the interviewer raises new constraints. Because ultimately, system design is about making informed decisions under uncertainty—and that’s an art as much as a science.