Notification System
The Problem: A Multi-Channel Notification Architecture
Let’s tackle a problem you’ll face in almost every real-world application: how do you reliably send notifications to hundreds of millions of users across multiple channels — push notifications, SMS, email, and in-app messages — without overwhelming your infrastructure or annoying your users?
Imagine you’re building this for a platform with 100M users. You need to send password reset codes, order confirmations, marketing campaigns, security alerts, and social notifications. Each channel has different characteristics: push notifications are fast but only work on mobile apps; SMS is reliable but expensive; email is flexible but slow; in-app is instant but only reaches active users.
This is the notification system interview question. It tests your understanding of message queues, worker processes, rate limiting, third-party integrations, and handling scale gracefully.
Functional and Non-Functional Requirements
Functional requirements:
- Send notifications via multiple channels (push, SMS, email, in-app)
- Support user preferences (opt-in/opt-out, quiet hours, channel preferences)
- Template system for dynamic message content
- Schedule notifications for future delivery
- Batch or digest related notifications
- Track delivery status and failures
- Handle retries and dead-letter queues
Non-functional requirements:
- Throughput: Handle millions of notifications per hour
- Latency: Real-time notifications (like OTP codes) delivered within seconds; batch notifications are flexible
- Delivery guarantee: At-least-once delivery semantics
- Availability: System should be highly available even if one channel provider is down
- User-centric: Configurable per-user (rate limits, preferences, quiet hours)
Scale Estimation
Let’s do the math:
- 100M users, average 5 notifications per user per day
- 500M notifications/day
- Average: ~5,800 notifications/second
- Peak: ~20,000 notifications/second (during high-traffic hours)
- Storage: Assume notification metadata is retained for 30 days; a notification record with metadata is about 500 bytes, so 500M × 500 bytes = 250 GB per day, or roughly 7.5 TB over the full retention window
This is substantial. We can’t simply process everything synchronously or store everything in a single database.
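To sanity-check these numbers in an interview, it helps to show the arithmetic. A quick sketch using the assumptions above (the 3.5× peak factor is an assumption chosen to land near the quoted peak):

```python
# Back-of-envelope estimate for the numbers above.
USERS = 100_000_000
NOTIFS_PER_USER_PER_DAY = 5
RECORD_BYTES = 500
RETENTION_DAYS = 30
PEAK_FACTOR = 3.5  # assumption: peak traffic is a few times the average

daily = USERS * NOTIFS_PER_USER_PER_DAY                        # 500M notifications/day
avg_per_sec = daily / 86_400                                   # ~5,800/s
peak_per_sec = avg_per_sec * PEAK_FACTOR                       # ~20,000/s
storage_per_day_gb = daily * RECORD_BYTES / 1e9                # ~250 GB/day
storage_total_tb = storage_per_day_gb * RETENTION_DAYS / 1e3   # ~7.5 TB retained

print(f"{avg_per_sec:,.0f}/s avg, {peak_per_sec:,.0f}/s peak, "
      f"{storage_per_day_gb:.0f} GB/day, {storage_total_tb:.1f} TB retained")
```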
High-Level Architecture
Here’s how we structure the system:
graph LR
Client["API Client<br/>(App/Service)"]
APIGw["API Gateway"]
NotifService["Notification Service<br/>(Validate, Enrich)"]
PriorityQueue["Priority Queue<br/>(RabbitMQ/Kafka)"]
PushWorker["Push Worker<br/>(APNs, FCM)"]
SMSWorker["SMS Worker<br/>(Twilio, AWS SNS)"]
EmailWorker["Email Worker<br/>(SendGrid, SES)"]
InAppWorker["In-App Worker<br/>(Write to DB)"]
PrefStore["User Preference<br/>Store"]
TemplateEngine["Template<br/>Engine"]
Analytics["Analytics<br/>Service"]
DLQ["Dead Letter<br/>Queue"]
Client -->|POST /notify| APIGw
APIGw --> NotifService
NotifService --> PrefStore
NotifService --> TemplateEngine
NotifService --> PriorityQueue
PriorityQueue -->|High Priority| PushWorker
PriorityQueue -->|Standard| SMSWorker
PriorityQueue -->|Standard| EmailWorker
PriorityQueue -->|Real-time| InAppWorker
PushWorker --> Analytics
SMSWorker --> Analytics
EmailWorker --> Analytics
InAppWorker --> Analytics
PushWorker -->|Failed| DLQ
SMSWorker -->|Failed| DLQ
EmailWorker -->|Failed| DLQ
The flow:
- API client calls the Notification Service with notification intent
- Notification Service validates the request, enriches it with user preferences, renders the template
- Priority Queue holds notifications (separate queues for different priorities)
- Channel-specific workers consume from the queue and send via their respective providers
- Analytics tracks delivery status, opens, clicks
- Dead Letter Queue handles failures for later retry or debugging
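To make the first two steps concrete, here is a minimal sketch of the request shape the Notification Service might accept and the validation it runs before anything is queued. The field names, priority values, and channel list are illustrative assumptions, not a fixed API contract:

```python
from dataclasses import dataclass, field

VALID_CHANNELS = {"push", "sms", "email", "in_app"}
VALID_PRIORITIES = {"high", "standard", "low"}

@dataclass
class NotificationRequest:
    user_id: str
    template_id: str                     # e.g. "order_shipped"
    channels: list[str]                  # requested channels, filtered later by preferences
    priority: str = "standard"
    context: dict = field(default_factory=dict)   # template variables
    idempotency_key: str | None = None

def validate(req: NotificationRequest) -> None:
    """Reject malformed requests before they reach the queue."""
    if not req.user_id or not req.template_id:
        raise ValueError("user_id and template_id are required")
    if req.priority not in VALID_PRIORITIES:
        raise ValueError(f"unknown priority: {req.priority}")
    unknown = set(req.channels) - VALID_CHANNELS
    if unknown:
        raise ValueError(f"unknown channels: {unknown}")
```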
Deep Dive: Critical Components
Message Prioritization
Not all notifications are equal. An OTP code for security needs to go out in seconds, but a marketing campaign can tolerate minutes of delay.
We use a priority queue with at least three tiers:
| Priority | Examples | Latency | Retry |
|---|---|---|---|
| High (P1) | OTP, password reset, security alerts | under 30 seconds | aggressive (exponential backoff) |
| Standard (P2) | Order confirmation, shipment updates | under 5 minutes | moderate |
| Low (P3) | Marketing campaigns, digest emails | under 1 hour | lenient |
Workers are sized accordingly — we might have 10 workers for P1, 3 for P2, and 1 for P3.
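A small sketch of how the service could route a notification to the right queue; the queue names mirror the tiers in the table above and are naming assumptions, not a required convention:

```python
# Maps the priority tiers from the table above to dedicated queues.
PRIORITY_QUEUES = {
    "high": "notifications.p1",      # OTP, password reset, security alerts
    "standard": "notifications.p2",  # order confirmations, shipment updates
    "low": "notifications.p3",       # marketing campaigns, digests
}

def queue_for(priority: str) -> str:
    # Unknown priorities fall back to the standard tier rather than failing.
    return PRIORITY_QUEUES.get(priority, PRIORITY_QUEUES["standard"])
```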
User Preference Service
Users should have control. The User Preference Service stores:
{
"user_id": "12345",
"channels": {
"push": { "enabled": true, "quiet_hours": ["22:00", "08:00"] },
"sms": { "enabled": true, "max_per_day": 5 },
"email": { "enabled": true, "digest": true, "digest_frequency": "daily" },
"in_app": { "enabled": true }
},
"notification_types": {
"marketing": { "enabled": false },
"transactional": { "enabled": true },
"social": { "enabled": true, "channels": ["in_app", "push"] }
}
}
Before sending a notification, we check:
- Is this notification type enabled for this user?
- Is this channel enabled?
- Are we in quiet hours?
- Have we hit the rate limit for this channel today?
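Putting those checks together, a minimal gate function; the preference shape follows the JSON document above, while the quiet-hours handling and the source of the per-day counter are assumptions:

```python
from datetime import datetime, time

def in_quiet_hours(now: datetime, quiet: list[str]) -> bool:
    """quiet is e.g. ["22:00", "08:00"]; handles windows that cross midnight."""
    start, end = (time.fromisoformat(t) for t in quiet)
    t_now = now.time()
    if start <= end:
        return start <= t_now < end
    return t_now >= start or t_now < end   # window wraps past midnight

def may_send(prefs: dict, notif_type: str, channel: str,
             sent_today: int, now: datetime) -> bool:
    type_cfg = prefs["notification_types"].get(notif_type, {})
    chan_cfg = prefs["channels"].get(channel, {})
    if not type_cfg.get("enabled", False) or not chan_cfg.get("enabled", False):
        return False
    # Some types restrict which channels they may use (e.g. social -> in_app, push).
    if "channels" in type_cfg and channel not in type_cfg["channels"]:
        return False
    if "quiet_hours" in chan_cfg and in_quiet_hours(now, chan_cfg["quiet_hours"]):
        return False
    if sent_today >= chan_cfg.get("max_per_day", float("inf")):
        return False
    return True
```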
Template Rendering
Notifications are templated. We store templates:
template_id: "order_shipped"
subject: "Your order {{order_id}} has shipped!"
body: "Hi {{user_name}}, your package is on the way. Tracking: {{tracking_url}}"
The Notification Service renders these with provided context before queuing.
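A minimal sketch of that render step, assuming the double-brace placeholders shown above; a production system would more likely use a full template engine (e.g. Jinja) with explicit handling of missing variables:

```python
import re

TEMPLATES = {
    "order_shipped": {
        "subject": "Your order {{order_id}} has shipped!",
        "body": "Hi {{user_name}}, your package is on the way. Tracking: {{tracking_url}}",
    }
}

def render(template_id: str, context: dict) -> dict:
    """Substitute {{placeholders}} with values from the request context."""
    def fill(text: str) -> str:
        return re.sub(r"\{\{(\w+)\}\}",
                      lambda m: str(context.get(m.group(1), m.group(0))), text)
    tpl = TEMPLATES[template_id]
    return {"subject": fill(tpl["subject"]), "body": fill(tpl["body"])}

# render("order_shipped", {"order_id": "A123", "user_name": "Ada",
#                          "tracking_url": "https://example.com/t/A123"})
```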
Deduplication with Idempotency Keys
A critical problem: what if a request is retried? We might send the same notification twice. Solution: idempotency keys.
The client provides an idempotency_key (e.g., sha256(user_id + event_type + timestamp)). We store a mapping:
idempotency_key -> notification_id
If we see the same key again within a window (e.g., 24 hours), we return the existing notification_id without re-queuing.
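A sketch of that check, assuming Redis as the key store; SET with NX and a 24-hour expiry gives an atomic "insert only if absent," which matches the dedup window above:

```python
import redis  # assumption: redis-py client; the key store choice is illustrative

r = redis.Redis(host="localhost", port=6379)
WINDOW_SECONDS = 24 * 3600

def register_or_get(idempotency_key: str, notification_id: str) -> str:
    """Return the notification_id that owns this key (new or pre-existing)."""
    # SET ... NX EX: only writes if the key does not exist, with a 24h expiry.
    created = r.set(f"idem:{idempotency_key}", notification_id,
                    nx=True, ex=WINDOW_SECONDS)
    if created:
        return notification_id            # first time we see this key -> enqueue
    existing = r.get(f"idem:{idempotency_key}")
    return existing.decode()              # duplicate -> return the original id
```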
Retry and Dead Letter Queue
When a notification fails (provider timeout, invalid phone number, etc.), we retry with exponential backoff:
Retry 1: 1 second delay
Retry 2: 2 seconds delay
Retry 3: 4 seconds delay
... (max 5 retries)
After max retries, it goes to the Dead Letter Queue for manual inspection and debugging. A human can then decide to re-queue or investigate why a provider is failing.
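A sketch of the retry loop a worker might run, following the 1s/2s/4s schedule above; send_fn and publish_to_dlq stand in for the real provider and queue clients:

```python
import time

MAX_RETRIES = 5

def send_with_retries(notification, send_fn, publish_to_dlq) -> bool:
    """Try the provider, backing off exponentially; park the message on final failure."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            send_fn(notification)
            return True
        except Exception as err:
            if attempt == MAX_RETRIES:
                # Exhausted retries: hand off for manual inspection.
                publish_to_dlq({"notification": notification, "error": str(err)})
                return False
            time.sleep(2 ** attempt)   # 1s, 2s, 4s, 8s, 16s
    return False
```

In a real worker you would typically re-enqueue the message with a delay (or use the broker's delayed-delivery support) rather than sleeping in-process, so the worker can keep consuming other messages.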
Analytics and Delivery Tracking
We need to track:
- Sent: Notification queued and sent to provider
- Delivered: Provider confirmed delivery
- Opened: User opened the push notification or email
- Clicked: User clicked a link in the notification
- Failed: Provider returned an error
This data lives in a time-series database (ClickHouse, InfluxDB) or a data warehouse for analysis.
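One way to capture these transitions is an append-only event per status change; the schema below is an illustrative assumption of what each row written to the analytics store might contain:

```python
from datetime import datetime, timezone
from enum import Enum

class DeliveryStatus(Enum):
    SENT = "sent"
    DELIVERED = "delivered"
    OPENED = "opened"
    CLICKED = "clicked"
    FAILED = "failed"

def delivery_event(notification_id: str, channel: str,
                   status: DeliveryStatus, provider: str) -> dict:
    """Shape of one row written to the analytics store per status change."""
    return {
        "notification_id": notification_id,
        "channel": channel,
        "status": status.value,
        "provider": provider,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```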
Scaling Considerations
Horizontal Scaling of Workers
As load grows, we scale workers horizontally. With a message queue (Kafka, RabbitMQ), we can add more worker instances consuming from the same queue. The queue handles distribution.
Kafka partition strategy:
- Partition by user_id for in-app notifications (preserves per-user ordering)
- Partition by channel for push/SMS/email (no ordering requirement)
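A sketch of that keying decision at the producer, assuming the kafka-python client and a single notifications topic; topic names and serialization are illustrative:

```python
import json
from kafka import KafkaProducer  # assumption: kafka-python client

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish(notification: dict) -> None:
    if notification["channel"] == "in_app":
        # Key by user_id so one user's in-app notifications stay ordered
        # within a single partition.
        key = notification["user_id"].encode()
    else:
        # Push/SMS/email have no ordering requirement; key by channel
        # (or use no key at all for round-robin distribution).
        key = notification["channel"].encode()
    producer.send("notifications", key=key, value=notification)
```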
Queue Partitioning by Type
Don’t put everything in one queue. Instead:
- notifications.high-priority — for OTP, password reset
- notifications.transactional — for order confirmations
- notifications.marketing — for campaigns
- notifications.digest — for batched emails
This prevents a spike in marketing notifications from starving critical security alerts.
Channel Isolation
Each channel provider is independent:
- If FCM (Firebase Cloud Messaging) is down, push notifications fail, but email and SMS continue
- If Twilio is down, SMS fails, but push and email are unaffected
We can also use multiple providers for critical channels:
- Primary: FCM, Fallback: OneSignal
- Primary: Twilio, Fallback: AWS SNS
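A minimal primary/fallback wrapper, with the provider clients left as placeholders; in practice this sits behind a circuit breaker so a consistently failing primary is skipped quickly:

```python
def send_push(notification, primary_send, fallback_send) -> str:
    """Try the primary provider (e.g. FCM); fall back (e.g. OneSignal) on error."""
    try:
        primary_send(notification)
        return "primary"
    except Exception:
        # Primary failed or timed out; the fallback keeps the channel alive.
        fallback_send(notification)
        return "fallback"
```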
Batching and Digest
For non-urgent notifications, we can batch them:
- Collect notifications for a user over 1 hour
- Compile them into a digest email
- Send once per hour instead of multiple emails
This reduces noise and cost.
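A sketch of hourly digest assembly, with the per-user buffer shown as an in-memory dict for brevity; in practice the pending items would live in Redis or a table, and send_email is a placeholder:

```python
from collections import defaultdict

# user_id -> list of rendered notification bodies collected this hour
pending: dict[str, list[str]] = defaultdict(list)

def collect(user_id: str, body: str) -> None:
    pending[user_id].append(body)

def flush_digests(send_email) -> None:
    """Called once per hour: one email per user instead of one per notification."""
    for user_id, items in pending.items():
        if items:
            digest = "\n".join(f"- {item}" for item in items)
            send_email(user_id, subject=f"You have {len(items)} updates", body=digest)
    pending.clear()
```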
Trade-offs and Design Decisions
| Decision | Push Model | Pull Model | Our Choice |
|---|---|---|---|
| Architecture | Push to user immediately | User pulls when ready | Push (most notifications are time-sensitive) |
| Provider reliability | Dependency on third-party uptime | Direct control | Accept third-party, use fallbacks |
| Cost | Pay per notification sent | Pay for storage/bandwidth | Optimize with smart batching |
| Latency | Real-time | Delayed | Push for real-time needs |
Pro tip: Start with a simple architecture (single queue, single worker type) and add complexity only when you hit bottlenecks. Many teams over-engineer notification systems from the start.
Key Takeaways
- Prioritization matters: Distinguish between critical (OTP) and non-critical (marketing) notifications. Use separate queues.
- User control is essential: Preferences, quiet hours, rate limits — users will keep notifications enabled only if they feel in control.
- Deduplicate defensively: Use idempotency keys so retried requests never send the same notification twice.
- Multi-channel redundancy: If one provider fails, others should work. No single point of failure.
- Eventual consistency: Track delivery asynchronously. Real-time status updates are a nice-to-have, not a must-have.
Practice Exercise
Extend this design to support:
- A/B testing: Send different message variants to different users, measure engagement.
- Timezone-aware scheduling: A marketing campaign should arrive at 9 AM in each user’s local timezone.
- Rate limiting per sender: Prevent one service from overwhelming the queue (token bucket or sliding window counter).
Next up: We move from one-to-many communication (notifications) to one-way broadcast of content (feeds). How do we design a news feed that serves billions of feed requests per day?