Notification System
The Problem: A Multi-Channel Notification Architecture
Let’s tackle a problem you’ll face in almost every real-world application: how do you reliably send notifications to hundreds of millions of users across multiple channels — push notifications, SMS, email, and in-app messages — without overwhelming your infrastructure or annoying your users?
Imagine you’re building this for a platform with 100M users. You need to send password reset codes, order confirmations, marketing campaigns, security alerts, and social notifications. Each channel has different characteristics: push notifications are fast but only work on mobile apps; SMS is reliable but expensive; email is flexible but slow; in-app is instant but only reaches active users.
This is the notification system interview question. It tests your understanding of message queues, worker processes, rate limiting, third-party integrations, and handling scale gracefully.
Functional and Non-Functional Requirements
Functional requirements:
- Send notifications via multiple channels (push, SMS, email, in-app)
- Support user preferences (opt-in/opt-out, quiet hours, channel preferences)
- Template system for dynamic message content
- Schedule notifications for future delivery
- Batch or digest related notifications
- Track delivery status and failures
- Handle retries and dead-letter queues
Non-functional requirements:
- Throughput: Handle millions of notifications per hour
- Latency: Real-time notifications (like OTP codes) delivered within seconds; batch notifications are flexible
- Delivery guarantee: At-least-once delivery semantics
- Availability: System should be highly available even if one channel provider is down
- User-centric: Configurable per-user (rate limits, preferences, quiet hours)
Scale Estimation
Let’s do the math:
- 100M users, average 5 notifications per user per day
- 500M notifications/day
- Average: ~5,800 notifications/second
- Peak: ~20,000 notifications/second (during high-traffic hours)
- Storage: Assume notification metadata is retained for 30 days; a notification record with metadata is about 500 bytes, so 500M × 500 bytes = 250 GB per day, or roughly 7.5 TB over the full retention window
This is substantial. We can’t simply process everything synchronously or store everything in a single database.
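To sanity-check these numbers in an interview, it helps to show the arithmetic. A quick sketch using the assumptions above (the 3.5× peak factor is an assumption chosen to land near the quoted peak):

```python
# Back-of-envelope estimate for the numbers above.
USERS = 100_000_000
NOTIFS_PER_USER_PER_DAY = 5
RECORD_BYTES = 500
RETENTION_DAYS = 30
PEAK_FACTOR = 3.5  # assumption: peak traffic is a few times the average

daily = USERS * NOTIFS_PER_USER_PER_DAY                        # 500M notifications/day
avg_per_sec = daily / 86_400                                   # ~5,800/s
peak_per_sec = avg_per_sec * PEAK_FACTOR                       # ~20,000/s
storage_per_day_gb = daily * RECORD_BYTES / 1e9                # ~250 GB/day
storage_total_tb = storage_per_day_gb * RETENTION_DAYS / 1e3   # ~7.5 TB retained

print(f"{avg_per_sec:,.0f}/s avg, {peak_per_sec:,.0f}/s peak, "
      f"{storage_per_day_gb:.0f} GB/day, {storage_total_tb:.1f} TB retained")
```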
High-Level Architecture
Here’s how we structure the system:
graph LR
Client["API Client<br/>(App/Service)"]
APIGw["API Gateway"]
NotifService["Notification Service<br/>(Validate, Enrich)"]
PriorityQueue["Priority Queue<br/>(RabbitMQ/Kafka)"]
PushWorker["Push Worker<br/>(APNs, FCM)"]
SMSWorker["SMS Worker<br/>(Twilio, AWS SNS)"]
EmailWorker["Email Worker<br/>(SendGrid, SES)"]
InAppWorker["In-App Worker<br/>(Write to DB)"]
PrefStore["User Preference<br/>Store"]
TemplateEngine["Template<br/>Engine"]
Analytics["Analytics<br/>Service"]
DLQ["Dead Letter<br/>Queue"]
Client -->|POST /notify| APIGw
APIGw --> NotifService
NotifService --> PrefStore
NotifService --> TemplateEngine
NotifService --> PriorityQueue
PriorityQueue -->|High Priority| PushWorker
PriorityQueue -->|Standard| SMSWorker
PriorityQueue -->|Standard| EmailWorker
PriorityQueue -->|Real-time| InAppWorker
PushWorker --> Analytics
SMSWorker --> Analytics
EmailWorker --> Analytics
InAppWorker --> Analytics
PushWorker -->|Failed| DLQ
SMSWorker -->|Failed| DLQ
EmailWorker -->|Failed| DLQ
The flow:
- API client calls the Notification Service with notification intent
- Notification Service validates the request, enriches it with user preferences, renders the template
- Priority Queue holds notifications (separate queues for different priorities)
- Channel-specific workers consume from the queue and send via their respective providers
- Analytics tracks delivery status, opens, clicks
- Dead Letter Queue handles failures for later retry or debugging
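To make the first two steps concrete, here is a minimal sketch of the request shape the Notification Service might accept and the validation it runs before anything is queued. The field names, priority values, and channel list are illustrative assumptions, not a fixed API contract:

```python
from dataclasses import dataclass, field

VALID_CHANNELS = {"push", "sms", "email", "in_app"}
VALID_PRIORITIES = {"high", "standard", "low"}

@dataclass
class NotificationRequest:
    user_id: str
    template_id: str                     # e.g. "order_shipped"
    channels: list[str]                  # requested channels, filtered later by preferences
    priority: str = "standard"
    context: dict = field(default_factory=dict)   # template variables
    idempotency_key: str | None = None

def validate(req: NotificationRequest) -> None:
    """Reject malformed requests before they reach the queue."""
    if not req.user_id or not req.template_id:
        raise ValueError("user_id and template_id are required")
    if req.priority not in VALID_PRIORITIES:
        raise ValueError(f"unknown priority: {req.priority}")
    unknown = set(req.channels) - VALID_CHANNELS
    if unknown:
        raise ValueError(f"unknown channels: {unknown}")
```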
Deep Dive: Critical Components
Message Prioritization
Not all notifications are equal. An OTP code for security needs to go out in seconds, but a marketing campaign can tolerate minutes of delay.
We use a priority queue with at least three tiers:
| Priority | Examples | Latency | Retry |
|---|---|---|---|
| High (P1) | OTP, password reset, security alerts | under 30 seconds | aggressive (exponential backoff) |
| Standard (P2) | Order confirmation, shipment updates | under 5 minutes | moderate |
| Low (P3) | Marketing campaigns, digest emails | under 1 hour | lenient |
Workers are sized accordingly — we might have 10 workers for P1, 3 for P2, and 1 for P3.
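A small sketch of how the service could route a notification to the right queue; the queue names mirror the tiers in the table above and are naming assumptions, not a required convention:

```python
# Maps the priority tiers from the table above to dedicated queues.
PRIORITY_QUEUES = {
    "high": "notifications.p1",      # OTP, password reset, security alerts
    "standard": "notifications.p2",  # order confirmations, shipment updates
    "low": "notifications.p3",       # marketing campaigns, digests
}

def queue_for(priority: str) -> str:
    # Unknown priorities fall back to the standard tier rather than failing.
    return PRIORITY_QUEUES.get(priority, PRIORITY_QUEUES["standard"])
```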
User Preference Service
Users should have control. The User Preference Service stores:
{
"user_id": "12345",
"channels": {
"push": { "enabled": true, "quiet_hours": ["22:00", "08:00"] },
"sms": { "enabled": true, "max_per_day": 5 },
"email": { "enabled": true, "digest": true, "digest_frequency": "daily" },
"in_app": { "enabled": true }
},
"notification_types": {
"marketing": { "enabled": false },
"transactional": { "enabled": true },
"social": { "enabled": true, "channels": ["in_app", "push"] }
}
}
Before sending a notification, we check:
- Is this notification type enabled for this user?
- Is this channel enabled?
- Are we in quiet hours?
- Have we hit the rate limit for this channel today?
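Putting those checks together, a minimal gate function; the preference shape follows the JSON document above, while the quiet-hours handling and the source of the per-day counter are assumptions:

```python
from datetime import datetime, time

def in_quiet_hours(now: datetime, quiet: list[str]) -> bool:
    """quiet is e.g. ["22:00", "08:00"]; handles windows that cross midnight."""
    start, end = (time.fromisoformat(t) for t in quiet)
    t_now = now.time()
    if start <= end:
        return start <= t_now < end
    return t_now >= start or t_now < end   # window wraps past midnight

def may_send(prefs: dict, notif_type: str, channel: str,
             sent_today: int, now: datetime) -> bool:
    type_cfg = prefs["notification_types"].get(notif_type, {})
    chan_cfg = prefs["channels"].get(channel, {})
    if not type_cfg.get("enabled", False) or not chan_cfg.get("enabled", False):
        return False
    # Some types restrict which channels they may use (e.g. social -> in_app, push).
    if "channels" in type_cfg and channel not in type_cfg["channels"]:
        return False
    if "quiet_hours" in chan_cfg and in_quiet_hours(now, chan_cfg["quiet_hours"]):
        return False
    if sent_today >= chan_cfg.get("max_per_day", float("inf")):
        return False
    return True
```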
Template Rendering
Notifications are templated. We store templates:
template_id: "order_shipped"
subject: "Your order {{order_id}} has shipped!"
body: "Hi {{user_name}}, your package is on the way. Tracking: {{tracking_url}}"
The Notification Service renders these with provided context before queuing.
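A minimal sketch of that render step, assuming the double-brace placeholders shown above; a production system would more likely use a full template engine (e.g. Jinja) with explicit handling of missing variables:

```python
import re

TEMPLATES = {
    "order_shipped": {
        "subject": "Your order {{order_id}} has shipped!",
        "body": "Hi {{user_name}}, your package is on the way. Tracking: {{tracking_url}}",
    }
}

def render(template_id: str, context: dict) -> dict:
    """Substitute {{placeholders}} with values from the request context."""
    def fill(text: str) -> str:
        return re.sub(r"\{\{(\w+)\}\}",
                      lambda m: str(context.get(m.group(1), m.group(0))), text)
    tpl = TEMPLATES[template_id]
    return {"subject": fill(tpl["subject"]), "body": fill(tpl["body"])}

# render("order_shipped", {"order_id": "A123", "user_name": "Ada",
#                          "tracking_url": "https://example.com/t/A123"})
```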
Deduplication with Idempotency Keys
A critical problem: what if a request is retried? We might send the same notification twice. Solution: idempotency keys.
The client provides an idempotency_key (e.g., sha256(user_id + event_type + timestamp)). We store a mapping:
idempotency_key -> notification_id
If we see the same key again within a window (e.g., 24 hours), we return the existing notification_id without re-queuing.
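A sketch of that check, assuming Redis as the key store; SET with NX and a 24-hour expiry gives an atomic "insert only if absent," which matches the dedup window above:

```python
import redis  # assumption: redis-py client; the key store choice is illustrative

r = redis.Redis(host="localhost", port=6379)
WINDOW_SECONDS = 24 * 3600

def register_or_get(idempotency_key: str, notification_id: str) -> str:
    """Return the notification_id that owns this key (new or pre-existing)."""
    # SET ... NX EX: only writes if the key does not exist, with a 24h expiry.
    created = r.set(f"idem:{idempotency_key}", notification_id,
                    nx=True, ex=WINDOW_SECONDS)
    if created:
        return notification_id            # first time we see this key -> enqueue
    existing = r.get(f"idem:{idempotency_key}")
    return existing.decode()              # duplicate -> return the original id
```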
Retry and Dead Letter Queue
When a notification fails (provider timeout, invalid phone number, etc.), we retry with exponential backoff:
Retry 1: 1 second delay
Retry 2: 2 seconds delay
Retry 3: 4 seconds delay
... (max 5 retries)
After max retries, it goes to the Dead Letter Queue for manual inspection and debugging. A human can then decide to re-queue or investigate why a provider is failing.
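A sketch of the retry loop a worker might run, following the 1s/2s/4s schedule above; send_fn and publish_to_dlq stand in for the real provider and queue clients:

```python
import time

MAX_RETRIES = 5

def send_with_retries(notification, send_fn, publish_to_dlq) -> bool:
    """Try the provider, backing off exponentially; park the message on final failure."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            send_fn(notification)
            return True
        except Exception as err:
            if attempt == MAX_RETRIES:
                # Exhausted retries: hand off for manual inspection.
                publish_to_dlq({"notification": notification, "error": str(err)})
                return False
            time.sleep(2 ** attempt)   # 1s, 2s, 4s, 8s, 16s
    return False
```

In a real worker you would typically re-enqueue the message with a delay (or use the broker's delayed-delivery support) rather than sleeping in-process, so the worker can keep consuming other messages.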
Analytics and Delivery Tracking
We need to track:
- Sent: Notification queued and sent to provider
- Delivered: Provider confirmed delivery
- Opened: User opened the push notification or email
- Clicked: User clicked a link in the notification
- Failed: Provider returned an error
This data lives in a time-series database (ClickHouse, InfluxDB) or a data warehouse for analysis.
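One way to capture these transitions is an append-only event per status change; the schema below is an illustrative assumption of what each row written to the analytics store might contain:

```python
from datetime import datetime, timezone
from enum import Enum

class DeliveryStatus(Enum):
    SENT = "sent"
    DELIVERED = "delivered"
    OPENED = "opened"
    CLICKED = "clicked"
    FAILED = "failed"

def delivery_event(notification_id: str, channel: str,
                   status: DeliveryStatus, provider: str) -> dict:
    """Shape of one row written to the analytics store per status change."""
    return {
        "notification_id": notification_id,
        "channel": channel,
        "status": status.value,
        "provider": provider,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```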
Scaling Considerations
Horizontal Scaling of Workers
As load grows, we scale workers horizontally. With a message queue (Kafka, RabbitMQ), we can add more worker instances consuming from the same queue. The queue handles distribution.
Kafka partition strategy:
- Partition by user_id for in-app notifications (preserves per-user ordering)
- Partition by channel for push/SMS/email (no ordering requirement)
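A sketch of that keying decision at the producer, assuming the kafka-python client and a single notifications topic; topic names and serialization are illustrative:

```python
import json
from kafka import KafkaProducer  # assumption: kafka-python client

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish(notification: dict) -> None:
    if notification["channel"] == "in_app":
        # Key by user_id so one user's in-app notifications stay ordered
        # within a single partition.
        key = notification["user_id"].encode()
    else:
        # Push/SMS/email have no ordering requirement; key by channel
        # (or use no key at all for round-robin distribution).
        key = notification["channel"].encode()
    producer.send("notifications", key=key, value=notification)
```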
Queue Partitioning by Type
Don’t put everything in one queue. Instead:
- notifications.high-priority — for OTP, password reset
- notifications.transactional — for order confirmations
- notifications.marketing — for campaigns
- notifications.digest — for batched emails
This prevents a spike in marketing notifications from starving critical security alerts.
Channel Isolation
Each channel provider is independent:
- If FCM (Firebase Cloud Messaging) is down, push notifications fail, but email and SMS continue
- If Twilio is down, SMS fails, but push and email are unaffected
We can also use multiple providers for critical channels:
- Primary: FCM, Fallback: OneSignal
- Primary: Twilio, Fallback: AWS SNS
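A minimal primary/fallback wrapper, with the provider clients left as placeholders; in practice this sits behind a circuit breaker so a consistently failing primary is skipped quickly:

```python
def send_push(notification, primary_send, fallback_send) -> str:
    """Try the primary provider (e.g. FCM); fall back (e.g. OneSignal) on error."""
    try:
        primary_send(notification)
        return "primary"
    except Exception:
        # Primary failed or timed out; the fallback keeps the channel alive.
        fallback_send(notification)
        return "fallback"
```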
Batching and Digest
For non-urgent notifications, we can batch them:
- Collect notifications for a user over 1 hour
- Compile them into a digest email
- Send once per hour instead of multiple emails
This reduces noise and cost.
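A sketch of hourly digest assembly, with the per-user buffer shown as an in-memory dict for brevity; in practice the pending items would live in Redis or a table, and send_email is a placeholder:

```python
from collections import defaultdict

# user_id -> list of rendered notification bodies collected this hour
pending: dict[str, list[str]] = defaultdict(list)

def collect(user_id: str, body: str) -> None:
    pending[user_id].append(body)

def flush_digests(send_email) -> None:
    """Called once per hour: one email per user instead of one per notification."""
    for user_id, items in pending.items():
        if items:
            digest = "\n".join(f"- {item}" for item in items)
            send_email(user_id, subject=f"You have {len(items)} updates", body=digest)
    pending.clear()
```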
Trade-offs and Design Decisions
| Decision | Push Model | Pull Model | Our Choice |
|---|---|---|---|
| Architecture | Push to user immediately | User pulls when ready | Push (most notifications are time-sensitive) |
| Provider reliability | Dependency on third-party uptime | Direct control | Accept third-party, use fallbacks |
| Cost | Pay per notification sent | Pay for storage/bandwidth | Optimize with smart batching |
| Latency | Real-time | Delayed | Push for real-time needs |
Pro tip: Start with a simple architecture (single queue, single worker type) and add complexity only when you hit bottlenecks. Many teams over-engineer notification systems from the start.
Key Takeaways
- Prioritization matters: Distinguish between critical (OTP) and non-critical (marketing) notifications. Use separate queues.
- User control is essential: Preferences, quiet hours, rate limits — users will keep notifications enabled only if they feel in control.
- Deduplicate defensively: Use idempotency keys so retried requests never send the same notification twice.
- Multi-channel redundancy: If one provider fails, others should work. No single point of failure.
- Eventual consistency: Track delivery asynchronously. Real-time status updates are a nice-to-have, not a must-have.
Practice Exercise
Extend this design to support:
- A/B testing: Send different message variants to different users, measure engagement.
- Timezone-aware scheduling: A marketing campaign should arrive at 9 AM in each user’s local timezone.
- Rate limiting per sender: Prevent one service from overwhelming the queue (token bucket or sliding window counter).
Next up: We move from one-to-many communication (notifications) to one-way broadcast of content (feeds). How do we design a news feed that serves billions of feed requests per day?