System Design Fundamentals

Notification System

The Problem: A Multi-Channel Notification Architecture

Let’s tackle a problem you’ll face in almost every real-world application: how do you reliably send notifications to hundreds of millions of users across multiple channels — push notifications, SMS, email, and in-app messages — without overwhelming your infrastructure or annoying your users?

Imagine you’re building this for a platform with 100M users. You need to send password reset codes, order confirmations, marketing campaigns, security alerts, and social notifications. Each channel has different characteristics: push notifications are fast but only work on mobile apps; SMS is reliable but expensive; email is flexible but slow; in-app is instant but only reaches active users.

This is the notification system interview question. It tests your understanding of message queues, worker processes, rate limiting, third-party integrations, and handling scale gracefully.

Functional and Non-Functional Requirements

Functional requirements:

  • Send notifications via multiple channels (push, SMS, email, in-app)
  • Support user preferences (opt-in/opt-out, quiet hours, channel preferences)
  • Template system for dynamic message content
  • Schedule notifications for future delivery
  • Batch or digest related notifications
  • Track delivery status and failures
  • Handle retries and dead-letter queues

Non-functional requirements:

  • Throughput: Handle millions of notifications per hour
  • Latency: Real-time notifications (like OTP codes) delivered within seconds; batch notifications are flexible
  • Delivery guarantee: At-least-once delivery semantics
  • Availability: System should be highly available even if one channel provider is down
  • User-centric: Configurable per-user (rate limits, preferences, quiet hours)

Scale Estimation

Let’s do the math:

  • 100M users, average 5 notifications per user per day
  • 500M notifications/day
  • Average: ~5,800 notifications/second
  • Peak: ~20,000 notifications/second (during high-traffic hours)
  • Storage: Assume notification metadata retention of 30 days; a notification record with metadata is about 500 bytes, so 500M × 500 bytes = 250 GB per day, or roughly 7.5 TB over the 30-day window

This is substantial. We can’t simply process everything synchronously or store everything in a single database.
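
These figures are easy to sanity-check. The snippet below simply reproduces the arithmetic above; the 500-byte record size and the peak-to-average factor are the same rough assumptions, not measurements:

# Back-of-envelope check of the estimates above (all inputs are assumptions)
USERS = 100_000_000
NOTIFS_PER_USER_PER_DAY = 5
RECORD_BYTES = 500                 # metadata per notification record
RETENTION_DAYS = 30
PEAK_FACTOR = 3.5                  # rough peak-to-average ratio

daily = USERS * NOTIFS_PER_USER_PER_DAY              # 500M notifications/day
avg_per_sec = daily / 86_400                         # ~5,800/s
peak_per_sec = avg_per_sec * PEAK_FACTOR             # ~20,000/s
storage_per_day_gb = daily * RECORD_BYTES / 1e9      # 250 GB/day
storage_total_tb = storage_per_day_gb * RETENTION_DAYS / 1000   # ~7.5 TB retained

print(f"{avg_per_sec:,.0f}/s avg, {peak_per_sec:,.0f}/s peak, "
      f"{storage_per_day_gb:.0f} GB/day, {storage_total_tb:.1f} TB over {RETENTION_DAYS} days")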

High-Level Architecture

Here’s how we structure the system:

graph LR
    Client["API Client<br/>(App/Service)"]
    APIGw["API Gateway"]
    NotifService["Notification Service<br/>(Validate, Enrich)"]
    PriorityQueue["Priority Queue<br/>(RabbitMQ/Kafka)"]

    PushWorker["Push Worker<br/>(APNs, FCM)"]
    SMSWorker["SMS Worker<br/>(Twilio, AWS SNS)"]
    EmailWorker["Email Worker<br/>(SendGrid, SES)"]
    InAppWorker["In-App Worker<br/>(Write to DB)"]

    PrefStore["User Preference<br/>Store"]
    TemplateEngine["Template<br/>Engine"]
    Analytics["Analytics<br/>Service"]
    DLQ["Dead Letter<br/>Queue"]

    Client -->|POST /notify| APIGw
    APIGw --> NotifService
    NotifService --> PrefStore
    NotifService --> TemplateEngine
    NotifService --> PriorityQueue

    PriorityQueue -->|High Priority| PushWorker
    PriorityQueue -->|Standard| SMSWorker
    PriorityQueue -->|Standard| EmailWorker
    PriorityQueue -->|Real-time| InAppWorker

    PushWorker --> Analytics
    SMSWorker --> Analytics
    EmailWorker --> Analytics
    InAppWorker --> Analytics

    PushWorker -->|Failed| DLQ
    SMSWorker -->|Failed| DLQ
    EmailWorker -->|Failed| DLQ

The flow:

  1. API client calls the Notification Service with notification intent
  2. Notification Service validates the request, enriches it with user preferences, and renders the template (a minimal sketch of this step follows the list)
  3. Priority Queue holds notifications (separate queues for different priorities)
  4. Channel-specific workers consume from the queue and send via their respective providers
  5. Analytics tracks delivery status, opens, clicks
  6. Dead Letter Queue handles failures for later retry or debugging
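
To make step 2 of the flow concrete, here is a minimal sketch of what the Notification Service's entry point might do. prefs_store, templates, and queue (and the methods called on them) are hypothetical stand-ins for the components in the diagram, not a real API:

import uuid

def handle_notify(request: dict, prefs_store, templates, queue) -> str:
    """Validate, enrich, render, and enqueue one notification (simplified sketch)."""
    # 1. Validate the incoming intent
    for field in ("user_id", "type", "template_id", "context"):
        if field not in request:
            raise ValueError(f"missing field: {field}")

    # 2. Enrich with user preferences (opt-outs, channel choices, quiet hours)
    prefs = prefs_store.get(request["user_id"])

    # 3. Render the template with the caller-supplied context
    payload = templates.render(request["template_id"], request["context"])

    # 4. Publish one job per allowed channel; topic names follow the
    #    queue-partitioning scheme discussed later in this section
    notification_id = str(uuid.uuid4())
    for channel in prefs.allowed_channels(request["type"]):
        queue.publish(
            topic=f"notifications.{request.get('priority', 'transactional')}",
            message={
                "notification_id": notification_id,
                "user_id": request["user_id"],
                "channel": channel,
                "payload": payload,
            },
        )
    return notification_id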

Deep Dive: Critical Components

Message Prioritization

Not all notifications are equal. An OTP code for security needs to go out in seconds, but a marketing campaign can tolerate minutes of delay.

We use a priority queue with at least three tiers:

| Priority | Examples | Latency | Retry |
| --- | --- | --- | --- |
| High (P1) | OTP, password reset, security alerts | under 30 seconds | aggressive (exponential backoff) |
| Standard (P2) | Order confirmation, shipment updates | under 5 minutes | moderate |
| Low (P3) | Marketing campaigns, digest emails | under 1 hour | lenient |

Workers are sized accordingly — we might have 10 workers for P1, 3 for P2, and 1 for P3.
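
One way to encode these tiers is a small policy table that both the Notification Service and the workers consult. The numbers below mirror the table and worker sizing above; the retry counts and the structure itself are illustrative, not a prescribed format:

# Per-tier policy mirroring the table above (worker counts and retry values are examples)
PRIORITY_POLICY = {
    "P1": {"latency_slo_s": 30,   "max_retries": 5, "base_backoff_s": 1,  "workers": 10},
    "P2": {"latency_slo_s": 300,  "max_retries": 3, "base_backoff_s": 5,  "workers": 3},
    "P3": {"latency_slo_s": 3600, "max_retries": 1, "base_backoff_s": 60, "workers": 1},
}

def policy_for(priority: str) -> dict:
    # Unknown priorities fall back to the standard tier rather than failing the send
    return PRIORITY_POLICY.get(priority, PRIORITY_POLICY["P2"])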

User Preference Service

Users should have control. The User Preference Service stores:

{
  "user_id": "12345",
  "channels": {
    "push": { "enabled": true, "quiet_hours": ["22:00", "08:00"] },
    "sms": { "enabled": true, "max_per_day": 5 },
    "email": { "enabled": true, "digest": true, "digest_frequency": "daily" },
    "in_app": { "enabled": true }
  },
  "notification_types": {
    "marketing": { "enabled": false },
    "transactional": { "enabled": true },
    "social": { "enabled": true, "channels": ["in_app", "push"] }
  }
}

Before sending a notification, we check (a minimal version of this gate is sketched after the list):

  • Is this notification type enabled for this user?
  • Is this channel enabled?
  • Are we in quiet hours?
  • Have we hit the rate limit for this channel today?
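
Putting those checks together, a preference gate might look like the sketch below. It assumes the preference document shown earlier, treats quiet hours as a window that may wrap past midnight, and expects sent_today to come from a per-channel counter (e.g. in Redis):

from datetime import datetime, time

def in_quiet_hours(now: time, window: list[str]) -> bool:
    """True if 'now' falls inside a window such as ['22:00', '08:00']."""
    start, end = (time.fromisoformat(t) for t in window)
    if start <= end:
        return start <= now < end
    return now >= start or now < end             # window wraps past midnight

def may_send(prefs: dict, notif_type: str, channel: str, sent_today: int) -> bool:
    now = datetime.now().time()
    type_prefs = prefs["notification_types"].get(notif_type, {})
    chan_prefs = prefs["channels"].get(channel, {})

    if not type_prefs.get("enabled", False):      # notification type opted out
        return False
    if "channels" in type_prefs and channel not in type_prefs["channels"]:
        return False                              # type restricted to other channels
    if not chan_prefs.get("enabled", False):      # channel disabled
        return False
    if "quiet_hours" in chan_prefs and in_quiet_hours(now, chan_prefs["quiet_hours"]):
        return False                              # inside quiet hours
    if sent_today >= chan_prefs.get("max_per_day", float("inf")):
        return False                              # daily cap for this channel reached
    return True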

Template Rendering

Notifications are templated. We store templates:

template_id: "order_shipped"
subject: "Your order {{order_id}} has shipped!"
body: "Hi {{user_name}}, your package is on the way. Tracking: {{tracking_url}}"

The Notification Service renders these with provided context before queuing.
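
The {{placeholder}} syntax shown above can be rendered with a small regex substitution (a real system might use a templating library such as Jinja2). A self-contained sketch:

import re

TEMPLATES = {
    "order_shipped": {
        "subject": "Your order {{order_id}} has shipped!",
        "body": "Hi {{user_name}}, your package is on the way. Tracking: {{tracking_url}}",
    },
}

def render(template_id: str, context: dict) -> dict:
    """Replace {{key}} placeholders with values from context; missing keys raise KeyError."""
    def substitute(text: str) -> str:
        return re.sub(r"\{\{(\w+)\}\}", lambda m: str(context[m.group(1)]), text)
    return {field: substitute(text) for field, text in TEMPLATES[template_id].items()}

# Example:
# render("order_shipped", {"order_id": "A-1001", "user_name": "Maya",
#                          "tracking_url": "https://example.com/t/A-1001"})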

Deduplication with Idempotency Keys

A critical problem: what if a request is retried? We might send the same notification twice. Solution: idempotency keys.

The client provides an idempotency_key (e.g., sha256(user_id + event_type + timestamp)). We store a mapping:

idempotency_key -> notification_id

If we see the same key again within a window (e.g., 24 hours), we return the existing notification_id without re-queuing.
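
With Redis, the check and the store can happen atomically via SET with NX and an expiry, which closes the race between two concurrent retries. A sketch using the redis-py client:

import redis

r = redis.Redis()
DEDUP_TTL_SECONDS = 24 * 3600        # the 24-hour deduplication window

def register_notification(idempotency_key: str, notification_id: str) -> str:
    """Return the notification_id that wins for this key (the new one or the existing one)."""
    # SET key value NX EX ttl: succeeds only if the key does not already exist
    created = r.set(f"idem:{idempotency_key}", notification_id,
                    nx=True, ex=DEDUP_TTL_SECONDS)
    if created:
        return notification_id                        # first sighting: safe to enqueue
    return r.get(f"idem:{idempotency_key}").decode()  # duplicate: reuse the original id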

Retry and Dead Letter Queue

When a notification fails (provider timeout, invalid phone number, etc.), we retry with exponential backoff:

Retry 1: 1 second delay
Retry 2: 2 seconds delay
Retry 3: 4 seconds delay
... (max 5 retries)

After max retries, it goes to the Dead Letter Queue for manual inspection and debugging. A human can then decide to re-queue or investigate why a provider is failing.
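
The schedule above is plain exponential backoff: delay = base * 2^(attempt - 1). A worker-side sketch, where send and dead_letter_queue are hypothetical stand-ins for the provider call and the DLQ publisher; jitter is added so simultaneous failures don't retry in lockstep:

import random
import time

MAX_RETRIES = 5
BASE_DELAY_S = 1

def send_with_retry(notification: dict, send, dead_letter_queue):
    """Attempt delivery with exponential backoff; park the message in the DLQ on exhaustion."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return send(notification)
        except Exception as exc:                      # provider timeout, 5xx, invalid number, ...
            if attempt == MAX_RETRIES:
                dead_letter_queue.publish({"notification": notification,
                                           "error": str(exc),
                                           "attempts": attempt})
                return None
            # 1s, 2s, 4s, ... plus jitter to avoid a thundering herd of retries
            time.sleep(BASE_DELAY_S * 2 ** (attempt - 1) + random.uniform(0, 0.5))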

Analytics and Delivery Tracking

We need to track:

  • Sent: Notification queued and sent to provider
  • Delivered: Provider confirmed delivery
  • Opened: User opened the push notification or email
  • Clicked: User clicked a link in the notification
  • Failed: Provider returned an error

This data lives in a time-series database (ClickHouse, InfluxDB) or a data warehouse for analysis.
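
To keep these statuses consistent across channel workers, each worker can emit a common event record into the analytics pipeline. The schema below is illustrative; the field names are not prescribed by any particular provider:

from dataclasses import dataclass, asdict
from enum import Enum

class DeliveryStatus(Enum):
    SENT = "sent"            # queued and handed to the provider
    DELIVERED = "delivered"  # provider confirmed delivery
    OPENED = "opened"        # user opened the push notification or email
    CLICKED = "clicked"      # user clicked a link in the notification
    FAILED = "failed"        # provider returned an error

@dataclass
class DeliveryEvent:
    notification_id: str
    user_id: str
    channel: str                  # "push" | "sms" | "email" | "in_app"
    status: DeliveryStatus
    timestamp_ms: int
    provider: str | None = None   # e.g. "fcm", "twilio", "sendgrid"
    error: str | None = None      # populated only when status is FAILED

    def to_row(self) -> dict:
        """Flatten for insertion into a time-series store or warehouse table."""
        row = asdict(self)
        row["status"] = self.status.value
        return row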

Scaling Considerations

Horizontal Scaling of Workers

As load grows, we scale workers horizontally. With a message queue (Kafka, RabbitMQ), we can add more worker instances consuming from the same queue. The queue handles distribution.

Kafka partition strategy:

  • Partition by user_id for in-app notifications (preserves per-user ordering)
  • Partition by channel for push/SMS/email (no ordering requirement)
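
In client terms, this strategy reduces to choosing the message key, since Kafka hashes the key to pick a partition. A sketch with kafka-python (any client that supports keyed sends works the same way):

import json
from kafka import KafkaProducer          # kafka-python

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish(notification: dict, topic: str) -> None:
    if notification["channel"] == "in_app":
        key = notification["user_id"]    # same user -> same partition -> per-user ordering
    else:
        key = notification["channel"]    # push/SMS/email: ordering not required
    producer.send(topic, key=key, value=notification)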

Queue Partitioning by Type

Don’t put everything in one queue. Instead:

  • notifications.high-priority — for OTP, password reset
  • notifications.transactional — for order confirmations
  • notifications.marketing — for campaigns
  • notifications.digest — for batched emails

This prevents a spike in marketing notifications from starving critical security alerts.
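
Routing then becomes a small lookup from notification type to topic, with unmapped types defaulting to the transactional queue. The type names here are illustrative:

TOPIC_BY_TYPE = {
    "otp": "notifications.high-priority",
    "password_reset": "notifications.high-priority",
    "security_alert": "notifications.high-priority",
    "order_confirmation": "notifications.transactional",
    "shipment_update": "notifications.transactional",
    "marketing_campaign": "notifications.marketing",
    "digest_email": "notifications.digest",
}

def topic_for(notification_type: str) -> str:
    # Unmapped types land in the transactional queue rather than blocking the send
    return TOPIC_BY_TYPE.get(notification_type, "notifications.transactional")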

Channel Isolation

Each channel provider is independent:

  • If FCM (Firebase Cloud Messaging) is down, push notifications fail, but email and SMS continue
  • If Twilio is down, SMS fails, but push and email are unaffected

We can also use multiple providers for critical channels, with automatic failover (sketched after this list):

  • Primary: FCM, Fallback: OneSignal
  • Primary: Twilio, Fallback: AWS SNS
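
The failover itself can live in a thin wrapper around the provider clients. send_via_fcm and send_via_onesignal below are hypothetical adapter functions, not real SDK calls:

def send_push(notification: dict, send_via_fcm, send_via_onesignal) -> dict:
    """Try the primary push provider; fall back to the secondary if it errors."""
    providers = [("fcm", send_via_fcm), ("onesignal", send_via_onesignal)]
    last_error = None
    for name, send in providers:
        try:
            return {"provider": name, "result": send(notification)}
        except Exception as exc:          # outage, timeout, 5xx from the provider
            last_error = exc              # remember the error and try the next provider
    raise RuntimeError(f"all push providers failed: {last_error}")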

Batching and Digest

For non-urgent notifications, we can batch them:

  • Collect notifications for a user over 1 hour
  • Compile them into a digest email
  • Send once per hour instead of multiple emails

This reduces noise and cost.
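
A digest accumulator can be as simple as buffering per user and flushing on a timer. The sketch below keeps state in memory for clarity; in practice the pending buffer would live in Redis or a database so a worker restart doesn't lose it:

import time
from collections import defaultdict

class DigestBuffer:
    """Accumulate non-urgent notifications per user and flush them once per window."""

    def __init__(self, send_digest_email, window_seconds: int = 3600):
        self.pending = defaultdict(list)          # user_id -> buffered notifications
        self.send_digest_email = send_digest_email
        self.window_seconds = window_seconds
        self.last_flush = time.time()

    def add(self, user_id: str, notification: dict) -> None:
        self.pending[user_id].append(notification)

    def maybe_flush(self) -> None:
        if time.time() - self.last_flush < self.window_seconds:
            return
        for user_id, items in self.pending.items():
            self.send_digest_email(user_id, items)   # one email summarizing the window
        self.pending.clear()
        self.last_flush = time.time()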

Trade-offs and Design Decisions

| Decision | Push Model | Pull Model | Our Choice |
| --- | --- | --- | --- |
| Architecture | Push to user immediately | User pulls when ready | Push (most notifications are time-sensitive) |
| Provider reliability | Dependency on third-party uptime | Direct control | Accept third-party, use fallbacks |
| Cost | Pay per notification sent | Pay for storage/bandwidth | Optimize with smart batching |
| Latency | Real-time | Delayed | Push for real-time needs |

Pro tip: Start with a simple architecture (single queue, single worker type) and add complexity only when you hit bottlenecks. Many teams over-engineer notification systems from the start.

Key Takeaways

  1. Prioritization matters: Distinguish between critical (OTP) and non-critical (marketing) notifications. Use separate queues.
  2. User control is essential: Preferences, quiet hours, and rate limits matter because users keep notifications enabled only when they feel in control.
  3. Idempotency prevents duplicates: Use idempotency keys so retried requests never send the same notification twice.
  4. Multi-channel redundancy: If one provider fails, others should work. No single point of failure.
  5. Eventual consistency: Track delivery asynchronously. Real-time status updates are a nice-to-have, not a must-have.

Practice Exercise

Extend this design to support:

  • A/B testing: Send different message variants to different users, measure engagement.
  • Timezone-aware scheduling: A marketing campaign should arrive at 9 AM in each user’s local timezone.
  • Rate limiting per sender: Prevent one service from overwhelming the queue (token bucket or sliding window counter).

Next up: We move from one-to-many communication (notifications) to one-way broadcast of content (feeds). How do we design a news feed that serves billions of feed requests per day?