System Design · March 3, 2026 · 12 min read
Design a Notification System: System Design Interview Guide
Design multi-channel notifications with priority routing, rate limiting, and user preferences
Why Design a Notification System?
Notification system design comes up frequently in interviews because it touches on event-driven architecture, message queues, user preferences, and multi-channel delivery. Every large application has a notification system, and designing it well requires balancing reliability with user experience.
- Multi-channel delivery: Push, email, SMS, in-app — each with different constraints
- Priority and routing: Urgent alerts vs marketing — different SLAs
- Rate limiting: Don't spam users, respect preferences
- Reliability: Critical notifications (2FA, payment) must never be lost
Step 1: Requirements
Functional Requirements
Core features:
- Send notifications via multiple channels (push, email, SMS, in-app)
- User preference management (opt-in/out per channel per type)
- Priority levels (critical, high, medium, low)
- Template management (reusable notification templates)
- Delivery tracking and analytics
Out of scope:
- Content management / marketing campaigns
- A/B testing of notification content
- Rich media notifications (images, actions)
- Scheduling (send at user's local morning)
Non-Functional Requirements
Scale:
- 500M users
- 10B notifications per day
- Peak: 500K notifications/second
Performance:
- Critical notifications (2FA, alerts): < 5 seconds delivery
- Standard notifications: < 30 seconds
- Marketing: within 1 hour
Reliability:
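- Critical: at-least-once delivery with idempotent processing (effectively once), 99.99% success rate
- Standard: at-least-once with retries; marketing is best-effort
- No duplicate notifications to users (enforced by deduplication, since true exactly-once delivery is not practical across external providers)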
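As a quick sanity check on the scale figures above, the daily volume can be converted into an average per-second rate and compared with the stated peak. This is back-of-the-envelope arithmetic only; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60           # 86,400

users = 500_000_000                      # 500M users (from the requirements)
daily_notifications = 10_000_000_000     # 10B notifications per day
peak_per_second = 500_000                # stated peak

average_per_second = daily_notifications / SECONDS_PER_DAY
peak_to_average = peak_per_second / average_per_second
per_user_per_day = daily_notifications / users

print(f"average rate:   {average_per_second:,.0f}/s")
print(f"peak / average: {peak_to_average:.2f}x")
print(f"per user:       {per_user_per_day:.0f}/day")
```

The average works out to roughly 116K notifications per second, so the 500K/s peak is a little over 4x the mean, and each user receives about 20 notifications per day across all channels.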
Key insight: Not all notifications are equal.
Priority-based routing with different SLA guarantees.
Step 2: High-Level Architecture
┌──────────────────────────────────────────────────────────────┐
│                     NOTIFICATION SOURCES                     │
│     (Services that trigger notifications via API/events)     │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────┐
│                     NOTIFICATION SERVICE                     │
│          Validate, enrich, check preferences, route          │
└───────────────────────┬──────────────────────────────────────┘
                        │
          ┌─────────────┼─────────────┐
          │             │             │
          ▼             ▼             ▼
     ┌──────────┐  ┌──────────┐  ┌──────────┐
     │ CRITICAL │  │   HIGH   │  │   LOW    │
     │  QUEUE   │  │  QUEUE   │  │  QUEUE   │
     └────┬─────┘  └────┬─────┘  └────┬─────┘
          │             │             │
          ▼             ▼             ▼
      ┌──────────────────────────────────────────────────┐
      │                 DELIVERY WORKERS                 │
      │   (Channel-specific: push, email, SMS, in-app)   │
      └──────┬──────────┬──────────┬──────────┬──────────┘
             │          │          │          │
             ▼          ▼          ▼          ▼
         ┌────────┐ ┌────────┐ ┌────────┐ ┌─────────┐
         │  APNs  │ │ SES /  │ │ Twilio │ │WebSocket│
         │  FCM   │ │SendGrid│ │        │ │   SSE   │
         └────────┘ └────────┘ └────────┘ └─────────┘
Step 3: Notification Processing Pipeline
When a service triggers a notification:
1. VALIDATION
- Verify required fields (user_id, type, content)
- Check notification type exists in template registry
- Validate channel-specific requirements (email needs subject, etc.)
2. USER PREFERENCE CHECK
- Query user preferences: "Does this user want push for this type?"
- Check quiet hours (don't send at 3am unless critical)
- Check rate limits (max 3 marketing per day)
3. TEMPLATE RENDERING
- Load template for notification type
- Inject user-specific variables (name, data)
- Render per-channel variants (push is short, email is long)
4. DEDUPLICATION
- Hash: (user_id, notification_type, content_hash, time_window)
- If duplicate exists within window → skip
- Prevents "liked your post" x 50 in 1 minute
5. PRIORITY ROUTING
- Critical (2FA, payment) → critical queue, dedicated workers
- Standard (social, updates) → normal queue
- Marketing (promotions) → low-priority queue, rate-limited
6. CHANNEL DELIVERY
- Push: send to APNs (iOS) or FCM (Android)
- Email: send via SES/SendGrid
- SMS: send via Twilio
- In-app: write to user's notification inbox + WebSocket push
Step 4: Deep Dive - Priority Queues
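Before comparing the tiers, here is a minimal sketch of how the deduplication and priority-routing stages from Step 3 might hand a notification to a queue. Everything named here is an assumption for illustration: the `Router` class, the 60-second window, the type-to-queue mapping, and the in-memory set, which stands in for a shared store with expiring keys.

```python
import hashlib
import time

DEDUP_WINDOW_SECONDS = 60  # assumed window; would likely vary by notification type

# Assumed mapping from notification type to queue name.
QUEUE_FOR_TYPE = {
    "2fa": "critical",
    "payment": "critical",
    "social": "standard",
    "update": "standard",
    "promotion": "marketing",
}


def dedup_key(user_id, notif_type, content, now, window=DEDUP_WINDOW_SECONDS):
    """Hash of (user_id, notification_type, content_hash, time_window)."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    bucket = int(now // window)  # all events in the same window share a bucket
    raw = f"{user_id}|{notif_type}|{content_hash}|{bucket}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class Router:
    """Hypothetical in-memory router: dedup first, then pick a queue."""

    def __init__(self):
        self.seen = set()  # stand-in for a shared key store with a TTL
        self.queues = {"critical": [], "standard": [], "marketing": []}

    def submit(self, user_id, notif_type, content, now=None):
        now = time.time() if now is None else now
        key = dedup_key(user_id, notif_type, content, now)
        if key in self.seen:
            return None  # duplicate within the window: skip
        self.seen.add(key)
        queue = QUEUE_FOR_TYPE.get(notif_type, "standard")
        self.queues[queue].append((user_id, notif_type, content))
        return queue


router = Router()
print(router.submit("user_123", "social", "Alice liked your post", now=120.0))
print(router.submit("user_123", "social", "Alice liked your post", now=130.0))
print(router.submit("user_123", "2fa", "Your code is 481516", now=130.0))
```

One trade-off of bucketing time this way is that two identical events that straddle a bucket boundary are not treated as duplicates. If that matters, the key could be checked against the previous bucket as well.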
Different priorities need different treatment:
CRITICAL (2FA codes, payment alerts, security):
- Dedicated queue with dedicated workers
- No batching — process immediately
- Retry aggressively (3 retries, 1 second apart)
- Dead letter queue → page on-call if DLQ grows
- SLA: < 5 seconds, 99.99% delivery
HIGH (social interactions, comments, follows):
- Standard queue with auto-scaling workers
- Batch processing allowed (up to 100ms batches)
- Retry with backoff (3 retries, exponential)
- SLA: < 30 seconds, 99.9% delivery
LOW (marketing, weekly digests, recommendations):
- Low-priority queue, processed during off-peak
- Heavy rate limiting (max 3 per user per day)
- No retry on failure (best-effort)
- SLA: within 1 hour, 95% delivery
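The three tiers above can be captured as data so that workers share one retry routine. The sketch below is one possible encoding, not a prescribed one: `TierPolicy` and its fields are hypothetical names, and the 2-second base delay for the HIGH tier is an assumption, since the text only says "exponential".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    max_retries: int
    base_delay_s: float
    exponential: bool
    sla_seconds: int

    def retry_delays(self):
        """Delay before each retry attempt, in seconds."""
        if self.exponential:
            return [self.base_delay_s * 2 ** i for i in range(self.max_retries)]
        return [self.base_delay_s] * self.max_retries

    def retries_fit_sla(self):
        """True if waiting out every retry still finishes inside the SLA."""
        return sum(self.retry_delays()) < self.sla_seconds


POLICIES = {
    # 3 retries, 1 second apart, SLA under 5 seconds
    "critical": TierPolicy(max_retries=3, base_delay_s=1.0, exponential=False, sla_seconds=5),
    # 3 retries with exponential backoff (base delay is an assumption), SLA under 30 seconds
    "high": TierPolicy(max_retries=3, base_delay_s=2.0, exponential=True, sla_seconds=30),
    # best-effort: no retries, SLA within 1 hour
    "low": TierPolicy(max_retries=0, base_delay_s=0.0, exponential=False, sla_seconds=3600),
}

for name, policy in POLICIES.items():
    print(name, policy.retry_delays(), policy.retries_fit_sla())
```

Checking the retry budget against the SLA is useful in an interview: three retries one second apart consume 3 of the 5 seconds available to critical notifications, which leaves little room for slow provider calls.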
Implementation: Kafka with separate topics per priority
- notification.critical → 20 partitions, 20 consumers
- notification.standard → 50 partitions, auto-scaled consumers
- notification.marketing → 10 partitions, rate-limited consumers
Step 5: Rate Limiting
Rate limiting prevents notification fatigue:
Per-user limits:
- Critical: unlimited (safety-critical)
- Social: max 20 per hour, aggregate if exceeded
- Marketing: max 3 per day
Aggregation strategy:
When rate limit hit, aggregate instead of dropping:
- "Alice, Bob, and 12 others liked your post"
- Collect events in buffer, send aggregated after cooldown
Implementation:
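- Redis fixed-window counter per (user_id, notification_type); a true sliding window would need a sorted set of timestamps or a weighted pair of adjacent buckets
- Key: ratelimit:{user_id}:{type}:{hour_bucket}
- INCR on each notification, set a TTL on the key (EXPIRE) so old buckets expire on their own, check against limit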
- If exceeded: add to aggregation buffer
- Background job flushes aggregation buffers every 5 minutes
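The counter-plus-buffer flow above can be sketched without a Redis dependency. In this illustration a dictionary stands in for the Redis counters, and the class and function names (`RateLimiter`, `summarize`) are hypothetical. A production version would replace the dictionary increment with INCR on the same key and set an expiry on it.

```python
from collections import defaultdict

# (max notifications, window in seconds) per type; types not listed are unlimited.
LIMITS = {"social": (20, 3600), "marketing": (3, 86400)}


def summarize(actors, action="liked your post"):
    """Collapse many actors into one line, e.g. 'Alice, Bob, and 12 others ...'."""
    if len(actors) == 1:
        return f"{actors[0]} {action}"
    if len(actors) == 2:
        return f"{actors[0]} and {actors[1]} {action}"
    others = len(actors) - 2
    noun = "other" if others == 1 else "others"
    return f"{actors[0]}, {actors[1]}, and {others} {noun} {action}"


class RateLimiter:
    def __init__(self, limits):
        self.limits = limits
        self.counters = defaultdict(int)  # stand-in for Redis counters
        self.buffers = defaultdict(list)  # aggregation buffers per (user, type)

    def record(self, user_id, notif_type, actor, now):
        """Return 'send' if under the limit, else buffer the event."""
        limit = self.limits.get(notif_type)
        if limit is None:
            return "send"  # e.g. critical notifications are not limited
        max_count, window = limit
        bucket = int(now // window)
        key = f"ratelimit:{user_id}:{notif_type}:{bucket}"
        self.counters[key] += 1
        if self.counters[key] <= max_count:
            return "send"
        self.buffers[(user_id, notif_type)].append(actor)
        return "buffered"

    def flush(self, user_id, notif_type):
        """Called by the periodic background job; returns one aggregated message."""
        actors = self.buffers.pop((user_id, notif_type), [])
        return summarize(actors) if actors else None
```

A caller would send immediately on `"send"` and rely on the periodic flush job to deliver the aggregated message for anything that came back `"buffered"`.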
Global rate limits:
- Per-channel: respect provider limits (for example, FCM enforces per-project send quotas; check the provider's current documentation for exact numbers)
- Per-sender: prevent noisy services from starving others
- Implement token bucket at the delivery worker level
Step 6: Delivery Tracking
Track every notification through its lifecycle:
States:
CREATED → QUEUED → SENT → DELIVERED → READ → CLICKED
                    ↓
                  FAILED → RETRIED → SENT (or DEAD_LETTERED)
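One way to keep the lifecycle honest is to encode the states as an explicit transition table and reject anything else. The sketch below assumes the failure branch leaves from SENT and that a failed notification can be dead-lettered directly once retries are exhausted. Both are assumptions, as are the names `State`, `ALLOWED`, and `transition`.

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    QUEUED = "queued"
    SENT = "sent"
    DELIVERED = "delivered"
    READ = "read"
    CLICKED = "clicked"
    FAILED = "failed"
    RETRIED = "retried"
    DEAD_LETTERED = "dead_lettered"


# Allowed next states for each state (assumed reading of the lifecycle).
ALLOWED = {
    State.CREATED: {State.QUEUED},
    State.QUEUED: {State.SENT},
    State.SENT: {State.DELIVERED, State.FAILED},
    State.DELIVERED: {State.READ},
    State.READ: {State.CLICKED},
    State.FAILED: {State.RETRIED, State.DEAD_LETTERED},
    State.RETRIED: {State.SENT, State.DEAD_LETTERED},
    State.CLICKED: set(),
    State.DEAD_LETTERED: set(),
}


def transition(current, new):
    """Return the new state, or raise if the lifecycle does not allow the move."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


state = State.CREATED
for nxt in (State.QUEUED, State.SENT, State.FAILED, State.RETRIED, State.SENT, State.DELIVERED):
    state = transition(state, nxt)
print(state.name)
```

Rejecting illegal transitions at write time also guards against out-of-order provider webhooks, such as a late failure callback arriving after the notification has already been marked as read.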
For each state transition:
- Write event to analytics pipeline (Kafka → data warehouse)
- Update notification status in database
- Push real-time metrics to monitoring dashboard
Delivery receipts:
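- Push: APNs/FCM confirm that the provider accepted the message, not that the device displayed it; device-level delivery usually needs a client-side acknowledgment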
- Email: track opens (pixel tracking) and clicks (link wrapping)
- SMS: Twilio provides delivery status webhooks
- In-app: mark as read when user opens notification panel
Failure handling:
- Invalid device token → mark token as invalid, remove
- Bounced email → mark email as invalid after 3 bounces
- SMS delivery failure → retry once, then skip
- All channels failed → log for investigation
Step 7: User Preferences
User preference model:
preferences = {
  user_id: "user_123",
  channels: {
    push: { enabled: true, quiet_hours: "22:00-07:00" },
    email: { enabled: true, frequency: "instant" },
    sms: { enabled: false },
    in_app: { enabled: true }
  },
  types: {
    social: { push: true, email: false, sms: false },
    security: { push: true, email: true, sms: true },
    marketing: { push: false, email: true, sms: false },
    updates: { push: true, email: "weekly_digest", sms: false }
  }
}
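To make the model concrete, here is a hedged sketch of how a preference check might resolve the channels for one notification. It rests on several assumptions that the model above does not spell out: a disabled channel overrides any per-type setting, a channel missing from a type entry (such as in_app) defaults to allowed, non-boolean values like "weekly_digest" mean "do not send instantly", and the caller passes the user's local time. The function names are illustrative.

```python
from datetime import time

PREFERENCES = {
    "user_id": "user_123",
    "channels": {
        "push": {"enabled": True, "quiet_hours": "22:00-07:00"},
        "email": {"enabled": True, "frequency": "instant"},
        "sms": {"enabled": False},
        "in_app": {"enabled": True},
    },
    "types": {
        "social": {"push": True, "email": False, "sms": False},
        "security": {"push": True, "email": True, "sms": True},
        "marketing": {"push": False, "email": True, "sms": False},
        "updates": {"push": True, "email": "weekly_digest", "sms": False},
    },
}


def in_quiet_hours(local_now, spec):
    """True if local_now falls inside a 'HH:MM-HH:MM' range, which may wrap midnight."""
    start_text, end_text = spec.split("-")
    start, end = time.fromisoformat(start_text), time.fromisoformat(end_text)
    if start <= end:
        return start <= local_now < end
    return local_now >= start or local_now < end  # e.g. 22:00-07:00


def channels_to_send(prefs, notif_type, local_now, critical=False):
    """Channels that should receive this notification right now."""
    type_prefs = prefs["types"].get(notif_type, {})
    selected = []
    for channel, settings in prefs["channels"].items():
        if not settings.get("enabled", False):
            continue  # channel-level switch wins (assumption)
        if type_prefs.get(channel, True) is not True:
            continue  # False, or a digest setting handled elsewhere
        quiet = settings.get("quiet_hours")
        if quiet and not critical and in_quiet_hours(local_now, quiet):
            continue  # only critical notifications bypass quiet hours
        selected.append(channel)
    return selected


print(channels_to_send(PREFERENCES, "social", time(12, 0)))
print(channels_to_send(PREFERENCES, "social", time(23, 30)))
print(channels_to_send(PREFERENCES, "security", time(23, 30), critical=True))
```

The wrap-around branch in `in_quiet_hours` is the part most often gotten wrong: a range like 22:00-07:00 has a start later than its end, so a plain "start <= now < end" comparison would never match.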
Storage:
- PostgreSQL for user preferences (read-heavy, cache in Redis)
- Default preferences per notification type (fallback)
- Global opt-out takes precedence over everything
Quiet hours:
- Store in user's timezone
- Convert to UTC at send time
- Critical notifications bypass quiet hours
Key Takeaways for the Interview
- Not all notifications are equal: Priority-based routing with different SLAs is the core insight
- Rate limiting prevents fatigue: Aggregate instead of drop when limits are hit
- Deduplication matters: Hash-based dedup within time windows prevents spam
- User preferences are complex: Per-channel, per-type, with quiet hours and frequency options
- Track everything: Full lifecycle tracking enables debugging and optimization