System Design FAQ: Top Questions
39. How would you design a Notification System (Email, SMS, Push)?
A Notification System delivers alerts to users via email, SMS, or push notifications. It should be scalable and reliable, and it should support retries, templates, and per-user channel preferences.
📋 Functional Requirements
- Multi-channel delivery: Email, SMS, Push
- Retry and dead-letter queue for failures
- Templating engine with variables
- User-specific channel preference support
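The channel preference requirement can be sketched as a small lookup that picks the requested channel when the user allows it and otherwise falls back to another enabled channel. This is only an illustration: the `ChannelPreferences` record, its field names, and the fallback order are assumptions, not part of the design above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChannelPreferences:
    # Hypothetical per-user record: channels the user has opted into.
    enabled: set = field(default_factory=lambda: {"email", "sms", "push"})
    # Assumed order to try when the requested channel is disabled.
    fallback_order: tuple = ("push", "email", "sms")


def resolve_channel(requested: str, prefs: ChannelPreferences) -> Optional[str]:
    """Return the channel to use, or None if the user disabled every channel."""
    if requested in prefs.enabled:
        return requested
    for channel in prefs.fallback_order:
        if channel in prefs.enabled:
            return channel
    return None


email_only = ChannelPreferences(enabled={"email"})
print(resolve_channel("sms", email_only))    # falls back to email
print(resolve_channel("email", email_only))  # requested channel is allowed
```

Returning `None` lets the caller skip delivery entirely for users who opted out of everything, rather than raising an error.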
📦 Non-Functional Requirements
- At-least-once delivery
- Scalability (millions/day)
- Rate-limiting and deduplication
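At-least-once delivery means the same event can arrive more than once, so deduplication and rate limiting usually sit in front of the send step. The sketch below is one possible approach, assuming a content hash as the idempotency key and a token bucket per user or channel. The class names are hypothetical, and the in-memory dictionaries stand in for a shared store that a multi-worker deployment would need.

```python
import hashlib
import json
import time
from typing import Optional


def dedup_key(event: dict) -> str:
    """Stable hash of the event content, used as an idempotency key."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class Deduplicator:
    """Remembers keys for `ttl` seconds (in-memory stand-in for a shared store)."""

    def __init__(self, ttl: float = 3600.0):
        self.ttl = ttl
        self._seen = {}  # key -> expiry time

    def is_duplicate(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return True
        self._seen[key] = now + self.ttl
        return False


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


event = {"user_id": "u_456", "channel": "email", "template_id": "welcome"}
dedup = Deduplicator(ttl=60)
key = dedup_key(event)
first_seen = dedup.is_duplicate(key, now=0.0)   # first delivery
second_seen = dedup.is_duplicate(key, now=1.0)  # redelivery within the TTL

bucket = TokenBucket(rate=1.0, capacity=2, now=0.0)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
print(first_seen, second_seen, decisions)
```

Hashing the whole payload treats two identical events inside the TTL as one. If legitimate repeats are possible, the producer could attach an explicit event ID instead.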
🏗️ Core Components
- Event Producer: Business logic emits notification event
- Message Queue: Kafka or SQS decouples producer/consumer
- Worker Service: Picks messages, personalizes, sends
- Channel Provider: Email (SendGrid), SMS (Twilio), Push (FCM)
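The flow through these components can be sketched as a single handler that takes an event, renders the template, and dispatches to a channel-specific sender. Everything named here is hypothetical: the `TEMPLATES` store, the sender functions, and the use of `str.format` as a stand-in for a real templating engine. Real senders would call the provider clients (SendGrid, Twilio, FCM) instead of appending to a list.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical template store keyed by template_id.
TEMPLATES: Dict[str, str] = {
    "welcome": "Hi {name}, welcome! You signed up at {signup_time}.",
}

# Records what would have been sent: (channel, user_id, body).
sent: List[Tuple[str, str, str]] = []


def send_email(user_id: str, body: str) -> None:
    sent.append(("email", user_id, body))


def send_sms(user_id: str, body: str) -> None:
    sent.append(("sms", user_id, body))


def send_push(user_id: str, body: str) -> None:
    sent.append(("push", user_id, body))


# One sender per channel; adding a channel means adding one entry here.
SENDERS: Dict[str, Callable[[str, str], None]] = {
    "email": send_email,
    "sms": send_sms,
    "push": send_push,
}


def handle_event(event: dict) -> str:
    """Personalize the message for one event and hand it to its channel."""
    template = TEMPLATES[event["template_id"]]
    body = template.format(**event["vars"])
    SENDERS[event["channel"]](event["user_id"], body)
    return body


body = handle_event({
    "user_id": "u_456",
    "channel": "email",
    "template_id": "welcome",
    "vars": {"name": "Raj", "signup_time": "10:30 AM"},
})
print(body)
```

Keeping the senders behind a plain mapping means the worker logic does not change when a provider is swapped.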
📨 Kafka Notification Message Format
{
  "user_id": "u_456",
  "channel": "email",
  "template_id": "welcome",
  "vars": { "name": "Raj", "signup_time": "10:30 AM" }
}
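A consumer would typically validate this payload before acting on it, so that malformed events fail fast instead of reaching a provider. The sketch below assumes the four fields shown above and an allowed channel set of email, SMS, and push. The `NotificationEvent` type and `parse_event` helper are illustrative names, not part of any library.

```python
import json
from dataclasses import dataclass, field

ALLOWED_CHANNELS = {"email", "sms", "push"}
REQUIRED_FIELDS = {"user_id", "channel", "template_id"}


@dataclass(frozen=True)
class NotificationEvent:
    user_id: str
    channel: str
    template_id: str
    vars: dict = field(default_factory=dict)


def parse_event(raw: str) -> NotificationEvent:
    """Decode one queue message and reject it if it is malformed."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("event must be a JSON object")
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["channel"] not in ALLOWED_CHANNELS:
        raise ValueError(f"unsupported channel: {data['channel']}")
    return NotificationEvent(
        user_id=data["user_id"],
        channel=data["channel"],
        template_id=data["template_id"],
        vars=data.get("vars", {}),
    )


raw = (
    '{"user_id": "u_456", "channel": "email", "template_id": "welcome", '
    '"vars": {"name": "Raj", "signup_time": "10:30 AM"}}'
)
event = parse_event(raw)
print(event.channel, event.vars["name"])
```

Messages that raise `ValueError` here will never succeed on retry, so routing them straight to the dead-letter queue is usually more useful than retrying them.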
🧩 Templating with Jinja2 (Python)
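from jinja2 import Template

# Compile the template; {{ name }} and {{ signup_time }} are placeholders.
template = Template("Hi {{ name }}, welcome! You signed up at {{ signup_time }}.")
# Fill the placeholders with the values from the event's "vars" field.
msg = template.render(name="Raj", signup_time="10:30 AM")
print(msg)  # Hi Raj, welcome! You signed up at 10:30 AM.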
📲 Channel Provider Integration
- Email: SendGrid, SES — supports batching, templates
- SMS: Twilio — handles country-specific formats
- Push: FCM (Firebase Cloud Messaging)
♻️ Retry Logic with DLQ (AWS SQS)
MainQueue:
  RedrivePolicy:
    maxReceiveCount: 3
    deadLetterTargetArn: arn:aws:sqs:region:acct:DLQ
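The effect of this policy can be modeled in a few lines: a message that keeps failing is returned to the main queue until it has been received `maxReceiveCount` times, after which it lands in the dead-letter queue. The sketch below is an in-process simulation using plain deques, not the SQS API. The `drain` function, the message shape, and the failing handler are assumptions made for illustration.

```python
from collections import deque

MAX_RECEIVE_COUNT = 3  # mirrors maxReceiveCount in the policy above


def drain(main_queue: deque, dead_letter_queue: deque, handler) -> None:
    """Process messages until the main queue is empty.

    A failing message is requeued until it has been received
    MAX_RECEIVE_COUNT times, then it is moved to the dead-letter queue.
    """
    while main_queue:
        message = main_queue.popleft()
        message["receive_count"] = message.get("receive_count", 0) + 1
        try:
            handler(message["body"])
        except Exception as exc:
            if message["receive_count"] >= MAX_RECEIVE_COUNT:
                message["last_error"] = str(exc)
                dead_letter_queue.append(message)
            else:
                main_queue.append(message)


delivered = []


def flaky_handler(body: str) -> None:
    # Hypothetical handler: one payload always fails, the rest succeed.
    if body == "bad":
        raise RuntimeError("provider rejected message")
    delivered.append(body)


main_queue = deque([{"body": "good"}, {"body": "bad"}])
dlq = deque()
drain(main_queue, dlq, flaky_handler)
print(delivered, [m["body"] for m in dlq])
```

A production worker would normally add a delay between attempts, for example exponential backoff, so that a struggling provider is not hit again immediately.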
📄 Notification History Schema (PostgreSQL)
CREATE TABLE notification_log (
    id UUID PRIMARY KEY,
    user_id TEXT,
    channel TEXT,
    template_id TEXT,
    status TEXT,
    sent_at TIMESTAMP
);
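Writing one row per delivery attempt into this table might look like the sketch below. To stay self-contained it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL, so the UUID and timestamp are stored as text. The `log_notification` helper and the status values are assumptions.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# In-memory SQLite stands in for PostgreSQL here; column types are simplified.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE notification_log (
        id TEXT PRIMARY KEY,
        user_id TEXT,
        channel TEXT,
        template_id TEXT,
        status TEXT,
        sent_at TEXT
    )
    """
)


def log_notification(conn, user_id: str, channel: str, template_id: str, status: str) -> str:
    """Insert one history row and return its generated id."""
    log_id = str(uuid.uuid4())
    sent_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO notification_log "
        "(id, user_id, channel, template_id, status, sent_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (log_id, user_id, channel, template_id, status, sent_at),
    )
    conn.commit()
    return log_id


log_id = log_notification(conn, "u_456", "email", "welcome", "sent")
row = conn.execute(
    "SELECT user_id, channel, status FROM notification_log WHERE id = ?",
    (log_id,),
).fetchone()
print(row)
```

Parameterized queries keep user-supplied values out of the SQL text. A PostgreSQL driver uses the same pattern with its own placeholder style.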
📈 Observability
- Success/failure counts by channel
- Retry rate and DLQ volume
- Delivery latency histogram
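These metrics can be collected with a counter keyed by channel and status plus a bucketed latency histogram. The sketch below keeps everything in memory. The `Metrics` class and the bucket boundaries are assumptions, and a real deployment would more likely export the same numbers through a metrics library.

```python
from bisect import bisect_left
from collections import Counter

# Assumed bucket upper bounds in milliseconds; the last bucket is overflow.
LATENCY_BUCKETS_MS = [100, 500, 1000, 5000]


class Metrics:
    def __init__(self):
        self.outcomes = Counter()  # (channel, status) -> count
        self.latency_histogram = [0] * (len(LATENCY_BUCKETS_MS) + 1)

    def record(self, channel: str, status: str, latency_ms: float) -> None:
        """Count one delivery attempt and place its latency in a bucket."""
        self.outcomes[(channel, status)] += 1
        index = bisect_left(LATENCY_BUCKETS_MS, latency_ms)
        self.latency_histogram[index] += 1

    def failure_rate(self, channel: str) -> float:
        """Fraction of attempts on a channel that ended in 'failed'."""
        failed = self.outcomes[(channel, "failed")]
        total = sum(n for (ch, _), n in self.outcomes.items() if ch == channel)
        return failed / total if total else 0.0


metrics = Metrics()
metrics.record("email", "sent", 80)
metrics.record("email", "failed", 700)
metrics.record("sms", "sent", 9000)
print(metrics.outcomes, metrics.latency_histogram, metrics.failure_rate("email"))
```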
🧰 Tools/Infra Used
- Queue: Kafka, RabbitMQ, SQS
- Worker: Python/Golang, Celery, Sidekiq
- Email/SMS/Push: SendGrid or SES, Twilio, FCM
📌 Final Insight
A well-designed notification system sends messages reliably, personalizes them, and respects delivery constraints across channels. Logging, observability, and retries are critical to success.