System Design FAQ: Top Questions

39. How would you design a Notification System (Email, SMS, Push)?

A Notification System delivers alerts to users via email, SMS, or push notifications. It should be scalable and reliable, and it should support retries, templates, and per-user channel preferences.

📋 Functional Requirements

Multi-channel delivery: Email, SMS, Push
Retry and dead-letter queue for failures
Templating engine with variables
User-specific channel preference support
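As a minimal sketch of the channel preference requirement, the snippet below filters the channels requested for one notification against what the user has enabled. The `UserPreferences` class and the `resolve_channels` function are hypothetical names introduced for illustration; a real system would load preferences from a datastore.

```python
from dataclasses import dataclass, field

SUPPORTED_CHANNELS = ("email", "sms", "push")


@dataclass
class UserPreferences:
    """Hypothetical per-user settings; defaults to every channel enabled."""
    enabled: set = field(default_factory=lambda: set(SUPPORTED_CHANNELS))


def resolve_channels(requested, prefs):
    """Return the requested channels the user has enabled, in request order."""
    return [c for c in requested if c in SUPPORTED_CHANNELS and c in prefs.enabled]


prefs = UserPreferences(enabled={"email", "push"})
print(resolve_channels(["sms", "email", "push"], prefs))  # ['email', 'push']
```

Filtering at resolution time keeps the producer simple: it can request every channel and let the preference check decide what is actually sent.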

📦 Non-Functional Requirements

At-least-once delivery
Scalability (millions of notifications per day)
Rate-limiting and deduplication
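At-least-once delivery means the same event can reach a worker more than once, so the consumer side usually has to deduplicate. The sketch below pairs a simple in-memory deduplicator with a token-bucket rate limiter. Both classes are hypothetical and single-process; hashing the payload to build the key is an assumption, and a shared store would be needed once several workers run in parallel.

```python
import hashlib
import json


def dedup_key(event):
    """Derive a stable key from the event payload (hypothetical scheme)."""
    canonical = json.dumps(event, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Deduplicator:
    """Remembers processed keys; an in-memory stand-in for a shared store."""

    def __init__(self):
        self._seen = set()

    def first_time(self, event):
        key = dedup_key(event)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


class TokenBucket:
    """Allows bursts of up to `capacity` sends, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate, now=0.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


event = {"user_id": "u_456", "channel": "email", "template_id": "welcome"}
dedup = Deduplicator()
print(dedup.first_time(event), dedup.first_time(event))  # True False

bucket = TokenBucket(capacity=2, rate=1.0)
print([bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 1.0)])  # [True, True, False, True]
```

A producer-supplied event ID would make a more robust key than a payload hash, since a user can legitimately receive the same template twice.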

🏗️ Core Components

Event Producer: Business logic emits a notification event
Message Queue: Kafka or SQS decouples producers from consumers
Worker Service: Consumes messages, renders the template with user data, and sends via the channel provider
Channel Provider: Email (SendGrid), SMS (Twilio), Push (FCM)
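The flow through the components above can be sketched as a worker that drains a queue and hands each event to a channel-specific sender. Here `queue.Queue` stands in for Kafka or SQS, and the sender functions are stubs that record the call instead of contacting SendGrid, Twilio, or FCM. All names are hypothetical.

```python
import queue

sent = []  # records what the stub senders were asked to deliver


def send_email(user_id, body):
    sent.append(("email", user_id, body))


def send_sms(user_id, body):
    sent.append(("sms", user_id, body))


def send_push(user_id, body):
    sent.append(("push", user_id, body))


# Channel name -> sender; a real system would wrap provider SDK clients here.
SENDERS = {"email": send_email, "sms": send_sms, "push": send_push}


def run_worker(events):
    """Drain the queue, dispatching each event to the sender for its channel."""
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break
        sender = SENDERS.get(event["channel"])
        if sender is None:
            # Unknown channel: skipped here; a real worker would log or dead-letter it.
            continue
        sender(event["user_id"], event["body"])


events = queue.Queue()
events.put({"user_id": "u_456", "channel": "email", "body": "Hi Raj, welcome!"})
events.put({"user_id": "u_456", "channel": "fax", "body": "ignored"})
run_worker(events)
print(sent)  # [('email', 'u_456', 'Hi Raj, welcome!')]
```

Keeping the senders behind a lookup table means a new channel can be added without touching the worker loop.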

📨 Kafka Notification Message Format


{
  "user_id": "u_456",
  "channel": "email",
  "template_id": "welcome",
  "vars": { "name": "Raj", "signup_time": "10:30 AM" }
}
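A consumer would typically validate this payload before acting on it. The following is a minimal sketch, assuming the four fields shown above. The `NotificationEvent` dataclass and the `parse_event` function are hypothetical names, not part of any Kafka client API.

```python
import json
from dataclasses import dataclass, field

ALLOWED_CHANNELS = {"email", "sms", "push"}


@dataclass(frozen=True)
class NotificationEvent:
    user_id: str
    channel: str
    template_id: str
    vars: dict = field(default_factory=dict)


def parse_event(raw):
    """Parse a JSON message; raise ValueError if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("event must be a JSON object")
    missing = [k for k in ("user_id", "channel", "template_id") if k not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["channel"] not in ALLOWED_CHANNELS:
        raise ValueError(f"unsupported channel: {data['channel']}")
    return NotificationEvent(
        data["user_id"], data["channel"], data["template_id"], data.get("vars", {})
    )


raw = (
    '{"user_id": "u_456", "channel": "email", "template_id": "welcome", '
    '"vars": {"name": "Raj", "signup_time": "10:30 AM"}}'
)
event = parse_event(raw)
print(event.channel, event.vars["name"])  # email Raj
```

Rejecting bad messages at the edge keeps malformed events from consuming retries further down the pipeline.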

🧩 Templating with Jinja2 (Python)


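from jinja2 import Template

# Compile the template once; {{ name }} and {{ signup_time }} are placeholders.
template = Template("Hi {{ name }}, welcome! You signed up at {{ signup_time }}.")

# Fill the placeholders with the values from the event's "vars" field.
msg = template.render(name="Raj", signup_time="10:30 AM")
print(msg)  # Hi Raj, welcome! You signed up at 10:30 AM.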

📲 Channel Provider Integration

Email: SendGrid, SES — supports batching, templates
SMS: Twilio — handles country-specific formats
Push: FCM (Firebase Cloud Messaging)

♻️ Retry Logic with DLQ (AWS SQS)


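# CloudFormation-style fragment
MainQueue:
  Type: AWS::SQS::Queue
  Properties:
    RedrivePolicy:
      # After 3 failed receives, SQS moves the message to the DLQ.
      maxReceiveCount: 3
      deadLetterTargetArn: arn:aws:sqs:region:acct:DLQ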
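To make the redrive behavior concrete, here is a minimal in-process sketch of the same idea: a message that keeps failing is retried up to a receive limit and then set aside in a dead-letter list. The `process_with_redrive` function and the in-memory queues are hypothetical stand-ins. This is not the AWS API, only an illustration of the semantics.

```python
from collections import deque

MAX_RECEIVE_COUNT = 3  # mirrors maxReceiveCount above


def process_with_redrive(messages, handler, max_receive_count=MAX_RECEIVE_COUNT):
    """Run handler on each message; return (delivered, dead_letter) lists."""
    main = deque((m, 0) for m in messages)
    delivered, dead_letter = [], []
    while main:
        message, receives = main.popleft()
        receives += 1
        try:
            handler(message)
            delivered.append(message)
        except Exception:
            if receives >= max_receive_count:
                # Receive limit reached: park the message for inspection.
                dead_letter.append(message)
            else:
                # Put it back on the main queue for another attempt.
                main.append((message, receives))
    return delivered, dead_letter


attempts = {}


def handler(message):
    attempts[message] = attempts.get(message, 0) + 1
    if message == "bad":
        raise RuntimeError("provider rejected message")
    if message == "flaky" and attempts[message] < 2:
        raise RuntimeError("temporary failure")


delivered, dead = process_with_redrive(["ok", "flaky", "bad"], handler)
print(delivered, dead)  # ['ok', 'flaky'] ['bad']
```

A transient failure recovers on a later attempt, while a message that can never succeed stops blocking the queue and lands where it can be examined.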

📄 Notification History Schema (PostgreSQL)


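CREATE TABLE notification_log (
  id UUID PRIMARY KEY,
  user_id TEXT NOT NULL,
  channel TEXT NOT NULL,       -- 'email', 'sms', or 'push'
  template_id TEXT NOT NULL,
  status TEXT NOT NULL,        -- e.g. 'sent' or 'failed'
  sent_at TIMESTAMPTZ          -- time zone aware; NULL until the send completes
);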

📈 Observability

Success/failure counts by channel
Retry rate and DLQ volume
Delivery latency histogram
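The metrics listed above could be collected along these lines. This is an in-memory sketch with hypothetical names (`NotificationMetrics`, `BUCKETS`) and arbitrary bucket boundaries; a production system would more likely export these through a metrics library.

```python
import bisect
from collections import Counter

# Upper bounds of the latency buckets in seconds; the last bucket is unbounded.
BUCKETS = [0.1, 0.5, 1.0, 5.0]


class NotificationMetrics:
    def __init__(self):
        self.outcomes = Counter()  # (channel, status) -> count
        self.latency = [0] * (len(BUCKETS) + 1)

    def record(self, channel, status, latency_s):
        self.outcomes[(channel, status)] += 1
        # bisect_left finds the first bucket whose upper bound is >= latency_s.
        self.latency[bisect.bisect_left(BUCKETS, latency_s)] += 1

    def failure_rate(self, channel):
        ok = self.outcomes[(channel, "sent")]
        failed = self.outcomes[(channel, "failed")]
        total = ok + failed
        return failed / total if total else 0.0


m = NotificationMetrics()
m.record("email", "sent", 0.08)
m.record("email", "failed", 2.3)
m.record("sms", "sent", 0.4)
print(m.failure_rate("email"))  # 0.5
print(m.latency)  # [1, 1, 0, 1, 0]
```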

🧰 Tools/Infra Used

Queue: Kafka, RabbitMQ, SQS
Worker: Python/Golang, Celery, Sidekiq
Providers: SendGrid (email), Twilio (SMS), FCM (push)

📌 Final Insight

A well-designed notification system sends messages reliably, personalizes them, and respects delivery constraints across channels. Logging, observability, and retries are critical to success.
