System Design FAQ: Top Questions
9. How would you design a Real-Time Chat System (like WhatsApp or Slack)?
A Real-Time Chat System facilitates instant message exchange between users or groups with high reliability, low latency, and support for multimedia, typing indicators, and delivery acknowledgments.
📋 Functional Requirements
- 1:1 and group messaging
- Message delivery & read receipts
- Typing indicators
- Online/offline presence
📦 Non-Functional Requirements
- Low latency (<100 ms message delivery)
- High availability and fault tolerance
- Scalable to millions of concurrent connections
🏗️ Architecture Overview
- Frontend: Web/Mobile clients using WebSockets or long-polling
- Gateway: Handles auth, routing, and user sessions
- Message Broker: Kafka/PubSub to decouple producers and consumers
- Chat Service: Core messaging logic
- Message Store: DB for chat history (Cassandra, MongoDB)
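The flow between these components can be sketched in a single process. This is a minimal illustration, not a production layout: an in-memory EventEmitter stands in for Kafka/PubSub, an array stands in for the message store, and the names gatewayReceive, startChatService, and the 'chat.incoming' topic are hypothetical.

```javascript
const { EventEmitter } = require('events');

// In-memory stand-ins (assumptions): a real deployment would use Kafka/PubSub
// for the broker and Cassandra/MongoDB for the message store.
const broker = new EventEmitter();
const messageStore = [];

// Gateway: rejects unauthenticated sessions, then publishes to the broker
// instead of calling the chat service directly.
function gatewayReceive(session, message) {
  if (!session.authenticated) {
    throw new Error('unauthenticated session');
  }
  broker.emit('chat.incoming', { ...message, sender_id: session.userId });
}

// Chat service: consumes from the broker, persists the message, then hands
// it to the supplied deliver callback.
function startChatService(deliver) {
  broker.on('chat.incoming', (message) => {
    messageStore.push(message);
    deliver(message);
  });
}
```

Because the gateway only publishes to the broker, the chat service can be restarted or scaled without the gateway knowing about it, which is the decoupling the Message Broker bullet refers to.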
📤 WebSocket Communication (Node.js)
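const WebSocket = require('ws');

// Start a WebSocket server on port 8080.
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', function connection(ws) {
  ws.on('message', function incoming(message) {
    // ws v8+ delivers messages as a Buffer, so convert to a string first.
    const text = message.toString();
    console.log('received:', text);
    ws.send('echo: ' + text);
  });

  // Greet the client once the connection is open.
  ws.send('Welcome to chat server!');
});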
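An echo server only talks back to the sender. A chat server also needs to route a message to the recipient's connection. The sketch below shows one way to do that on a single server process, assuming a hypothetical in-memory registry keyed by user ID; register, unregister, and routeMessage are illustrative names, and a multi-server deployment would need shared routing state instead.

```javascript
// userId -> socket. Any object with a send(string) method works here, which
// includes the sockets produced by the ws library.
const connections = new Map();

function register(userId, socket) {
  connections.set(userId, socket);
}

function unregister(userId) {
  connections.delete(userId);
}

// Returns true if the message was written to a live socket, false if the
// recipient has no open connection on this server.
function routeMessage(message) {
  const socket = connections.get(message.receiver_id);
  if (!socket) return false;
  socket.send(JSON.stringify(message));
  return true;
}
```

In the server above, register would be called after authentication in the 'connection' handler and unregister in the socket's 'close' handler.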
🗃️ Message Schema (MongoDB)
{
  "message_id": "uuid",
  "sender_id": "user123",
  "receiver_id": "user456",
  "timestamp": "2024-06-10T18:00:00Z",
  "message_type": "text", // or image, video
  "payload": "Hello there!",
  "status": "delivered" // or sent/read
}
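A document in this shape could be built by a small helper like the one below. This is a sketch under assumptions: createMessage and MESSAGE_TYPES are hypothetical names, new messages start in the "sent" status, and the timestamp includes milliseconds because that is what Date.prototype.toISOString produces.

```javascript
const { randomUUID } = require('crypto');

// Allowed values for message_type, taken from the schema comment.
const MESSAGE_TYPES = ['text', 'image', 'video'];

// Builds a message document with a fresh UUID and an ISO 8601 UTC timestamp.
function createMessage(senderId, receiverId, messageType, payload, now = new Date()) {
  if (!MESSAGE_TYPES.includes(messageType)) {
    throw new Error(`unsupported message_type: ${messageType}`);
  }
  return {
    message_id: randomUUID(),
    sender_id: senderId,
    receiver_id: receiverId,
    timestamp: now.toISOString(),
    message_type: messageType,
    payload,
    status: 'sent', // assumption: every new message starts as "sent"
  };
}
```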
📈 Delivery Semantics
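- Send → Ack → Delivered → Read: Phased state tracking; the server acks each message it accepts, then the status advances to delivered and read as the recipient's client reports back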
- Message Queue: Retry logic for offline users
- Store and forward: Buffer undelivered messages in Redis or Kafka
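The state tracking and store-and-forward ideas above can be sketched together. This is a minimal illustration with assumptions: a Map stands in for Redis or Kafka as the buffer, the function names are hypothetical, and a successful send is treated as delivery, whereas a real system would wait for the recipient client's ack.

```javascript
const STATUS_ORDER = ['sent', 'delivered', 'read'];

// Moves a message forward only. Duplicate or out-of-order updates (for
// example a late "delivered" after "read") leave the status unchanged.
function advanceStatus(message, nextStatus) {
  const current = STATUS_ORDER.indexOf(message.status);
  const next = STATUS_ORDER.indexOf(nextStatus);
  if (next === -1) throw new Error(`unknown status: ${nextStatus}`);
  if (next > current) message.status = nextStatus;
  return message.status;
}

// receiverId -> messages waiting for that user to come back online.
const pending = new Map();

// Sends immediately if the recipient is online, otherwise buffers.
function deliverOrBuffer(message, isOnline, send) {
  if (isOnline(message.receiver_id)) {
    send(message);
    advanceStatus(message, 'delivered'); // simplification: no client ack
    return true;
  }
  const queue = pending.get(message.receiver_id) || [];
  queue.push(message);
  pending.set(message.receiver_id, queue);
  return false;
}

// Called when a user reconnects: drains their buffer in arrival order and
// returns how many messages were flushed.
function flushPending(receiverId, send) {
  const queue = pending.get(receiverId) || [];
  pending.delete(receiverId);
  for (const message of queue) {
    send(message);
    advanceStatus(message, 'delivered');
  }
  return queue.length;
}
```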
☁️ Redis for Presence Management
SET user:123:online 1 EX 30
GET user:123:online → 1 (online) or nil (offline)
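The key expires after 30 seconds, so a client only stays "online" while it keeps refreshing the key with a heartbeat. The sketch below mimics that behavior in memory so it can run without a Redis server; it is a stand-in for the two commands above, not a Redis client, and heartbeat, isOnline, and PRESENCE_TTL_MS are hypothetical names.

```javascript
const PRESENCE_TTL_MS = 30 * 1000; // mirrors EX 30

// userId -> expiry time in milliseconds.
const expiresAt = new Map();

// Plays the role of: SET user:<id>:online 1 EX 30
// Each call pushes the expiry 30 seconds past the given time.
function heartbeat(userId, nowMs = Date.now()) {
  expiresAt.set(userId, nowMs + PRESENCE_TTL_MS);
}

// Plays the role of: GET user:<id>:online
// A missing or expired entry means the user is offline.
function isOnline(userId, nowMs = Date.now()) {
  const expiry = expiresAt.get(userId);
  if (expiry === undefined) return false;
  if (nowMs >= expiry) {
    expiresAt.delete(userId);
    return false;
  }
  return true;
}
```

A heartbeat interval well under the TTL (for example every 10 seconds, an assumed value) tolerates a dropped heartbeat without flipping the user to offline.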
👥 Group Chat Considerations
- Fan-out to all members using Kafka or topic queues
- Deduplicate deliveries (for example by message_id), since retries can send the same message to a member more than once
- Limit max group size or shard large groups
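Fan-out with deduplication can be sketched as follows. This is an in-memory illustration with hypothetical names (groups, seen, fanOut); a real system would fan out through Kafka or topic queues as noted above, and would expire the per-user set of seen IDs rather than let it grow without bound.

```javascript
const groups = new Map(); // groupId -> Set of member userIds
const seen = new Map();   // userId -> Set of message_ids already delivered

// Delivers a message to every group member except the sender, skipping
// members who already received this message_id. Returns the number of
// deliveries made by this call.
function fanOut(groupId, message, deliver) {
  const members = groups.get(groupId) || new Set();
  let deliveredCount = 0;
  for (const userId of members) {
    if (userId === message.sender_id) continue;
    let delivered = seen.get(userId);
    if (!delivered) {
      delivered = new Set();
      seen.set(userId, delivered);
    }
    if (delivered.has(message.message_id)) continue; // duplicate, skip
    delivered.add(message.message_id);
    deliver(userId, message);
    deliveredCount += 1;
  }
  return deliveredCount;
}
```

Calling fanOut twice with the same message, as a retry would, delivers nothing the second time.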
📊 Observability
- Track message latency (send → delivered → read)
- Monitor WebSocket uptime and errors
- Log undelivered messages for recovery
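Message latency is usually reported as percentiles rather than an average, since a few slow deliveries matter more than the mean. A minimal sketch, assuming each event carries hypothetical sentAt and deliveredAt timestamps in milliseconds and using the nearest-rank percentile method:

```javascript
// Nearest-rank percentile: the smallest value such that at least p percent
// of the samples are less than or equal to it. Returns null for no samples.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Send-to-delivered latency for every event that has both timestamps.
// Events still missing deliveredAt are undelivered and are left out.
function deliveryLatenciesMs(events) {
  return events
    .filter((e) => e.sentAt !== undefined && e.deliveredAt !== undefined)
    .map((e) => e.deliveredAt - e.sentAt);
}
```

The same two functions apply to the delivered-to-read interval by swapping in the corresponding timestamps.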
🔐 Security
- Token-based WebSocket authentication (JWT)
- Encrypt messages at rest and in transit
- Rate limiting to prevent spam and abuse
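Per-user rate limiting is often implemented as a token bucket. The sketch below is one such approach under assumed parameters; the TokenBucket class and its capacity and refill values are illustrative, and a multi-server deployment would need to keep this state somewhere shared.

```javascript
// Token bucket: each message costs one token, tokens refill at a steady
// rate, and the bucket never holds more than `capacity` tokens. This allows
// short bursts while capping the sustained message rate.
class TokenBucket {
  constructor(capacity, refillPerSecond, nowMs = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  // Returns true and spends a token if the message is allowed,
  // false if the sender should be throttled.
  tryConsume(nowMs = Date.now()) {
    if (nowMs > this.lastRefillMs) {
      const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
      this.tokens = Math.min(
        this.capacity,
        this.tokens + elapsedSeconds * this.refillPerSecond
      );
      this.lastRefillMs = nowMs;
    }
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

One bucket per user (or per connection) would be checked before a message is accepted; messages that fail the check can be dropped or answered with an error frame.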
📌 Final Insight
A real-time chat system is a classic case of low-latency, stateful infrastructure. Resilience comes from decoupling components with pub/sub messaging, tracking message state transitions, and falling back to the database when live delivery fails.
