System Design FAQ: Top Questions
27. How would you design a Webhook Delivery System?
A Webhook Delivery System sends HTTP callbacks (webhooks) to third-party endpoints when specific events occur. It needs to deliver reliably, retry failed attempts, sign payloads so receivers can validate them, and expose enough observability to debug failures.
📋 Functional Requirements
- Register URLs for events
- Trigger webhooks asynchronously
- Support retries with backoff
- Allow secure delivery via HMAC
📦 Non-Functional Requirements
- At-least-once delivery, with idempotent receivers that deduplicate on the event ID to approximate exactly-once processing
- Low delivery latency
- Failure isolation and DLQ handling
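Because at-least-once delivery can send the same event more than once, receivers usually deduplicate on the event `id`. The following is a minimal sketch of that idea, assuming an in-memory store and a hypothetical `createDeduper` helper (neither is part of the design above; a production receiver would more likely rely on a database unique key or a cache with expiry):

```javascript
// Hypothetical helper: remembers event IDs that were already processed.
// Assumption: an in-memory Set is enough for illustration; it is lost on restart.
function createDeduper() {
  const seen = new Set();
  return {
    // Returns true the first time an event ID is seen, false for duplicates.
    shouldProcess(eventId) {
      if (seen.has(eventId)) return false;
      seen.add(eventId);
      return true;
    },
  };
}

const deduper = createDeduper();
const first = deduper.shouldProcess("evt_9876");  // first delivery
const second = deduper.shouldProcess("evt_9876"); // redelivery of the same event
```

Here `first` is `true` and `second` is `false`, so a retried delivery of `evt_9876` would be acknowledged without being processed twice.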
🏗️ Core Components
- Subscription Registry: Stores webhook URLs per customer
- Dispatcher: Queues and sends event payloads
- Retry Manager: Implements backoff and DLQ
- Signer/Validator: Signs outgoing payloads with HMAC so receivers can verify the signature
- Audit Store: Logs delivery attempts
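To make the Subscription Registry more concrete, here is a small in-memory sketch. The class name, method names, and the `cust_1` customer ID are all hypothetical; a real registry would sit on top of a database rather than a `Map`.

```javascript
// Hypothetical in-memory Subscription Registry: maps event types to the
// webhook URLs each customer registered for them.
class SubscriptionRegistry {
  constructor() {
    this.byEvent = new Map(); // eventType -> Map(customerId -> url)
  }

  // Register (or replace) a customer's URL for one event type.
  register(customerId, eventType, url) {
    if (!this.byEvent.has(eventType)) this.byEvent.set(eventType, new Map());
    this.byEvent.get(eventType).set(customerId, url);
  }

  // Return every { customerId, url } target subscribed to an event type.
  targetsFor(eventType) {
    const subs = this.byEvent.get(eventType) || new Map();
    return [...subs].map(([customerId, url]) => ({ customerId, url }));
  }
}

const registry = new SubscriptionRegistry();
registry.register("cust_1", "invoice.paid", "https://client.example.com/hook");
const targets = registry.targetsFor("invoice.paid");
```

The Dispatcher would call something like `targetsFor` when an event arrives and enqueue one delivery job per target, so that a slow endpoint for one customer does not block the others.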
📨 Sample Webhook Payload
POST /webhook HTTP/1.1
Host: client.example.com
X-Signature: sha256=6bf9e5...
Content-Type: application/json

{
  "event": "invoice.paid",
  "id": "evt_9876",
  "data": {
    "invoice_id": "inv_1234",
    "amount": 4200,
    "currency": "USD"
  }
}
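On the sending side, the `X-Signature` header could be produced roughly as follows. This is a sketch under assumptions: the header is taken to be `sha256=` plus the hex HMAC-SHA256 of the exact body bytes, and `signPayload` and the `whsec_example` secret are hypothetical names used only for illustration.

```javascript
const crypto = require("crypto");

// Assumption: the header value is "sha256=" followed by the hex HMAC-SHA256
// of the exact request body, keyed with a per-subscription secret.
function signPayload(rawBody, secret) {
  const digest = crypto.createHmac("sha256", secret).update(rawBody).digest("hex");
  return "sha256=" + digest;
}

// Serialize once and sign that exact string; re-serializing later could
// change key order or whitespace and invalidate the signature.
const body = JSON.stringify({ event: "invoice.paid", id: "evt_9876" });
const headers = {
  "Content-Type": "application/json",
  "X-Signature": signPayload(body, "whsec_example"), // hypothetical secret
};
```

The same `body` string that was signed should be the one written to the HTTP request, since the receiver recomputes the HMAC over the bytes it actually receives.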
🔐 Signature Verification (Node.js)
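const crypto = require("crypto");

// payload must be the raw request body (string or Buffer), not re-serialized JSON.
function verifySignature(payload, header, secret) {
  const hmac = crypto.createHmac("sha256", secret);
  hmac.update(payload);
  const expected = Buffer.from("sha256=" + hmac.digest("hex"));
  const received = Buffer.from(header || "");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false;
  return crypto.timingSafeEqual(received, expected);
}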
📤 Retry Strategy
- Immediate retry (1st failure), then exponential backoff (2s, 10s, 60s…)
- After 5 failures → move to DLQ with reason
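The strategy above can be sketched as a small decision function. This reads the list as: retry immediately after the 1st failure, wait 2s, 10s, and 60s after the 2nd, 3rd, and 4th, and treat the 5th failure as terminal. That reading, along with the `nextAction` helper and its return shape, is an assumption made for illustration.

```javascript
// Delays before each retry, read from the strategy above: an immediate retry
// after the 1st failure, then 2s, 10s, and 60s. Assumption: the 5th failure
// is terminal and goes to the DLQ.
const RETRY_DELAYS_MS = [0, 2000, 10000, 60000];
const MAX_FAILURES = 5;

// Hypothetical helper: decide what to do after the Nth consecutive failure
// (failureCount starts at 1).
function nextAction(failureCount, reason) {
  if (failureCount >= MAX_FAILURES) {
    return { action: "dlq", reason };
  }
  return { action: "retry", delayMs: RETRY_DELAYS_MS[failureCount - 1] };
}
```

For example, `nextAction(3, "Timeout")` would schedule a retry after 10000 ms, while `nextAction(5, "Timeout")` would route the event to the DLQ with the reason attached. Production systems commonly add random jitter to each delay so that many failed deliveries do not retry at the same instant.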
📁 DLQ Storage Schema
{
  "webhook_id": "evt_9876",
  "url": "https://client.example.com/hook",
  "status": "failed",
  "retries": 5,
  "error": "Timeout",
  "last_attempt": "2025-06-11T13:01:00Z"
}
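As an illustration of how the Retry Manager might produce such a record, the sketch below builds one from a failed delivery. `toDlqRecord` and its parameter order are hypothetical; only the field names come from the schema above.

```javascript
// Hypothetical helper: build a DLQ record with the fields shown above.
function toDlqRecord(eventId, url, retries, error, lastAttempt) {
  return {
    webhook_id: eventId,
    url,
    status: "failed",
    retries,
    error,
    // Store the timestamp as a UTC ISO 8601 string without milliseconds.
    last_attempt: lastAttempt.toISOString().replace(/\.\d{3}Z$/, "Z"),
  };
}

const record = toDlqRecord(
  "evt_9876",
  "https://client.example.com/hook",
  5,
  "Timeout",
  new Date("2025-06-11T13:01:00Z")
);
```

Keeping the failure reason and last attempt time on the record makes it possible to filter the DLQ by error type and to replay entries once the receiving endpoint has recovered.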
📈 Metrics to Track
- Successful deliveries / failures
- Retry rate and DLQ growth
- Delivery latency percentiles
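To show what "latency percentiles" means in practice, here is a small nearest-rank percentile sketch over recorded delivery latencies. The `percentile` helper and the sample values are made up for illustration; a monitoring stack would normally compute these from histograms rather than raw arrays.

```javascript
// Nearest-rank percentile over recorded delivery latencies (milliseconds).
function percentile(latenciesMs, p) {
  if (latenciesMs.length === 0) return undefined;
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

const samples = [120, 80, 450, 95, 2300]; // made-up sample latencies
const p50 = percentile(samples, 50); // 120
const p95 = percentile(samples, 95); // 2300
```

The gap between `p50` and `p95` here shows why averages alone can hide slow deliveries: one 2300 ms outlier dominates the tail while the median stays low.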
🧰 Tools/Infra Used
- Queue: Kafka / RabbitMQ / SQS
- Dispatcher: FastAPI / Node.js + Axios
- DB: PostgreSQL / DynamoDB
- Monitoring: Prometheus + Grafana
📌 Final Insight
A robust webhook system handles retries, signature verification, and alerting. Delivery should be asynchronous and resilient, with strong observability and a DLQ for failure analysis. HMAC signatures let receivers confirm that each payload is authentic and unmodified.
