System Design FAQ: Top Questions
17. How would you design a Logging and Monitoring System?
A Logging and Monitoring System collects, stores, queries, and visualizes logs and metrics to aid in debugging, observability, and real-time alerting for distributed applications.
📋 Functional Requirements
- Collect structured and unstructured logs
- Visualize metrics over time
- Set up alerting thresholds and anomaly detection
- Enable search across distributed systems
📦 Non-Functional Requirements
- High write throughput and availability
- Log retention and archiving
- Secure and role-based access
🏗️ System Components
- Log Shippers: Fluent Bit, Fluentd, Logstash, Vector
- Metrics Exporters: Prometheus Node Exporter, custom collectors
- Storage Layer: Loki, Elasticsearch, InfluxDB, VictoriaMetrics
- Dashboard UI: Grafana, Kibana
📂 Log Format (JSON)
{
  "timestamp": "2025-06-11T13:00:00Z",
  "level": "ERROR",
  "service": "billing-service",
  "message": "Payment failed",
  "user_id": "u5678"
}
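A service could emit entries in this format with Go's standard `encoding/json` package. The following is a minimal sketch, not a prescribed logging library: the `LogEntry` type and the `NewLogEntry` and `Encode` helpers are hypothetical names introduced for illustration, and `user_id` is assumed to be optional.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// LogEntry mirrors the JSON log format shown above.
// The type name is an assumption made for this sketch.
type LogEntry struct {
	Timestamp string `json:"timestamp"`
	Level     string `json:"level"`
	Service   string `json:"service"`
	Message   string `json:"message"`
	UserID    string `json:"user_id,omitempty"` // assumed optional; omitted when empty
}

// NewLogEntry (hypothetical helper) stamps an entry with a UTC RFC 3339 timestamp.
func NewLogEntry(at time.Time, level, service, message, userID string) LogEntry {
	return LogEntry{
		Timestamp: at.UTC().Format(time.RFC3339),
		Level:     level,
		Service:   service,
		Message:   message,
		UserID:    userID,
	}
}

// Encode renders the entry as a single line of JSON, which is the shape
// a log shipper's JSON parser expects.
func (e LogEntry) Encode() (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	at := time.Date(2025, time.June, 11, 13, 0, 0, 0, time.UTC)
	entry := NewLogEntry(at, "ERROR", "billing-service", "Payment failed", "u5678")
	line, err := entry.Encode()
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

Writing one JSON object per line keeps each entry independently parseable, so a shipper tailing the file does not need multi-line handling.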
📦 Prometheus Exporter (Go)
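// requestCount counts handled requests by method and path (the metric name here is illustrative).
requestCount := prometheus.NewCounterVec(prometheus.CounterOpts{Name: "http_requests_total", Help: "Total HTTP requests."}, []string{"method", "path"})
prometheus.MustRegister(requestCount)                 // register once at startup
requestCount.WithLabelValues("GET", "/api/pay").Inc() // call from the request handler
http.Handle("/metrics", promhttp.Handler())           // endpoint that Prometheus scrapes
// Imports: net/http, github.com/prometheus/client_golang/prometheus, github.com/prometheus/client_golang/prometheus/promhttp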
🔧 Fluent Bit Config Example
[INPUT]
    Name   tail
    Path   /var/log/app/*.log
    Tag    app.logs
    Parser json
[OUTPUT]
    Name  es
    Match *
    Host  elasticsearch.local
    Port  9200
📈 Grafana Dashboard Setup
- Data sources: Loki (logs), Prometheus (metrics)
- Panels: Errors per second, request latency histogram
- Alerts: If error count > 5 in 1 min, notify Slack
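The alert rule above ("error count > 5 in 1 min") amounts to a sliding-window threshold check. In practice Grafana or the metrics backend evaluates it; the sketch below only illustrates the logic. `ErrorWindow`, `NewErrorWindow`, and `Record` are hypothetical names, timestamps are plain Unix seconds, and events are assumed to arrive in non-decreasing time order.

```go
package main

import "fmt"

// ErrorWindow keeps the timestamps (Unix seconds) of recent errors.
// It assumes Record is called with non-decreasing timestamps.
type ErrorWindow struct {
	windowSec int64   // length of the sliding window in seconds
	threshold int     // alert fires when the count exceeds this value
	events    []int64 // error timestamps still inside the window
}

// NewErrorWindow builds a window of windowSec seconds with the given threshold.
func NewErrorWindow(windowSec int64, threshold int) *ErrorWindow {
	return &ErrorWindow{windowSec: windowSec, threshold: threshold}
}

// Record registers an error at time now and reports whether the number of
// errors in the last windowSec seconds now exceeds the threshold.
func (w *ErrorWindow) Record(now int64) bool {
	w.events = append(w.events, now)

	// Drop events that have fallen out of the window (at or before the cutoff).
	cutoff := now - w.windowSec
	i := 0
	for i < len(w.events) && w.events[i] <= cutoff {
		i++
	}
	w.events = w.events[i:]

	return len(w.events) > w.threshold
}

func main() {
	w := NewErrorWindow(60, 5) // more than 5 errors within 60 seconds
	for _, ts := range []int64{0, 5, 10, 15, 20, 25, 200} {
		if w.Record(ts) {
			fmt.Printf("t=%ds: threshold exceeded, notify Slack\n", ts)
		} else {
			fmt.Printf("t=%ds: ok\n", ts)
		}
	}
}
```

With these inputs only the sixth error (t=25s) crosses the threshold; by t=200s the earlier errors have expired and the count drops back to one.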
🛡️ Access Control
- Read/write policies via Grafana teams
- JWT or OAuth integration for secure auth
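The read/write policy idea can be sketched as a lookup from team to resource to permission level. This is not Grafana's API, only an illustration of the check a role-based layer performs after authentication; `Permission`, `Policy`, `Allowed`, and the team and folder names are all hypothetical.

```go
package main

import "fmt"

// Permission is an ordered access level; a higher value implies the lower ones.
type Permission int

const (
	Read  Permission = iota + 1 // may view dashboards and query logs
	Write                       // may also edit dashboards and alert rules
)

// Policy maps team -> folder -> granted permission.
// Missing entries yield the zero value, which grants nothing.
type Policy map[string]map[string]Permission

// Allowed reports whether any of the user's teams holds at least the
// wanted permission on the folder.
func (p Policy) Allowed(teams []string, folder string, want Permission) bool {
	for _, team := range teams {
		if p[team][folder] >= want {
			return true
		}
	}
	return false
}

func main() {
	policy := Policy{
		"sre":     {"billing-logs": Write},
		"support": {"billing-logs": Read},
	}
	// Team membership would normally come from the claims of a verified token.
	fmt.Println(policy.Allowed([]string{"support"}, "billing-logs", Read))
	fmt.Println(policy.Allowed([]string{"support"}, "billing-logs", Write))
}
```

Defaulting to "no access" for unknown teams or folders keeps the policy fail-closed.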
📚 Retention Policy
- Logs stay in hot, searchable storage for 7 days and move to an S3 archive after 30 days
- Archived (cold) logs are cheaper to keep but slower to query; expect query latency above 5 s
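A retention job would typically map each log segment's age to a storage tier. The sketch below uses the 7-day and 30-day boundaries from the policy above. It assumes that logs between 7 and 30 days old sit in an intermediate "warm" tier and that "after 30 days" means an age of 30 days or more; neither detail is stated in the policy. `Tier` and `TierFor` are hypothetical names.

```go
package main

import "fmt"

// Tier names a storage class for log data.
type Tier string

const (
	Hot     Tier = "hot"     // fully indexed, fast queries
	Warm    Tier = "warm"    // assumed intermediate tier (not named in the policy)
	Archive Tier = "archive" // S3 cold storage, slow queries
)

// TierFor returns the storage tier for a log segment of the given age in days.
func TierFor(ageDays int) Tier {
	switch {
	case ageDays < 7:
		return Hot
	case ageDays < 30:
		return Warm
	default:
		return Archive
	}
}

func main() {
	for _, age := range []int{1, 10, 45} {
		fmt.Printf("%d days old -> %s\n", age, TierFor(age))
	}
}
```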
📌 Final Insight
Modern observability stacks collect logs, metrics, and traces through separate pipelines but unify them at the dashboard layer. Use structured logs for richer filtering, and add high-cardinality labels (such as user IDs) to metrics sparingly, since every distinct label value creates a new time series.
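To see why cardinality matters, note that the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below works through that arithmetic; `SeriesCount` and the label counts are illustrative assumptions, not measurements.

```go
package main

import "fmt"

// SeriesCount returns the worst-case number of time series for a metric:
// the product of the number of distinct values each label can take.
// A metric with no labels has exactly one series.
func SeriesCount(labelCardinalities map[string]int) int {
	total := 1
	for _, n := range labelCardinalities {
		total *= n
	}
	return total
}

func main() {
	labels := map[string]int{"method": 5, "path": 20, "status": 6}
	fmt.Println("bounded labels:", SeriesCount(labels)) // 5 * 20 * 6 = 600

	// Adding a per-user label multiplies every existing series.
	labels["user_id"] = 100000
	fmt.Println("with user_id:", SeriesCount(labels)) // 600 * 100000 = 60000000
}
```

This is an upper bound, since not every label combination occurs in practice, but it shows how a single unbounded label can dominate storage cost. Per-user detail usually belongs in logs rather than metric labels.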
