System Design FAQ: Top Questions

17. How would you design a Logging and Monitoring System?

A Logging and Monitoring System collects, stores, queries, and visualizes logs and metrics to aid in debugging, observability, and real-time alerting for distributed applications.

📋 Functional Requirements

Collect structured and unstructured logs
Visualize metrics over time
Set up alerting thresholds and anomaly detection
Enable search across distributed systems

📦 Non-Functional Requirements

High write throughput and availability
Log retention and archiving
Secure and role-based access

🏗️ System Components

Log Shippers: Fluent Bit, Fluentd, Logstash, Vector
Metrics Exporters: Prometheus Node Exporter, custom collectors
Storage Layer: Loki, Elasticsearch, InfluxDB, VictoriaMetrics
Dashboard UI: Grafana, Kibana

📂 Log Format (JSON)


{
  "timestamp": "2025-06-11T13:00:00Z",
  "level": "ERROR",
  "service": "billing-service",
  "message": "Payment failed",
  "user_id": "u5678"
}
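
As a rough illustration of how a service might emit this format, here is a minimal Go sketch using only the standard library. The `LogEntry` type and the `NewLogEntry` and `Encode` helpers are hypothetical names chosen for this example, and treating `user_id` as optional is an assumption rather than part of the format above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// LogEntry mirrors the JSON log format shown above.
// Field order here determines the key order in the encoded output.
type LogEntry struct {
	Timestamp string `json:"timestamp"`
	Level     string `json:"level"`
	Service   string `json:"service"`
	Message   string `json:"message"`
	UserID    string `json:"user_id,omitempty"` // assumption: omitted when empty
}

// NewLogEntry builds an entry with an RFC 3339 timestamp in UTC.
func NewLogEntry(now time.Time, level, service, message, userID string) LogEntry {
	return LogEntry{
		Timestamp: now.UTC().Format(time.RFC3339),
		Level:     level,
		Service:   service,
		Message:   message,
		UserID:    userID,
	}
}

// Encode renders the entry as a single JSON line, which is what a
// tail-based log shipper with a JSON parser expects to read.
func Encode(e LogEntry) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	now := time.Date(2025, 6, 11, 13, 0, 0, 0, time.UTC)
	line, err := Encode(NewLogEntry(now, "ERROR", "billing-service", "Payment failed", "u5678"))
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

Writing one JSON object per line keeps parsing on the shipper side simple and makes every field filterable downstream.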

📦 Prometheus Exporter (Go)


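// Needs net/http plus the prometheus and promhttp packages from github.com/prometheus/client_golang.
// requestCount is assumed to be a *prometheus.CounterVec with two labels (method, path).
prometheus.MustRegister(requestCount)                 // register once at startup
http.Handle("/metrics", promhttp.Handler())           // expose /metrics for Prometheus to scrape
requestCount.WithLabelValues("GET", "/api/pay").Inc() // increment inside the request handler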

🔧 Fluent Bit Config Example


[INPUT]
  Name tail
  Path /var/log/app/*.log
  Tag app.logs
  Parser json

[OUTPUT]
  Name  es
  Match *
  Host  elasticsearch.local
  Port  9200

📈 Grafana Dashboard Setup

Data sources: Loki or Elasticsearch (logs), Prometheus (metrics)
Panels: Errors per second, request latency histogram
Alerts: Notify Slack when more than 5 errors occur within 1 minute
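
The alert rule above (more than 5 errors in 1 minute) is normally written as a query in the alerting tool, but the underlying logic can be sketched as a sliding window. The following Go example is only an illustration of that logic: `ErrorWindow`, `NewErrorWindow`, and `Record` are hypothetical names, timestamps are plain Unix seconds, and events are assumed to arrive in non-decreasing time order.

```go
package main

import "fmt"

// ErrorWindow counts error events inside a sliding time window.
type ErrorWindow struct {
	windowSec int64   // window length in seconds
	threshold int     // alert fires when the count exceeds this value
	events    []int64 // Unix timestamps of errors still inside the window
}

// NewErrorWindow creates a window of the given length and threshold.
func NewErrorWindow(windowSec int64, threshold int) *ErrorWindow {
	return &ErrorWindow{windowSec: windowSec, threshold: threshold}
}

// Record adds an error at time ts, drops events that have aged out of the
// window, and reports whether the alert condition currently holds.
// Assumes ts values are passed in non-decreasing order.
func (w *ErrorWindow) Record(ts int64) bool {
	w.events = append(w.events, ts)
	cutoff := ts - w.windowSec
	i := 0
	for i < len(w.events) && w.events[i] <= cutoff {
		i++
	}
	w.events = w.events[i:]
	return len(w.events) > w.threshold
}

func main() {
	w := NewErrorWindow(60, 5) // more than 5 errors within 60 seconds
	for i := int64(0); i < 6; i++ {
		ts := i * 5
		fmt.Printf("error at t=%ds, firing=%v\n", ts, w.Record(ts))
	}
}
```

In this run only the sixth error trips the condition, since that is the first moment the window holds more than 5 events. A real deployment would send the Slack notification from the alerting system rather than from application code.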

🛡️ Access Control

Read/write permissions managed through Grafana teams and roles
JWT or OAuth integration for authentication

📚 Retention Policy

Logs retained in hot, indexed storage for 7 days, kept in lower-cost storage until day 30, then archived to S3
Archived (cold) data stays queryable, but expect query latency above 5s
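
One possible reading of this policy is a three-tier lifecycle: hot storage for the first 7 days, an intermediate tier until day 30, and S3 archive afterward. The Go sketch below encodes that reading. The tier names, the `TierForAge` function, and the exact handling of the day-7 and day-30 boundaries are assumptions made for illustration.

```go
package main

import "fmt"

// Tier names are hypothetical labels for the storage classes.
type Tier string

const (
	TierHot     Tier = "hot"     // indexed, fast to query
	TierWarm    Tier = "warm"    // assumed intermediate tier between day 7 and day 30
	TierArchive Tier = "archive" // S3 cold storage, slower queries
)

const (
	hotDays          = 7  // days a log stays in hot storage
	archiveAfterDays = 30 // age at which a log moves to the archive
)

// TierForAge returns the storage tier for a log that is ageDays old.
func TierForAge(ageDays int) Tier {
	switch {
	case ageDays < hotDays:
		return TierHot
	case ageDays < archiveAfterDays:
		return TierWarm
	default:
		return TierArchive
	}
}

func main() {
	for _, age := range []int{0, 6, 7, 29, 30, 90} {
		fmt.Printf("age=%d days -> %s\n", age, TierForAge(age))
	}
}
```

In practice the storage backend usually enforces this through its own lifecycle settings, so a function like this is mainly useful for reasoning about where a query will land.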

📌 Final Insight

Modern observability stacks store logs, metrics, and traces separately but unify them at the dashboard layer. Use structured logs for richer filtering, and be cautious about adding high-cardinality labels (such as user IDs) to metrics, since every distinct label combination creates its own time series.
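
To see why high-cardinality dimensions need caution, it helps to estimate how many time series a metric can produce. The Go sketch below multiplies the number of distinct values per label to get an upper bound. The `SeriesUpperBound` function and the label counts are hypothetical figures chosen for illustration.

```go
package main

import "fmt"

// SeriesUpperBound returns the maximum number of time series a single metric
// can produce, given how many distinct values each label can take.
// The real count is often lower because not every combination occurs.
func SeriesUpperBound(distinctValues map[string]int) int {
	total := 1
	for _, n := range distinctValues {
		total *= n
	}
	return total
}

func main() {
	// Hypothetical label cardinalities for a request counter.
	bounded := map[string]int{"method": 5, "path": 20, "status": 6}
	withUser := map[string]int{"method": 5, "path": 20, "status": 6, "user_id": 100000}

	fmt.Println(SeriesUpperBound(bounded))  // 600
	fmt.Println(SeriesUpperBound(withUser)) // 60000000
}
```

A single unbounded label such as a user ID multiplies the series count by its cardinality, which is why such fields usually belong in logs rather than in metric labels.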
