Observability & Logging: Scenario-Based Questions
68. How do you manage logging and log storage at scale in distributed systems?
Logging is critical for debugging and observability, but high-volume systems can easily generate terabytes of logs daily. Scalable, cost-effective strategies are key to making logs useful and sustainable.
📦 Key Challenges
- Storage costs and retention compliance.
- Log noise vs. signal: drowning in debug output.
- Search and aggregation speed.
🧰 Typical Architecture
- Log Forwarders: Fluentd, Filebeat, Vector, CloudWatch Agents.
- Ingestion Pipelines: Kafka → Logstash, or Kinesis Data Firehose (see the producer sketch after this list).
- Storage: Elasticsearch/OpenSearch, Loki, BigQuery, S3 + Athena.
- Visualization: Grafana, Kibana, Datadog Logs, Splunk.
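To make the ingestion hop concrete, here is a minimal sketch of a service shipping structured log events to Kafka, assuming the `kafka-python` client and a topic named `app-logs` (both illustrative choices; in production a forwarder like Fluentd or Vector usually does this instead of application code):

```python
import json
import socket
from datetime import datetime, timezone

from kafka import KafkaProducer  # kafka-python client (assumed installed)

# Serialize each event as JSON so downstream consumers
# (Logstash, OpenSearch, etc.) can index fields without regex parsing.
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],  # assumed broker address
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def ship_log(level, message, **context):
    """Send one structured log event to the ingestion topic."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        "host": socket.gethostname(),
        **context,  # e.g., service, region, requestID
    }
    producer.send("app-logs", value=event)  # hypothetical topic name

ship_log("error", "payment failed", service="checkout", requestID="r-123")
producer.flush()  # block until buffered events are delivered
```

Kafka here acts as a buffer between bursty producers and slower indexing backends, which is what absorbs the pipeline backpressure mentioned under the pitfalls below.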
📝 Logging Practices
- Use structured logs (JSON): easier to parse and search (see the sketch after this list).
- Log with context: userID, requestID, tenantID.
- Tag logs by service, region, environment.
- Avoid logging sensitive data (PII, secrets).
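A minimal sketch of these practices using only Python's standard `logging` module (field names such as `requestID` and `userID` are illustrative, not a fixed schema):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Context passed via `extra=` becomes attributes on the record.
            "requestID": getattr(record, "requestID", None),
            "userID": getattr(record, "userID", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Context rides along as fields instead of being interpolated into the
# message text, so the log backend can filter on requestID directly.
logger.info("order placed", extra={"requestID": "r-123", "userID": "u-42"})
```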
📉 Cost Controls
- Log sampling and aggregation (e.g., keep all errors plus a sampled fraction of info logs; see the filter sketch after this list).
- Short retention for verbose logs (e.g., 7d for debug, 90d for errors).
- Cold storage tiering (e.g., S3 Glacier for long-term audit trails).
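One common way to implement the "errors plus sampled info" rule is at the application edge, before logs ever reach the pipeline; a sketch using a stdlib `logging.Filter` (the 1% rate is an arbitrary example):

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Keep every WARNING-and-above record, but only a sampled
    fraction of INFO/DEBUG records, to cut ingestion volume."""

    def __init__(self, sample_rate=0.01):
        super().__init__()
        self.sample_rate = sample_rate  # e.g., keep ~1% of verbose logs

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings, errors, or fatals
        return random.random() < self.sample_rate

logger = logging.getLogger("checkout")
logger.addFilter(SamplingFilter(sample_rate=0.01))
```

Sampling at the source is cheaper than filtering in the backend, but record the sample rate somewhere so dashboards can scale counts back up.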
✅ Best Practices
- Define log levels clearly (debug, info, warn, error, fatal).
- Include correlation IDs to trace requests across services (sketch after this list).
- Auto-archive or delete old logs based on retention rules.
- Alert on error spikes via log queries.
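For the correlation-ID practice, here is a minimal sketch of propagation inside one Python service, using `contextvars` so the ID follows a request across threads and asyncio tasks (the request handler and header convention are hypothetical):

```python
import contextvars
import logging
import uuid

# Holds the correlation ID for the current request; safe to read
# from any thread or asyncio task handling that request.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Stamp every record with the active correlation ID."""

    def filter(self, record):
        record.correlationID = correlation_id.get()
        return True

logger = logging.getLogger("checkout")
logger.addFilter(CorrelationFilter())

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(correlationID)s %(levelname)s %(message)s")
)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request(inbound_id=None):
    # Reuse the caller's ID if it sent one (e.g., via an HTTP header),
    # otherwise mint a new one; forward it on any outbound calls.
    correlation_id.set(inbound_id or str(uuid.uuid4()))
    logger.info("handling request")

handle_request()  # logs: <uuid> INFO handling request
```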
🚫 Common Pitfalls
- High cardinality fields (e.g., raw UUIDs in metrics/log labels).
- Sending unstructured logs into query-optimized backends.
- Not alerting on log ingestion failures or pipeline backpressure.
🔑 Final Insight
Logs are your system's memory, but memory must be filtered, organized, and managed. Build pipelines that make logs actionable, affordable, and searchable in real time.