Observability & SRE: Scenario-Based Questions

86. What's the difference between observability and monitoring, and why does it matter?

Monitoring tells you when something breaks. Observability helps you understand why. Both are essential, but observability is more than tooling: it is a mindset and a system design principle.

📈 Monitoring

Collects predefined metrics and sets static thresholds
Detects known failures and raises alerts
Examples: CPU > 80%, HTTP 500 errors, latency spikes
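
A static threshold check like the examples above can be sketched in a few lines. This is a minimal illustration, not how any particular monitoring tool works: `ThresholdRule`, `evaluate`, and the metric names `cpu_percent` and `http_500_per_min` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRule:
    """A static alert rule: fire when the metric value exceeds the limit."""
    metric: str
    limit: float


def evaluate(sample: dict[str, float], rules: list[ThresholdRule]) -> list[str]:
    """Return one alert message per rule whose threshold is exceeded."""
    alerts = []
    for rule in rules:
        value = sample.get(rule.metric)
        # A metric missing from the sample is skipped rather than guessed at.
        if value is not None and value > rule.limit:
            alerts.append(f"{rule.metric}={value} exceeds {rule.limit}")
    return alerts


# Hypothetical rules and a single scraped sample.
rules = [ThresholdRule("cpu_percent", 80.0), ThresholdRule("http_500_per_min", 5.0)]
sample = {"cpu_percent": 91.5, "http_500_per_min": 2.0}
alerts = evaluate(sample, rules)
print(alerts)
```

The check only catches failures someone anticipated when writing the rules, which is the limitation observability is meant to address.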

🔍 Observability

Designing systems so internal states can be inferred from external outputs
Enables root cause analysis for unknown-unknowns
Requires structured logs, metrics that can be sliced by high-cardinality fields (e.g., user ID, build version), and traces
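
One way to picture debugging an unknown-unknown is slicing the same events by whichever field turns out to matter. The sketch below assumes events are plain dictionaries with a `status` field; the function name `error_breakdown` and the fields `region`, `app_version`, and `customer_id` are hypothetical.

```python
from collections import Counter


def error_breakdown(events: list[dict], dimension: str) -> Counter:
    """Count failed events (status >= 500) per value of an arbitrary field."""
    return Counter(
        e[dimension] for e in events if dimension in e and e["status"] >= 500
    )


# Hypothetical events carrying several fields that nobody alerted on in advance.
events = [
    {"status": 200, "region": "eu", "app_version": "2.3.1", "customer_id": "c-1"},
    {"status": 500, "region": "eu", "app_version": "2.4.0", "customer_id": "c-2"},
    {"status": 500, "region": "us", "app_version": "2.4.0", "customer_id": "c-3"},
    {"status": 200, "region": "us", "app_version": "2.3.1", "customer_id": "c-4"},
    {"status": 500, "region": "us", "app_version": "2.4.0", "customer_id": "c-5"},
]

by_region = error_breakdown(events, "region")
by_version = error_breakdown(events, "app_version")
print(by_region)
print(by_version)
```

In this made-up data the errors are spread across regions but all fall on one `app_version`, a question no predefined threshold would have asked.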

🧰 Core Pillars of Observability

Logs: Structured, queryable events with context
Metrics: Quantitative time-series data (latency, RPS, memory)
Traces: Distributed flow of a single request across services
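
As an illustration of the first pillar, the sketch below emits one JSON object per log line using only the Python standard library. The `JsonFormatter` class, the `fields` key passed through `extra`, and the sample values are assumptions made for this example, not part of any specific logging stack.

```python
import contextvars
import io
import json
import logging

# Holds the current request's ID so every log line can include it.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Extra context passed as extra={"fields": {...}} is merged in.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

request_id_var.set("req-123")  # hypothetical ID assigned at the edge
logger.info("payment authorized", extra={"fields": {"latency_ms": 42, "order_id": "A17"}})

event = json.loads(stream.getvalue())
print(event)
```

Because each line is structured and carries `request_id`, a log backend can filter on any field and join the line to the matching trace or metric exemplar.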

🛠️ Tools

Prometheus + Grafana for metrics
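ELK stack or Loki for log storage and search, with a shipper such as Fluent Bit to collect and forward logs
OpenTelemetry for instrumentation, with Jaeger or Zipkin as the tracing backend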

✅ Best Practices

Correlate logs, metrics, and traces using request IDs
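Use RED metrics (Rate, Errors, Duration) for request-driven services and USE metrics (Utilization, Saturation, Errors) for resources such as CPU, memory, and disks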
Expose custom business metrics (e.g., orders/minute)
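
The RED metrics mentioned above can be derived from raw request records. The following is a rough sketch under stated assumptions: `Request` and `red_summary` are hypothetical names, any 5xx status counts as an error, and the 95th percentile uses the nearest-rank method. A real metrics system would typically aggregate histograms instead of sorting raw samples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int         # HTTP status code
    duration_ms: float  # time taken to serve the request


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Compute Rate, Errors, and Duration for one observation window."""
    if not requests:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    n = len(durations)
    # Nearest-rank p95: ceil(0.95 * n), done in integers to avoid float rounding.
    rank = max(1, (95 * n + 99) // 100)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": n / window_seconds,
        "error_ratio": errors / n,
        "p95_ms": durations[rank - 1],
    }


# 20 hypothetical requests over a 10-second window; the slowest one failed.
requests = [Request(200, float(10 * i)) for i in range(1, 20)]
requests.append(Request(500, 200.0))
summary = red_summary(requests, window_seconds=10.0)
print(summary)
```

A business metric such as orders/minute follows the same pattern: count the relevant events in a window and divide by the window length.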

🚫 Common Pitfalls

Relying on dashboards alone, with no alerting, so failures go unnoticed unless someone happens to be watching
Too many alerts → fatigue and ignored warnings
Storing unstructured logs, which are hard to query or correlate

📌 Final Insight

Observability isn’t just about tools — it’s about insight. Build systems that let you ask “what’s happening and why?” even for failures you’ve never seen before.
