Observability & SRE: Scenario-Based Questions
86. What's the difference between observability and monitoring, and why does it matter?
Monitoring tells you when something breaks. Observability helps you understand why. Both are essential, but observability is a mindset and a system design principle.
Monitoring
- Collects predefined metrics and sets static thresholds
- Detects known failures and raises alerts
- Examples: CPU > 80%, HTTP 500 errors, latency spikes
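As a rough illustration of threshold-based monitoring, the sketch below checks one metrics sample against static limits. The `Threshold` class, the `check_thresholds` function, and the metric names (`cpu_percent`, `http_500_per_min`) are hypothetical names chosen for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # hypothetical metric name, e.g. "cpu_percent"
    limit: float  # alert when the sampled value exceeds this limit


def check_thresholds(sample: dict, rules: list) -> list:
    """Return one alert message per rule whose static limit is exceeded."""
    alerts = []
    for rule in rules:
        value = sample.get(rule.metric)
        # Metrics missing from the sample are skipped rather than alerted on.
        if value is not None and value > rule.limit:
            alerts.append(f"{rule.metric}={value} exceeds {rule.limit}")
    return alerts


# Assumed limits for illustration only: CPU above 80%, more than 5 HTTP 500s per minute.
RULES = [Threshold("cpu_percent", 80.0), Threshold("http_500_per_min", 5.0)]
```

A check like this only catches the failures someone thought to write a rule for, which is the limitation observability is meant to address.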
Observability
- Designing systems so internal states can be inferred from external outputs
- Enables root cause analysis for unknown-unknowns
- Requires structured logs, high-cardinality metrics, and traces
Core Pillars of Observability
- Logs: Structured, queryable events with context
- Metrics: Quantitative time-series data (latency, RPS, memory)
- Traces: Distributed flow of a single request across services
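To make the "structured, queryable events with context" idea concrete, here is a minimal sketch that emits each log record as a JSON object using only Python's standard `logging` and `json` modules. The field names (`request_id`, `service`), the `JsonFormatter` class, and the `build_logger` helper are assumptions made for this example; a real system would follow its own schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
            "service": getattr(record, "service", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (hypothetical helper)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger().info("payment authorized", extra={"request_id": "req-123", "service": "checkout"})` then produces a line that a log backend can filter by `request_id` instead of relying on text search.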
Tools
- Prometheus + Grafana for metrics
- ELK stack or Loki for logs, with Fluent Bit for collection and forwarding
- OpenTelemetry for instrumentation, with Jaeger or Zipkin as tracing backends
Best Practices
- Correlate logs, metrics, and traces using request IDs
- Use RED (Rate, Errors, Duration) metrics for services and USE (Utilization, Saturation, Errors) metrics for resources to monitor health
- Expose custom business metrics (e.g., orders/minute)
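The RED metrics mentioned above can be sketched as a small calculation over request records. This is only an illustration under stated assumptions: the `RequestEvent` shape, the `red_summary` function, the choice to count status codes of 500 and above as errors, and the nearest-rank p95 are all choices made for this example rather than a standard definition.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestEvent:
    request_id: str     # the same ID would appear in logs and traces for correlation
    status: int         # HTTP status code
    duration_ms: float  # request duration in milliseconds


def red_summary(events: list, window_seconds: float) -> dict:
    """Compute Rate, Errors, and Duration (p95) for one time window."""
    if not events or window_seconds <= 0:
        raise ValueError("need at least one event and a positive window")
    durations = sorted(e.duration_ms for e in events)
    # Nearest-rank percentile: the smallest value covering 95% of requests.
    rank = math.ceil(0.95 * len(durations))
    errors = sum(1 for e in events if e.status >= 500)
    return {
        "rate_rps": len(events) / window_seconds,
        "error_ratio": errors / len(events),
        "p95_ms": durations[rank - 1],
    }
```

In practice a metrics system such as Prometheus would compute these from counters and histograms; the point of the sketch is only to show what each of the three numbers measures.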
Common Pitfalls
- Over-relying on dashboards without alerting
- Too many alerts, which leads to fatigue and ignored warnings
- Storing unstructured logs, which are hard to query or correlate
Final Insight
Observability isn't just about tools; it's about insight. Build systems that let you ask "what's happening and why?" even for failures you've never seen before.
